In my last two posts, I wrote about the limitations of one method for tracking memes and the promise of a second method. That latter method, in brief, was to tag each meme-containing post with a unique text string, which would let you use a search engine as an aggregator. But how can this be turned into a useful service?
The first challenge we have to tackle is a way to generate strings that are long enough to be unique even if our service is highly utilized yet short enough that they won't fill up half of the screen real estate available for a given post. This one's easy; it's called a URL. Let's imagine that Google is providing this service. You would enter the name of the meme for which you want to create a tracking tag and Google would spit back a unique tracking URL that you could link to the name of the meme in the post, or to some "meme tracker" graphic, or anything else you'd like.
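To make the idea concrete, here's a minimal sketch of what tag generation might look like. The service name and URL pattern are made up for illustration; the point is just that a short random suffix can be unique even at high volume without eating up screen real estate.

```python
import uuid

# Hypothetical base URL for the tracking service (stand-in only).
BASE_URL = "http://example.com/meme/"

def create_tracking_url(meme_name: str) -> str:
    """Return a short, globally unique tracking URL for a meme.

    12 hex characters give ~2^48 possible tags: effectively
    collision-free even for a heavily used service, yet short
    enough to link from any post.
    """
    tag = uuid.uuid4().hex[:12]
    return BASE_URL + tag

url = create_tracking_url("idea virus")
```

The meme name itself isn't encoded in the URL here; the service would store the name-to-tag mapping on its side, which keeps every tracking URL the same short length.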
Now, how would you let the service know that a new instance of the meme exists on the web? Not everybody has trackback, XML-RPC, and what have you. The lowest tech method I can think of (albeit somewhat more resource-intensive than, say, a RESTful API) would be through referrer logs. When a person posts a new instance of the meme on a page, s/he would just click on the URL to go to the tracking page. The service would compare the URL of the referring page to all known instances of the meme and, seeing that the URL is a new instance, catalogue it.
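The referrer-log trick above can be sketched in a few lines. This assumes the tracking service sees the HTTP Referer header when someone clicks through from a page hosting the meme; the function and data-structure names are illustrative, not a real API.

```python
# tag -> set of page URLs known to carry that meme
known_instances = {}

def record_click(tag, referer):
    """Catalogue the referring page if it's a new instance of the meme.

    Returns True if a new instance was recorded, False otherwise.
    """
    if not referer:
        # Browser sent no Referer header; nothing to learn from this click.
        return False
    pages = known_instances.setdefault(tag, set())
    if referer in pages:
        return False  # already catalogued
    pages.add(referer)
    return True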
So far this is kinda nice but not yet what I'd consider to be super-cool. You can tag arbitrary content on the web, you can do it in a low-tech way to make it easy for everyone to do, and you can allow people who know about the service to submit their instances to the service for tracking very easily. But there are a couple of problems that this service doesn't solve yet. How do you find instances that people haven't tagged? How do you deal with overlapping meme labels?
The answer to these two problems is simple: Bayesian analysis. Once you have built up a corpus of tagged meme instances, you run a Bayesian filter similar to the adaptive systems used in good junk mail filters these days. The system can begin to recognize the characteristics (i.e., words, phrases, and other text strings) common to the various meme instances. It can tell you what those characteristics are. And it can construct a web search using those characteristics. It would even be fairly easy to put in a slider control, much like the aggressiveness slider in spam filters, to be more or less choosy about how wide a net the meme search should cast. Likewise, it would be fairly easy to run a comparison of two meme labels to see if there is overlap in the corpora. A "meme mapping" tool could tell you, for example, that there is an 85% overlap in terminology between the "idea virus" meme and the "meme" meme and even allow you to merge the two (or map the overlap and differences).
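The Bayesian piece might look something like the sketch below: a naive Bayes scorer trained on tagged instances versus ordinary pages, with the threshold acting as the aggressiveness slider. This is a toy (whitespace tokenization, add-one smoothing) and the training data shown is invented, but it's the same basic machinery spam filters use.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(meme_docs, other_docs):
    """Count word frequencies in tagged meme instances vs. ordinary pages."""
    meme_counts = Counter(w for d in meme_docs for w in tokenize(d))
    other_counts = Counter(w for d in other_docs for w in tokenize(d))
    return meme_counts, other_counts

def meme_score(text, meme_counts, other_counts):
    """Log-odds that `text` is an instance of the meme (naive Bayes,
    add-one smoothing). Positive means more meme-like than not."""
    m_total = sum(meme_counts.values()) + 1
    o_total = sum(other_counts.values()) + 1
    score = 0.0
    for w in tokenize(text):
        p_meme = (meme_counts[w] + 1) / m_total
        p_other = (other_counts[w] + 1) / o_total
        score += math.log(p_meme / p_other)
    return score

def is_instance(text, meme_counts, other_counts, threshold=0.0):
    """`threshold` is the aggressiveness slider: lower it to cast a wider net."""
    return meme_score(text, meme_counts, other_counts) > threshold
```

The overlap comparison falls out of the same data: compare the high-scoring vocabularies of two meme corpora and report the shared fraction.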
There would only be one more piece that we'd need to make the tool complete. Remember, we want to study meme propagation, which means we need to know how it spreads over time. So we'd need a tool that gathers a bit more meta-data about the meme instances and their relationship to each other on the network and over time. Specifically, we'd need to know:
- When each meme instance was published (to the best of our ability to determine)
- Whether that instance is linked to previous instances
- Whether the site in which the instance appears is otherwise linked to sites that contain previous instances
With this data, we could at least start making educated guesses about how particular memes spread, how fast they spread, the nature of the network in which they spread, and so on. This piece is beyond my extremely limited technical knowledge to plan out, but I would imagine that a relatively simple spider could gather all this info. And with this last piece in place, you'd have a fairly complete meme-tracking service that requires very little technical ability or infrastructure from the end users and relies on fairly basic and well-understood technologies.
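As a rough illustration of what such a spider would record, here's a sketch of the per-instance bookkeeping: each catalogued instance carries its best-guess publication date and the set of earlier instances its page links to. The names and the crude href-scraping regex are stand-ins, not a real crawler.

```python
import re
from dataclasses import dataclass, field

@dataclass
class MemeInstance:
    url: str
    published: str                    # best-effort date from the page
    links_to: set = field(default_factory=set)  # earlier instances this one cites

HREF = re.compile(r'href="([^"]+)"')

def catalogue(url, html, published, known):
    """Record a new instance and note which already-known instances it
    links to. `known` maps instance URLs to MemeInstance records."""
    inst = MemeInstance(url, published)
    for target in HREF.findall(html):
        if target in known:
            inst.links_to.add(target)
    known[url] = inst
    return inst
```

From these records, the propagation graph (who linked to whom, and when) can be read off directly; site-to-site links outside the instances themselves would need a second crawling pass.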
Note to hackers: You missed my birthday this year, but there's still plenty of time before Hanukkah...