Friday, July 3, 2009

Out of the bottle

Jinni is the best thing I've ever found on the internet. Simply the best.

Before we go on, I'd like to ask you to register at Jinni. It's completely FREE (for now at least).

Now that you have registered and certainly played around a bit, you have probably noticed how amazingly accurate it is, and how well it works out what you want to watch next.

This post won't be a long, drooling praise of the service. I'd like to tell you how it's done. Behind The Scenes. You know, what those geeks did behind the curtains.

It's actually pretty simple: it is called "classification". It is a very commonly used technique in data mining, text mining, etc.

The problem it solves is easy to state: you have things described by a bunch of attributes (size, color, name, etc.) and you want to put them in boxes, with similar things in the same box. But you don't want to do all that by hand. So why not use a computer?

In this particular case it is a bit more complicated than that. They wanted to "tag" the movies using a computer. They call these tags Movie genes.

First they probably took quite a lot of movies. Then, for attributes, they probably gathered reviews, plots, etc. (a ****load of text). This text became the attributes of each movie. Then they tagged at least 300 movies, but my best guess would be around 1000. By "they" I mean a bunch of film professionals, "experts", "movie nuts", whatever. Each of them tagged the movies individually, and then they sat down together, talked it over and reached a conclusion.

That conclusion (a bunch of movies with their reviews and their tags) is called the "training set".

Then they grabbed an Artificial Intelligence (really, I'm not kidding) and taught it how to tag movies using that training set. Geek note: I'm guessing they used the very common combination of tf*idf weighted n-grams, NER tools and handwritten rules.
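
For the geeks, here is a minimal sketch of what "teaching a machine to tag" could look like. It is only my guess at the general idea, not Jinni's actual code: it covers just the tf*idf weighted n-grams part (no NER, no handwritten rules), and the tiny training set, the tag names and the `threshold` value are all made up for illustration.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Lowercase word n-grams of length 1..n_max."""
    words = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

def build_idf(texts):
    """Smoothed inverse document frequency of every n-gram in the training texts."""
    df = Counter()
    for text in texts:
        df.update(set(ngrams(text)))
    n_docs = len(texts)
    return {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}

def tfidf(text, idf):
    """Unit-length tf*idf vector, stored as a dict; unknown n-grams are ignored."""
    tf = Counter(g for g in ngrams(text) if g in idf)
    vec = {g: c * idf[g] for g, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}

def cosine(a, b):
    """Dot product of two sparse vectors (a is unit length, b is a centroid)."""
    return sum(v * b.get(g, 0.0) for g, v in a.items())

def train(training_set):
    """training_set: list of (text, set_of_tags). Returns (idf, tag -> centroid)."""
    idf = build_idf([text for text, _ in training_set])
    sums, counts = {}, Counter()
    for text, tags in training_set:
        vec = tfidf(text, idf)
        for tag in tags:
            centroid = sums.setdefault(tag, Counter())
            for g, v in vec.items():
                centroid[g] += v
            counts[tag] += 1
    centroids = {tag: {g: v / counts[tag] for g, v in s.items()}
                 for tag, s in sums.items()}
    return idf, centroids

def predict_tags(text, idf, centroids, threshold=0.1):
    """Every tag whose centroid is similar enough to the text (threshold is arbitrary)."""
    vec = tfidf(text, idf)
    return {tag for tag, c in centroids.items() if cosine(vec, c) >= threshold}

# Made-up training set: (plot text, tags agreed on by the hypothetical experts).
TRAINING_SET = [
    ("a lonely robot falls in love in space", {"sci-fi", "romantic"}),
    ("two strangers fall in love in paris", {"romantic"}),
    ("a spaceship crew fights an alien in space", {"sci-fi", "scary"}),
    ("a haunted house terrifies a family", {"scary"}),
]

idf, centroids = train(TRAINING_SET)
print(predict_tags("an alien invades a haunted spaceship", idf, centroids))
```

A movie can get several tags at once here, which is why each tag has its own "average movie" (centroid) instead of forcing every movie into exactly one box.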

Most certainly they tried out a lot of things before reaching the best possible "tagger machine". It can be a struggle.

Keep in mind, these techniques are not that rare, but the evidence of their power is obvious.


Now, they have 2 very interesting things: The Search and the Recommendations.

The Search is based upon recognizing what genes you are looking for, or what text you're trying to find. It's not that complex to calculate these things nowadays.

The Recommendations aren't much harder to get. Basically they take your "interest vector" (a bunch of numbers showing how much you're interested in each tag), your Neighbourhood's interest vector and your Movie Circle's interest vector, mash them together and just calculate "how far the movies are from your interest". Then they just show you the nearest ones.
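
To make the "mash them together" part a bit more concrete, here is a small sketch under my own assumptions: the three vectors are blended with a weighted average, and "how far" is plain Euclidean distance. The tags, the weights, the movies and all the numbers are invented; I have no idea which formula Jinni really uses.

```python
import math

# Hypothetical tag order for every vector below: [funny, scary, romantic]

def blend(vectors, weights):
    """Weighted average of several interest vectors of equal length."""
    total = sum(weights)
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(len(vectors[0]))]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(interest, movies, k=2):
    """Titles of the k movies whose tag vectors are nearest to the interest vector."""
    return sorted(movies, key=lambda title: distance(interest, movies[title]))[:k]

me = [0.9, 0.1, 0.2]
neighbourhood = [0.7, 0.3, 0.1]
movie_circle = [0.8, 0.0, 0.6]

# Assumption: my own taste counts three times as much as each group's.
interest = blend([me, neighbourhood, movie_circle], weights=[0.6, 0.2, 0.2])

movies = {
    "Comedy A": [1.0, 0.0, 0.3],
    "Horror B": [0.0, 1.0, 0.0],
    "Romcom C": [0.7, 0.0, 0.9],
}

print(recommend(interest, movies))
```

With these numbers the comedy comes out nearest and the horror movie is far away, which is what you would expect from someone who mostly likes funny stuff.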


The question is: would you consider "taste changes"? People tend to change over time, and I'm not quite sure that I would rate Winnie the Pooh "Awesome" again, in spite of my love for it when I was little.

In Geek (not Greek :D): specifically, I'm talking about weighting interest ratings by the date of the rating.
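
One possible way to do that weighting, as a rough sketch: let each rating's weight halve every so many days, so old ratings fade out instead of counting forever. The exponential decay, the one-year half-life and the example ratings are my assumptions, not something Jinni has said it does.

```python
from datetime import date

def decayed_interest(ratings, today, half_life_days=365.0):
    """Weighted average of (rating, rated_on) pairs.

    A rating's weight halves every half_life_days, so recent ratings dominate.
    Returns 0.0 when there are no ratings.
    """
    num = den = 0.0
    for rating, rated_on in ratings:
        age_days = (today - rated_on).days
        weight = 0.5 ** (age_days / half_life_days)
        num += weight * rating
        den += weight
    return num / den if den else 0.0

# Made-up ratings for one tag, on a 1 to 5 scale.
ratings = [
    (5, date(1999, 7, 3)),  # loved it as a kid
    (2, date(2009, 6, 3)),  # not so much last month
]

score = decayed_interest(ratings, today=date(2009, 7, 3))
print(round(score, 2))
```

A plain average of those two ratings would be 3.5; with the decay the ten-year-old "Awesome" barely counts, so the score ends up just above 2.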