The idea is to:
- Get all the synopses for all the movies from Fantasia's website (using Pattern)
- Clean the text so it's in a format suitable for our need (using NLTK and/or basic Python functions)
- Train a linguistic model to find similarities between documents (using Gensim)
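To give a feel for the cleaning step, here is a minimal sketch using basic Python functions only (the post also mentions NLTK, whose stopword list could replace the tiny hand-rolled one below; the `clean` function name and the stopword set are illustrative, not from the original demo):

```python
import re

# A tiny stand-in stopword list; NLTK's stopwords.words("english") is richer.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "this"}

def clean(synopsis):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", synopsis.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(clean("The undead rise again: a zombie outbreak hits Montreal!"))
# → ['undead', 'rise', 'again', 'zombie', 'outbreak', 'hits', 'montreal']
```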
The linguistic model is where all the magic happens. Say I would like to see a zombie movie. Of course, I could search for a film where the string "zombie" appears in its description, but I would miss all the ones where these creatures are described as "undead" instead. The key here is to find out which topics are in the documents. A topic is a weighted distribution over words: my "zombie" topic would assign high weights to the words that are representative of the topic ("zombie", "undead", "horror", etc.) and lower weights to everything else. I use latent semantic indexing (LSI) to create these topics. It's so easy to do with Gensim that it takes literally one line of code (five if you count a bit of preprocessing with TF-IDF). Once we have the topics, we just have to find which movie descriptions have a topic distribution similar to the film we wanted to see.
How is it done concretely?
This was the subject of my talk at PyLadies in July, and you can find all the code from the demo here (as an IPython Notebook on Gist). Please feel free to fork this project and adapt it to fit your interests! Which other songs have lyrics similar to your favorite song?