What do you do if you have a collection of documents and you want to find, for each one, the documents most closely related to it? One option is to develop a service as powerful and complex as Zemanta. But what technique would you use if you wanted to do it in the simplest possible way, without relying on any external service or software? I was confronted with this challenge recently and, of course, I couldn't resist it. The result is a simple recommendation engine that I'm presenting in this post. My approach is to first autotag all the posts and then match them by identifying the posts that have the most autotags in common with the query post. In effect, I've reduced the problem of matching to the problem of auto-tagging.
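To make the matching step concrete, here is a minimal sketch of ranking posts by the number of autotags they share with a query post. It is not the code from sre.py: the function name `recommend`, the `top_n` parameter, the tie-breaking by post id, and the sample tags are all assumptions made for illustration.

```python
def recommend(query_id, tags_by_post, top_n=3):
    """Rank other posts by how many autotags they share with the query post.

    tags_by_post maps a post id to the set of autotags assigned to that post.
    """
    query_tags = tags_by_post[query_id]
    overlap = {
        post_id: len(query_tags & tags)  # size of the tag intersection
        for post_id, tags in tags_by_post.items()
        if post_id != query_id
    }
    # Most shared tags first; ties are broken by post id (an arbitrary choice).
    ranked = sorted(overlap.items(), key=lambda item: (-item[1], item[0]))
    # Posts with no tags in common are not recommended at all.
    return [(post_id, shared) for post_id, shared in ranked if shared > 0][:top_n]


# Hypothetical autotags for four posts (illustrative data only).
tags_by_post = {
    "post-a": {"django", "python", "web"},
    "post-b": {"django", "python", "orm"},
    "post-c": {"python", "numpy"},
    "post-d": {"gardening", "tomato"},
}

print(recommend("post-a", tags_by_post))
```

Counting raw overlap is the simplest possible choice. Weighting each shared tag or normalizing by the number of tags per post would be natural variations, but they go beyond what is described here.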
Zemanta already provides very good auto-tagging functionality, but only as a service. The technology behind this service is far too complex and resource-intensive to be used as standalone software. Instead of using the full Zemanta service, I've used a collection of three million documents auto-tagged by Zemanta to learn a set of English words that are especially suitable for tagging. I've come up with a list of 18K such words that is available here. I considered a word suitable for tagging if the Zemanta auto-tagger used it as a tag at least 25 times. The resulting list is ordered by a score: the number of times the word was used as a tag by the Zemanta auto-tagger divided by the total number of documents in which the word occurs. For example, the word django occurs in 1863 of the 3 million documents and was used as a tag for 858 of them. The resulting score of the word django is 858/1863 = 0.46, which makes it the 222nd most suitable word for tagging.
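The scoring rule is easy to express in code. The sketch below is only an illustration of it, not the script used to build the list: the function name `rank_tag_words` and the layout of `stats` are assumptions, and only the django counts come from the numbers above. The other two entries are made up.

```python
MIN_TAG_USES = 25  # a word must have been used as a tag at least this often


def rank_tag_words(stats, min_tag_uses=MIN_TAG_USES):
    """Order words by tag_uses / doc_count, highest score first.

    stats maps word -> (times used as a tag, number of documents containing it).
    """
    scored = [
        (word, tag_uses / doc_count)
        for word, (tag_uses, doc_count) in stats.items()
        if tag_uses >= min_tag_uses and doc_count > 0
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


stats = {
    "django": (858, 1863),    # counts quoted in the post
    "the": (30, 2_900_000),   # made-up counts: a very common word scores low
    "quux": (10, 12),         # made-up counts: dropped, fewer than 25 tag uses
}

for word, score in rank_tag_words(stats):
    print(f"{word}\t{score:.2f}")
```

The minimum-use threshold matters because a rare word tagged in most of the few documents it appears in would otherwise get a very high score from very little evidence.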
I haven't done an extensive evaluation of the engine, but at first glance its results look very good. I've included three evaluation datasets that can be used to fine-tune the parameters of the engine. Please try the engine yourself by cloning or downloading the code from GitHub and running the following command:
python sre.py avc_blog.json unigrams.csv
Let me know in the comments what you think of it.