SIGIR'12: Beyond Bag-of-Words


I'm in Portland, Oregon this week, attending the SIGIR conference. SIGIR is the premier conference on information retrieval in the world, which makes it very relevant to what Zemanta is doing. The conference started for me with a tutorial on Machine Learning for Query-Document Matching, titled "Beyond Bag-of-Words". The embarrassing secret of the field of Information Retrieval is that even after 50 years of research and billions of dollars spent, the simple bag-of-words representation, which concerns itself only with words and their frequencies while discarding word order altogether, still performs on par with or better than all other, more recent methods.
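
To make the bag-of-words idea concrete, here is a minimal Python sketch. The tokenizer (lowercasing, keeping runs of letters and digits) and the function name `bag_of_words` are my own assumptions for illustration, not anything shown in the tutorial. The point is only that two sentences with opposite meanings collapse into the same representation once word order is thrown away.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to its term counts, ignoring word order.

    The tokenizer is a deliberately naive assumption: lowercase the text
    and keep runs of ASCII letters and digits.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


first = bag_of_words("The dog bit the man")
second = bag_of_words("The man bit the dog")

# Both sentences contain the same words with the same frequencies,
# so their bag-of-words representations are indistinguishable.
print(first)
print(first == second)
```

This is exactly the kind of information that the methods covered in the tutorial try to recover.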

The goal of the "Beyond Bag-of-Words" tutorial was therefore to review the state of the art in methods that search for documents not only using terms, but also on the higher semantic levels of phrases, senses, topics, and structure. While the tutorial had a slow start in the morning, by the end it gave a nice overview of different matching methods that go beyond BM25. In particular, the tutorial presented matching with dependency, topic, and translation models, by query reformulation, and in latent space. If you found this blog post through any of the big search engines, you may be interested to know that these methods also played a part in delivering the results you saw. You might be a bit disappointed, though, to learn that none of these methods provides a substantial improvement in the relevance of results over good old BM25.
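
For readers who have not met BM25 before, here is a rough Python sketch of the scoring function. It is one common textbook variant, not necessarily the exact formulation used as the baseline in the tutorial: the whitespace tokenizer, the smoothed IDF, the parameter values `k1=1.2` and `b=0.75`, and the function names are all assumptions on my part.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; an assumption made for this sketch.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with BM25.

    k1 controls term-frequency saturation and b controls document-length
    normalization; the defaults are conventional values, not tuned ones.
    Expects a non-empty list of non-empty documents.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            tf = term_freq[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "information retrieval conference",
    "cooking pasta at home",
    "retrieval of information from documents",
]
print(bm25_scores("information retrieval", docs))
```

Note that the score depends only on term counts and document lengths, so BM25 is itself a bag-of-words method; that is what makes it such a stubborn baseline.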

The one comment I have on the tutorial is that the authors should really improve their presentation style and in particular their command of the English language.