At Zemanta we love Apache Lucene/Solr. Our recommendation engine is built on it, and in our experience Lucene/Solr is a very fast and solid piece of software. While our index is not that big (several tens of millions of documents), our queries are quite complex. Since our users expect their recommendations to be returned within a couple of hundred milliseconds, our greatest scalability issue is response time. For now, we have avoided the need for sharding by implementing a custom extension to Lucene/Solr that lets us search the index using multiple CPU cores, where each core processes a different part of the index simultaneously. This solution gave us shorter response times without having to deal with index partitioning. But our index is growing faster than the number of cores available in our servers. We therefore plan to start using SolrCloud this year, so that we can continue to provide fast response times while greatly increasing the pool of news articles and blog posts that we can recommend to our users.

The New SolrCloud: Overview

Just the other day we wrote about Sensei, the new distributed, real-time, full-text search database built on top of Lucene, and here we are again, writing about another "new" distributed, real-time, full-text search server also built on top of Lucene: SolrCloud.