The Curious Case of Solr Malfunction

For the past several days our blog induction service did not function properly. BlogSpire relies on our index to retrieve all the news articles and blog posts of the past 24 hours. In recent days, however, the index was returning responses that were cut off at some arbitrary point, with no exceptions being thrown anywhere. Since we had upgraded the index just before this started to happen, our first inclination was to revert to the previous version of the index. At first this seemed to solve the problem, but by the next morning errors in fetching articles from the index had started to pile up again.

I strongly suspected problems at the network level, but our Ops people could not find anything suspicious there. At about that time we excluded the index server that was serving BlogSpire requests from the production cluster, and the problems were gone immediately. But when we started hammering this server with some other heavy load, the problems came back. I began to notice a strong correlation between heavy load on the index server and errors in accessing it from BlogSpire.

Once you are able to reproduce a defect, finding the exact cause is just a matter of patience and ingenuity. With the server under heavy load, we first ruled out Python as the source of the problem, since the same malfunction also happened when we queried Solr using curl. Next in line was the network layer. Using Idioterna's network sniffing capabilities, we quickly discovered that the response was already broken when it left Solr. Unfortunately, there was no indication of any malfunction in Solr except for the "Broken Pipe" notification that the client had closed the connection.

We have never experienced such a problem with Solr in our main API, but there we use XML for marshalling, whereas BlogSpire uses JSON. This provided the final piece of the solution. Once we switched BlogSpire to XML as well, the problems miraculously disappeared.
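Because the cut-off responses raised no exception on the server side, a client can at least detect them by checking that the body parses as a complete JSON document. The following is only a minimal sketch in Python 3, not the code BlogSpire actually runs; the function name `is_complete_json` and the sample bodies are made up for illustration.

```python
import json


def is_complete_json(body):
    """Return True if body parses as a complete JSON document.

    Hypothetical helper: a response cut off at an arbitrary point
    almost always leaves brackets or strings unclosed, so parsing fails.
    """
    try:
        json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


# Made-up sample bodies: one complete, one cut off mid-document.
complete = '{"response": {"numFound": 1, "docs": [{"id": "1"}]}}'
truncated = '{"response": {"numFound": 2, "docs": [{"id": "1"}, {"id"'

if __name__ == "__main__":
    for name, body in (("complete", complete), ("truncated", truncated)):
        print(name, is_complete_json(body))
```

A check like this only flags the broken response so it can be logged or retried; it says nothing about why Solr produced it.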

It seems that JSON marshalling in Apache Lucene/Solr (at least in version 3.1.0) has some serious problems under heavy load. Unfortunately, we don't have time to find the real culprit behind this issue. Instead, we have simply circumvented the problem by replacing JSON with XML. But if somebody has experienced the same problem and has found the true cause of this misbehaviour, I'd really like to know more about it.
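For anyone wanting to try the same workaround: in Solr the response format is selected with the `wt` request parameter, so the switch amounts to sending `wt=xml` instead of `wt=json` and parsing XML on the client. Below is a rough sketch in Python 3 under several assumptions: the helper names, the base URL passed in, and the sample response are all hypothetical, and only single-valued fields are handled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10, wt="xml"):
    """Build a Solr select URL; base_url is an assumed placeholder."""
    params = urlencode({"q": query, "rows": rows, "wt": wt})
    return base_url + "/select?" + params


def parse_docs(xml_body):
    """Extract documents from a Solr XML response as plain dicts.

    ET.fromstring raises ParseError on a truncated body, so a
    cut-off response fails loudly instead of passing unnoticed.
    Multi-valued (<arr>) fields are not unpacked in this sketch.
    """
    root = ET.fromstring(xml_body)
    docs = []
    for doc in root.findall("./result/doc"):
        docs.append({field.get("name"): field.text for field in doc})
    return docs


# Made-up response in the shape Solr's XML response writer produces.
SAMPLE = (
    '<response>'
    '<lst name="responseHeader"><int name="status">0</int></lst>'
    '<result name="response" numFound="1" start="0">'
    '<doc><str name="id">42</str><str name="title">Hello</str></doc>'
    '</result>'
    '</response>'
)

if __name__ == "__main__":
    print(build_select_url("http://localhost:8983/solr", "*:*"))
    print(parse_docs(SAMPLE))
```

This only sidesteps the JSON code path, as we did; it does not explain the underlying fault.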