Careful with Caching

Last week I had a discussion with our head sysadmin Marko about why I'm not particularly fond of caching. My answer to him might be of interest to others, and it's also an opportunity to validate my arguments (so please share your opinion in the comments). My biggest beef with caching is that it is too easy to do incorrectly and quite hard to do properly. The first thought of a LAMP stack programmer upon experiencing performance problems is to set up a memcache server somewhere and add a few lines of code, after which all the performance problems magically vanish. Well, only until the servers crash under heavy load and you're trying to bring them back online during a cache stampede, when every request misses the empty cache and hits the backend at once. Synchronizing requests so that only one of them fetches the data while the others wait for the cache to be populated requires fairly complex logic that, unfortunately, is almost never implemented.
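To give an idea of what that synchronization involves, here is a minimal sketch in Python. It is an in-process illustration only, not a memcache client: the class name `SingleFlightCache` and the `loader` callback are made up for this example, and a real multi-server setup would need a shared lock or a similar mechanism on top of this.

```python
import threading


class SingleFlightCache:
    """Hypothetical in-process cache: one caller loads a missing key, the rest wait."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}     # key -> cached value
        self._in_flight = {}  # key -> Event that is set when the load finishes

    def get(self, key, loader):
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._in_flight.get(key)
                leader = event is None
                if leader:
                    # First request for this key: we will do the expensive fetch.
                    event = threading.Event()
                    self._in_flight[key] = event
            if leader:
                try:
                    value = loader(key)
                    with self._lock:
                        self._values[key] = value
                    return value
                finally:
                    # Wake the waiters whether the load succeeded or failed.
                    with self._lock:
                        del self._in_flight[key]
                    event.set()
            # Another request is already loading this key: wait, then re-check.
            # If the leader failed, one of the waiters becomes the new leader.
            event.wait()
```

Even this toy version has to deal with waiters, failed loads and cleanup, and it still says nothing about expiry or invalidation, which is roughly why the logic so rarely gets written.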

The second big issue I have with caching is the lack of transparency about what data the code is actually operating on. A cache is a very dynamic structure that doesn't keep history, so it's quite hard to tell whether a problem lies in (stale) data or in buggy code. Extensive caching of partial results further exacerbates the problem.

Caching should be used only when the usage pattern is unknown. That is, if you don't know which of your million web pages will be most used, it makes sense to implement caching to handle the "Hacker News" effect. But if you have just a dozen static pages, or you know in advance which pages will be the most popular, it makes much more sense to pre-generate them and serve static files. Doing so prevents cache stampedes and makes it very transparent what data is actually served to the user. At first blush, such a solution seems more complex than setting up a memcache server. But in practice, it takes much less work to implement properly.
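For comparison, pre-generating pages can be sketched in a few lines of Python. Everything named here is an assumption for illustration: `render_page` stands in for whatever produces your HTML, and `pregenerate` would be run from a deploy step or a cron job, with the web server pointed at the output directory.

```python
import os
import tempfile
from pathlib import Path


def render_page(slug):
    """Hypothetical renderer; a real one would query the database or a template engine."""
    return f"<html><body><h1>{slug}</h1></body></html>"


def pregenerate(slugs, out_dir, render=render_page):
    """Render each page to <out_dir>/<slug>.html and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for slug in slugs:
        target = out / f"{slug}.html"
        # Write to a temporary file in the same directory, then swap it in,
        # so a reader never sees a half-written page.
        fd, tmp_name = tempfile.mkstemp(dir=str(out), suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(render(slug))
        os.chmod(tmp_name, 0o644)  # mkstemp creates the file as owner-only
        os.replace(tmp_name, target)
        written.append(target)
    return written
```

The files on disk are exactly what users get, so checking what is being served is a matter of opening a file rather than guessing at the state of a cache.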
