Optimize for MTTR too, not just MTBF


Yesterday we had 20 minutes of service downtime. It was a typical upgrade-induced malfunction, an undesired side effect of migrating our Apache Cassandra cluster to new hardware. Thanks to the awesome response of our fabulous sysadmins, we managed to bring everything under control very quickly and without any permanent damage. Even though we have been using Cassandra successfully and without any major problems for more than a year now, it is still considered an experimental technology. Yesterday's outage therefore immediately brought back ideas of resurrecting the "good old MySQL". As with any other conservative ideology, the problem with such thinking is the human brain's tendency to remember only the good experiences while quickly forgetting the bad ones. There were (and still are) very good reasons why we decided to introduce Cassandra into our technology stack, the main ones being a lower MTTR (mean time to repair), horizontal scalability of reads, and reliable writes.

MTTR in particular is the main problem with MySQL. Having a central SQL database is a very convenient proposition, but only until it fails. Bringing a failed database back under the pressure of an ongoing outage is one of the most traumatic experiences for Operations people. Because Cassandra (like other NoSQL datastores) is much more distributed and has simpler data structures, it is much easier to repair than an SQL database. Our own case bore this out: our Operations guys managed to bring Cassandra back to life in less than 20 minutes. Great job, guys!
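
To put a rough number on why MTTR deserves as much attention as MTBF (mean time between failures), here is a minimal sketch using the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The function name and all figures below are hypothetical, chosen only for illustration; they are not measurements from our systems.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical scenario A: fails about once a year, takes 8 hours to repair.
rare_but_slow = availability(mtbf_hours=8760.0, mttr_hours=8.0)

# Hypothetical scenario B: fails about once a quarter, takes 20 minutes to repair.
frequent_but_fast = availability(mtbf_hours=2190.0, mttr_hours=20 / 60)

print(f"rare but slow to repair:     {rare_but_slow:.5f}")
print(f"frequent but fast to repair: {frequent_but_fast:.5f}")
```

Under these assumed numbers, the system that fails four times as often still ends up with the higher availability, simply because each repair is so much shorter.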