Last week a colleague of ours accidentally dropped a keyspace on our production Cassandra cluster. Fortunately, this wasn't a big deal: Cassandra takes a snapshot of the data before actually dropping a keyspace, so we lost only one hour of data and didn't need to restore the cluster from a backup. But bringing a keyspace back from a snapshot requires consistency checking, which consumes considerable resources and substantially slows Cassandra down.
In our application we have several timeouts set so that we give up on a request if it takes too long. With Cassandra slower than usual, many requests timed out and reported an error. When an error occurs in our system, besides logging it we also send an email to developers and system administrators. This usually works very well, since it forces us to take all errors seriously. But when a major outage happens, as it did with the dropped Cassandra keyspace, we get flooded by error emails, which renders email communication through our Zemanta addresses effectively useless. With hundreds of requests per minute failing and sending error reports, my inbox was swamped with several tens of thousands of emails. To my great surprise, Gmail eventually managed to deliver all of them, but my normal emails still arrived several hours late.
While this outage showed that our system is quite resilient to failures (users didn't notice anything), it also showed that our error reporting isn't adequate. We are still contemplating how best to solve the error reporting problem, so if you have any advice to share, we would be more than happy to listen.
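
To make the problem more concrete, here is a minimal sketch of one idea that comes up often: throttling error emails per error type. It is an illustration only, not something we run in production. The class name `ThrottledNotifier`, the `send` callback, the error `key`, and the five-minute default interval are all hypothetical; every error would still be logged as before, and only the emails would be limited.

```python
import time
from collections import defaultdict


class ThrottledNotifier:
    """Sends at most one notification per error key per interval (hypothetical helper)."""

    def __init__(self, send, min_interval=300.0, clock=time.monotonic):
        self._send = send                    # callback that actually delivers the email
        self._min_interval = min_interval    # seconds between emails for the same key
        self._clock = clock                  # injectable so the behavior can be tested
        self._last_sent = {}                 # key -> time of the last email
        self._suppressed = defaultdict(int)  # key -> errors skipped since the last email

    def report(self, key, message):
        """Return True if an email was sent, False if it was suppressed."""
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self._min_interval:
            self._suppressed[key] += 1
            return False
        skipped = self._suppressed.pop(key, 0)
        if skipped:
            message = f"{message} ({skipped} similar errors suppressed since the last email)"
        self._send(key, message)
        self._last_sent[key] = now
        return True
```

With something like this in place, an outage that produces thousands of identical timeouts would result in one email per interval for that error type, each carrying a count of what was skipped, instead of one email per failed request. The obvious trade-off is choosing the key: too coarse and distinct problems hide behind each other, too fine and the flood returns.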