Zero Tolerance to Errors


A complex system such as Zemanta's consists of so many components that at any given time at least some of them are not operating optimally. If the system is instrumented with plenty of measuring and logging (as it should be), you soon become overwhelmed with error reports and charts indicating failure or degraded performance. Without a culture of zero tolerance to errors, problems get swept under the rug and explode at the most inappropriate times (Saturday evenings, just before a launch, or while signing up an important client are the favorite moments for failures to happen). I've heard "it's normal for that error to occur" or "that chart is always off" at Zemanta far more often than I'd like. Before joining Zemanta I worked at a place where such statements would never pass without somebody yelling at somebody, so I know we can do better. I wish we would develop zero tolerance to errors at Zemanta as well, acting proactively rather than always defensively.

Only recently we had a failure that could easily have been prevented. One of our more important servers had been reporting for quite some time that one of the disks in its RAID array had failed. We didn't replace it in time, and when we were moving the server to a new co-location, the second disk failed as well, wiping out the system along the way. We had a backup, of course, but recovering that part of the system still took three days, all of which could have been avoided by replacing the first disk when it failed.
