Defensive System Design


Yesterday we had an outage in a part of our system. Our datacenter provider lost network connectivity because of severed cables, and the alternative route became saturated, which resulted in packet loss. Fortunately, we have designed our system defensively, so end users didn't notice anything and only our revenue took a slight hit.

The essence of defensive design is to build your system out of distributed, independent, and layered components. In our case we rely on local installations of our WordPress plugin and the distributed nature of Amazon CloudFront to provide the first layer of our service, which works even if Zemanta goes down completely. Further on, our server infrastructure is again layered in such a way that basic functionality keeps working even when more complicated systems are failing, and it can still return meaningful, if degraded, results.
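To make the layering idea concrete, here is a minimal sketch of a fallback chain in Python. It is not our production code: the function names (`local_suggestions`, `remote_suggestions`, `suggest`) and the toy keyword logic are made up for illustration. The only point is that the richest layer is tried first and the simplest, fully local layer still answers when everything above it fails.

```python
from typing import Callable, List, Sequence, Tuple

# A layer is a (name, function) pair. All names here are hypothetical.
Layer = Tuple[str, Callable[[str], List[str]]]


def local_suggestions(text: str) -> List[str]:
    """Simplest layer: pure local computation with no network dependency."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    # Toy heuristic: treat longer words as keywords, return at most three.
    return sorted(w for w in words if len(w) > 6)[:3]


def remote_suggestions(text: str) -> List[str]:
    """Stand-in for a richer backend; it always fails here to simulate an outage."""
    raise ConnectionError("backend unreachable")


def suggest(text: str, layers: Sequence[Layer]) -> Tuple[str, List[str]]:
    """Try layers from richest to simplest and return the first result."""
    for name, layer in layers:
        try:
            return name, layer(text)
        except Exception:
            # Deliberately broad: any failure in a richer layer must not
            # prevent a simpler layer from answering.
            continue
    # Every layer failed: return an empty but well-formed result.
    return "none", []


layers = [("remote", remote_suggestions), ("local", local_suggestions)]
served_by, result = suggest("Defensive design keeps services running", layers)
print(served_by, result)
```

With the simulated outage, the call is served by the "local" layer, so the caller gets a degraded answer instead of an error. The important design choice is the direction of the dependency: `local_suggestions` knows nothing about the remote layer, so a failure there cannot take it down.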

In our experience, the best way to achieve a distributed, independent, and layered system is to start from something simple and local and iteratively add more complicated functionality, while making sure that the initial functionality does not depend on the more complex functionality added later. This is similar to the way biological processes evolve: evolution never refactors when it can incrementally add another mechanism, no matter how complicated.
