Last week there was a very public service outage in Amazon Web Services (AWS) cloud infrastructure that took down a large number of web sites and services. This post at the High Scalability blog is a very comprehensive list of articles and posts covering the AWS outage from all possible angles.
So many great articles have been written on the Amazon Outage. Some aim at being helpful, some chastise developers for being so stupid, some chastise Amazon for being so incompetent, some talk about the pain they and their companies have experienced, and some even predict the downfall of the cloud. Still others say we have seen a sea change in future of the cloud, a prediction that’s hard to disagree with, though the shape of the change remains…cloudy.
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
Deliberately destabilising a live system to improve its robustness is counter intuitive but it seems to have paid off for NetFlix.