Apr 23, 2011

What The Amazon Outage Means For Startups

My blog, which is hosted on Amazon EC2, was down for about 18 hours due to an infrastructure issue at Amazon. I wasn’t alone. Quora, Reddit, and others were down as well.

When I started my blog, it was hosted on a server at Godaddy.com where I had registered my domain name. When I decided I was going to switch, I could have moved to a hosted platform. But I wanted to try out EC2 for a variety of reasons – not because I needed to run on EC2 but because I wanted the first-hand experience with what I consider a fundamental building block of today’s startups – Amazon’s cloud infrastructure. Plus, it was fast, inexpensive, and easy to get started.

There is, however, no button for “make this reliable,” aka, take this application (database, pages, IP address, logins, etc.) that I put on this instance, and make it work no matter what, regardless of whether my instance, storage, or network goes down. And if something bad happens, like a virus, database corruption, or security attack, get me back to where I was before things went bad, and without the bad stuff.

The real issue is that building scalable, reliable applications is difficult. People go to EBS because it is supposed to be more reliable, but one thing the Amazon outage showed is that it isn’t. That means either applications themselves have to be built in a reliable way, or infrastructure vendors have to make it easier and cheaper to obtain reliability through failover and redundancy. Either way, that’s a hard problem.

An even harder question is when things went bad and what the cause was. Was what looked like a little spike in database and CPU usage actually a bad bug infecting the system? Was it someone who got ahold of the admin password and hacked their way in? Or was it something lower level, like a disk error?

As with the Amazon failure, what should I do? Should I try rebooting? Should I restore a snapshot? (Whoops… the snapshot is in the same zone as my instance and primary storage, and until some changes were made, I couldn’t restore across zones.) If I take any of those actions, will it improve things or just make them worse?

The Ideal Scenario
What I’d really like is to put my application (in this case, WordPress) out on my instance and know that somehow, magically, it is being deployed reliably, and if anything does go seriously wrong at the application level, I can get back to the point a little while before things went bad, perhaps even changing usernames and passwords before it comes online. In theory, that’s what the cloud is supposed to deliver: highly reliable, on-demand compute and storage capacity. That’s easier said than done. The ideal scenario simply isn’t possible today.

In fact, I really don’t want to have to deal with instances, storage, and networks at all. On Amazon, I have to choose between small, medium, and large instances. But what I really want is to be able to say, give me another unit of compute, another unit of storage. If CPU usage is high, give me another few units on demand and also notify me that something’s going on so I can take a look. And I want all that with the security that separate instances provide: one instance doesn’t corrupt another, apps can be separated from each other, databases can be separated from each other; and my points of failure are reduced.
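As a rough illustration of the “give me another few units on demand and also notify me” rule, here is a minimal sketch. The function name `plan_capacity`, the `ScalingDecision` type, and every threshold are hypothetical choices made for this example; none of them corresponds to a real Amazon API.

```python
from dataclasses import dataclass


@dataclass
class ScalingDecision:
    units_to_add: int   # extra compute units to request
    notify_owner: bool  # whether to alert a human to take a look


def plan_capacity(cpu_samples, current_units, high_water=80, target=60, max_units=10):
    """Decide how many compute units to add from recent CPU readings.

    cpu_samples: recent average utilization readings, as integer percentages.
    high_water:  average utilization above which we scale up (assumed value).
    target:      utilization we want to fall back to after scaling (assumed).
    max_units:   hard cap so a runaway bug cannot scale without limit (assumed).
    """
    if not cpu_samples:
        return ScalingDecision(0, False)

    total = sum(cpu_samples)
    # Average is at or below the high-water mark: nothing to do.
    if total <= high_water * len(cpu_samples):
        return ScalingDecision(0, False)

    # Total work is (average load * current units). Find the smallest unit
    # count that brings average utilization down to the target, using
    # integer ceiling division to avoid floating-point rounding surprises.
    work = total * current_units
    needed = -(-work // (len(cpu_samples) * target))
    needed = min(needed, max_units)
    return ScalingDecision(max(needed - current_units, 0), True)


if __name__ == "__main__":
    print(plan_capacity([90, 95, 85], current_units=2))
```

The cap matters as much as the scaling rule: a spike might be real traffic, or it might be the bug or break-in described earlier, which is why the sketch always pairs a scale-up with a notification.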

That, in fact, is what EBS is supposed to provide when it comes to storage. The problem is that when the underlying infrastructure has a problem, it affects everything running on top of it. What is supposed to be more reliable becomes less so – a single point of failure.

What I need is a hybrid architecture that delivers the reliability of isolation and geographic, network, and hardware diversification, but with the ease of use of true scalability; when I want more units, I get them. But if a piece of underlying infrastructure goes down, it doesn’t impact me. All at the same price I currently pay (or perhaps at a very slight premium).
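A back-of-the-envelope sketch shows why diversification is so attractive, and why a shared dependency undoes it. The function names are made up for this example, and the arithmetic assumes failures are independent, which is exactly the assumption that breaks when a common piece of underlying infrastructure goes down.

```python
def redundant_availability(per_copy_availability, copies):
    """Chance that at least one of several independent copies is up.

    Assumes failures are independent; a shared dependency violates this.
    """
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1 - (1 - per_copy_availability) ** copies


def chained_availability(*component_availabilities):
    """Chance that every component in a chain is up at the same time.

    Models a stack where any single component failing takes the site down.
    """
    result = 1.0
    for availability in component_availabilities:
        result *= availability
    return result


if __name__ == "__main__":
    # Two independent copies, each up 99% of the time.
    print(redundant_availability(0.99, 2))
    # The same two copies sitting behind one shared 99% dependency.
    print(chained_availability(0.99, redundant_availability(0.99, 2)))
```

Under these assumptions, two independent 99% copies come out near 99.99%, but putting them behind a single shared 99% component drags the whole thing back below 99%.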

Throw in easy backup and restore, and a button I can push that will put up a “this site is temporarily down for maintenance” page (hosted elsewhere but without me having to pay for a whole other instance), plus the ability to bring up my new stuff while leaving the old stuff intact until I’ve verified that the new stuff works, and a built-in “scalability” simulator that takes a copy of my entire site, back-end, and infrastructure, right down to the hardware, and tests it at all the different points of failure, and you have a winner.

Sure, you can pull a number of pieces together to do all this, but a one-click solution available to web sites large and small? Like I said, easier said than done.

What the Amazon Outage Taught Us
What the Amazon outage really taught us is that being down for more than a few hours of late-night scheduled maintenance is simply unacceptable for today’s real-time startups. Users don’t care that your infrastructure provider had an issue: they blame you. With users and customers spread all around the world, there is no window longer than a couple of hours when it’s acceptable to be down. And even that is pushing the limit. Frustrated users write permanent posts and tweets while your numbers suffer. The impact lasts far longer than the time your site was down.
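To put the numbers in perspective, here is a small sketch that converts downtime into an availability percentage and back. The helper names are invented for this example, and it assumes a 365-day year with no other outages.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def availability_percent(hours_down, period_hours=HOURS_PER_YEAR):
    """Availability over a period, given total hours of downtime."""
    return 100 * (1 - hours_down / period_hours)


def downtime_budget_hours(target_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime allowed over a period at a target availability."""
    return period_hours * (1 - target_percent / 100)


if __name__ == "__main__":
    # The roughly 18 hours my blog was down, measured against a full year.
    print(round(availability_percent(18), 2))
    # The yearly downtime allowance at "three nines".
    print(round(downtime_budget_hours(99.9), 2))
```

By this measure, a single 18-hour outage works out to about 99.79% for the year, while a 99.9% target leaves a budget of only about 8.76 hours.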

Would those startups that were impacted by the outage be better off running their own hardware and data centers? Not necessarily, for all the reasons they didn’t do that to begin with. It’s expensive, it’s a lot of fixed cost, and it requires significant up-front capital expenditure. It also requires not only software but hardware expertise as well – actual data centers and servers, a near-constant turnover of old hardware for new, and long-term data center contracts.

No doubt startups that were impacted will be revisiting their architectures, changing their redundancy approaches, and accelerating those long-planned yet oft put-off re-architecting projects that will make them truly scalable and redundant. They may even put in more simulation tools to help them test usage spikes, infrastructure failures, security attacks, and data corruption.

But what they’ll really be hoping for is a solution that gives them the flexibility to scale with the reliability that comes from being isolated from other users of the same infrastructure, down to the physical level, at a price they can afford.

Perhaps after the finger-pointing is over and the black eyes are done being iced, the greatest outcome will be innovative new technologies, which will eventually trickle all the way down to today’s single-instance, single-application user, whether it be one of millions of small business owners hosting web sites, or a blogger like me.

3 Comments

  • Hi David,

    What you are asking for, high availability for a single-instance, single-application user, isn’t really possible. This is doubly so with n-tier applications like your example application, WordPress. If you have a single point of failure, then eventually there will be an outage.

    For example, I’m not aware of a single PaaS provider of any kind that provides a geographic (or even near-geo) business continuity plan as part of their core service. There are a few that help you out intra-cluster. But when you lose a DC, building, or city block, that doesn’t help much.

    The bad news is that applications themselves have to be re-architected to make use of primitives like EBS, AWS EC2 AMIs, multiple AWS AZs, and much more. This requires re-thinking the way software is written and potentially using a few different tools.

    I wrote a couple of articles recently along these lines, and many more over time, here if you’re interested:

    Cloud Native Applications
    http://www.productionscale.com/home/2011/4/17/cloud-native-applications.html

    Clouds and SPOF’s
    http://www.productionscale.com/home/2011/4/22/on-clouds-and-spofs-or-the-great-aws-outage-of-april-2011.html

    One very important thing: it is the application owner’s responsibility to make things highly available if a bit of downtime (or a lot) will break the bank.

    I’ve been building websites for a while now, and the number of businesses that wanted to “do it right” has been very small. They always cut costs when it comes to eliminating SPOFs that reduce potential availability, or they take the easy road. The good news? I think this is finally changing! Finally.

    To that end I’m excited to actually be building at least one application for a client that will have a shot at being called a Cloud Native Application when it’s done! Good times!

  • Disk storage vendors developed RAID levels 1 through 6, with various degrees of mirroring and data redundancy, and the ability to keep working when one or more drives go bad.

    The cloud service providers need to develop the equivalent for hosting. It *will* impact performance, but that can be managed for sites that do not require transaction-level accuracy. (E.g., if this comment is lost, will anyone care?)

    Beyond drives, it involves drive clusters, servers, networks both inside and outside the cloud center, and databases.

    This is not something a start-up wants to tackle, but the big boys playing in this field have the capability to make it happen. Their customers should start demanding it.

    N.B. I worked on making virtual storage simple for the masses at 3PARdata.

  • [...] availability. Finally, the platform has to stay up. When Amazon suffers an outage, when Twitter is temporarily unavailable, it’s not just users that are [...]
