
What The Amazon Outage Means For Startups

My blog, which is hosted on Amazon EC2, was down for about 18 hours due to an infrastructure issue at Amazon. I wasn’t alone. Quora, Reddit, and others were down as well.

When I started my blog, it was hosted on a server at the company where I had registered my domain name. When I decided I was going to switch, I could have moved to a hosted platform. But I wanted to try out EC2 for a variety of reasons – not because I needed to run on EC2 but because I wanted the first-hand experience with what I consider a fundamental building block of today’s startups – Amazon’s cloud infrastructure. Plus, it was fast, inexpensive, and easy to get started.

There is, however, no button for “make this reliable,” aka, take this application (database, pages, IP address, logins, etc.) that I put on this instance, and make it work no matter what, regardless of whether my instance, storage, or network goes down. And if something bad happens, like a virus, database corruption, or security attack, get me back to where I was before things went bad, and without the bad stuff.

The real issue is that building scalable applications is difficult. People go to EBS because it is supposed to be more reliable, but one thing the Amazon outage showed is that it isn’t. And that means that applications themselves have to be built in a reliable way, or infrastructure vendors have to make it easier and cheaper to obtain reliability through failover and redundancy. The problem is, that’s a hard problem.

The problem is, when did things go bad and what was the cause? Was what looked like a little spike in database and CPU usage actually a bad bug infecting the system? Was it someone who got ahold of the admin password and hacked their way in? Or was it something lower level, like a disk error?

As with the Amazon failure, what should I do? Should I try rebooting? Should I restore a snapshot? (Whoops… the snapshot is in the same zone as my instance and primary storage, and until some changes were made, I couldn’t restore across zones.) If I take any of those actions, will it improve things or just make them worse?

The Ideal Scenario
What I’d really like is to put my application (in this case, WordPress) out on my instance and know that somehow, magically, it is being deployed reliably, and if anything does go seriously wrong at the application level, I can get back to the point a little while before things went bad, perhaps even changing usernames and passwords before it comes online. In theory, that’s what the cloud is supposed to deliver: highly reliable, on-demand compute and storage capacity. That’s easier said than done. The ideal scenario simply isn’t possible today.

In fact, I really don’t want to have to deal with instances, storage, and networks at all. On Amazon, I have to choose between small, medium, and large instances. But what I really want is to be able to say, give me another unit of compute, another unit of storage. If CPU usage is high, give me another few units on demand and also notify me that something’s going on so I can take a look. And I want all that with the security that separate instances provide: one instance doesn’t corrupt another, apps can be separated from each other, databases can be separated from each other; and my points of failure are reduced.
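The "give me another few units when CPU is high, and tell me about it" rule can be sketched in a few lines. This is only an illustration of the idea, not any provider's actual API: the function name `decide_units`, the 80% threshold, the step of two units, and the cap of ten are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ScalingDecision:
    units: int    # compute units to run after this decision
    notify: bool  # whether the site owner should be alerted


def decide_units(current_units: int, cpu_percent: float,
                 high_water: float = 80.0, step: int = 2,
                 max_units: int = 10) -> ScalingDecision:
    """Add `step` units when CPU exceeds `high_water`, capped at `max_units`.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if cpu_percent > high_water:
        # Scale up, but never past the cap, and flag it so a human can look.
        new_units = min(current_units + step, max_units)
        return ScalingDecision(units=new_units, notify=True)
    # CPU is in a normal range: keep what we have and stay quiet.
    return ScalingDecision(units=current_units, notify=False)
```

Calling `decide_units(1, 95.0)` would ask for three units and raise a notification, while `decide_units(1, 40.0)` would leave things alone. A real system would also need to scale back down and to decide where the new units run, which is where the isolation concerns above come in.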

That, in fact, is what EBS is supposed to provide when it comes to storage. The problem is that when the underlying infrastructure has a problem, it affects everything running on top of it. What is supposed to be more reliable becomes less so – a single point of failure.

What I need is a hybrid architecture that delivers the reliability of isolation and geographic, network, and hardware diversification, but with the ease of use of true scalability; when I want more units, I get them. But if a piece of underlying infrastructure goes down, it doesn’t impact me. All at the same price I currently pay (or perhaps at a very slight premium).

Throw in easy backup and restore, and a button I can push that will put up a “this site is temporarily down for maintenance” page (hosted elsewhere but without me having to pay for a whole other instance), plus the ability to bring up my new stuff while leaving the old stuff intact until I’ve verified that the new stuff works, and a built-in “scalability” simulator that takes a copy of my entire site, back-end, and infrastructure, right down to the hardware, and tests it at all the different points of failure, and you have a winner.
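The "bring up the new stuff while leaving the old stuff intact" idea, with a maintenance page as the last resort, comes down to a small routing decision. Here is a minimal sketch, assuming a hypothetical `is_healthy` check that I supply myself and a placeholder name `MAINTENANCE` for a page hosted somewhere else; none of these names come from a real hosting product.

```python
from typing import Callable, Optional

# Hypothetical label for a static "temporarily down" page hosted elsewhere.
MAINTENANCE = "maintenance-page"


def choose_target(old: Optional[str], new: Optional[str],
                  is_healthy: Callable[[str], bool]) -> str:
    """Pick which deployment should receive traffic.

    The new deployment only goes live once it passes verification.
    Otherwise the untouched old deployment keeps serving, and if
    neither is usable, visitors get the maintenance page.
    """
    if new is not None and is_healthy(new):
        return new
    if old is not None and is_healthy(old):
        return old
    return MAINTENANCE
```

With `healthy = {"v1"}`, `choose_target("v1", "v2", healthy.__contains__)` keeps traffic on `"v1"`; once `"v2"` is added to the healthy set, the same call returns `"v2"`. The hard part, of course, is everything hidden inside the health check.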

Sure, you can pull a number of pieces together to do all this, but a one-click solution available to web sites large and small? Like I said, easier said than done.

What the Amazon Outage Taught Us
What the Amazon outage really taught us is that being down for more than a few hours of late-night scheduled maintenance is simply unacceptable for today’s real-time startups. Users don’t care that your infrastructure provider had an issue: they blame you. With users and customers spread all around the world, there is no window longer than a couple of hours when it’s acceptable to be down. And even that is pushing the limit. Frustrated users write permanent posts and tweets, while your numbers suffer. The impact lasts far longer than the time your site was down.

Would those startups that were impacted by the outage be better off running their own hardware and data centers? Not necessarily, for all the reasons they didn’t do that to begin with. It’s expensive, it’s a lot of fixed cost, and it requires significant up-front capital expenditure. It also requires not only software but hardware expertise as well – actual data centers and servers, a near-constant turnover of old hardware for new, and long-term data center contracts.

No doubt startups that were impacted will be revisiting their architectures, changing their redundancy approaches, and accelerating those long-planned yet oft put-off re-architecting projects that will make them truly scalable and redundant. They may even put in more simulation tools to help them test usage spikes, infrastructure failures, security attacks, and data corruption.

But what they’ll really be hoping for is a solution that gives them the flexibility to scale with the reliability that comes from being isolated from other users of the same infrastructure, down to the physical level, at a price they can afford.

Perhaps after the finger-pointing is over and the black eyes are done being iced, the greatest outcome will be innovative new technologies, which will eventually trickle all the way down to today’s single-instance, single-application user, whether it be one of millions of small business owners hosting web sites, or a blogger like me.

Apr 23, 2011

doxo’s Steve Shivers on Fox News


doxo, which provides consumers a secure digital file cabinet for receiving and organizing bills and other important documents, announced in February that it had closed its $10 million Series B financing to support accelerated growth of the company.

Apr 18, 2011

Entrepreneurship: An Art Not A Science

Steve Blank and the Pivot

I was fortunate enough to attend one of the final Lean LaunchPad classes taught by Steve Blank, Ann Miura-Ko, and Jon Feiber at Stanford. One big message was on the pivot. It was incredible. In 10 weeks, students formulated an idea, “got out of the building” to get feedback on it, and then in many cases, pivoted to a completely new idea (or a distantly related one). All in 10 weeks.

Steve sat down with a company founder to brainstorm about his business. He pushed hard on the pivot.

“How long do you wait before you pivot?” the founder asked.

“Only you know the answer to that. That’s what makes entrepreneurship an art, not a science,” said Steve.

Well put.

Apr 6, 2011