Load testing tool outage - storm in Virginia

LoadStorm provides the lowest cost load testing tool by efficiently using cloud computing resources. In 2008, we launched our service based on the Amazon Web Services EC2 cloud. It has been very good for us. Rarely have we had any reliability issues.

However, the US-East-1 region (Virginia data center) for AWS got hit by a storm (not LoadStorm) last night. They lost power for about 30 minutes, which caused an extended outage for our load testing tool as well as other well-known services including Netflix, Heroku, Pinterest, and Instagram.

It is frustrating to us to experience any downtime, but I guess we are in good company. The data center outage was the result of a powerful electrical storm that struck the Washington, D.C. area, leaving as many as 1.5 million residents without power.

 

Where was the generator power?

Apparently the backup generators didn’t work correctly, so the servers lost power anyway. Hmmm... sounds like the engineering was inadequate. I know that type of equipment is expensive, but we are talking about one of the largest and most advanced data centers on the planet. There is no excuse. In the words of James T. Kirk in his scrap with Khan, “We got caught with our britches down.” Maybe the AWS engineers need to take the Kobayashi Maru test.

What’s worse is that customers were affected for more than 30 minutes as Amazon worked to recover virtual machine instances. “We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area,” Amazon reported at 8:30 pm Pacific time. Shortly after that message they said “power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online.”

It took several hours to fully recover the EC2 server instances and the EBS (Elastic Block Store) volumes. A number of big web applications and popular services were down for much longer. Although LoadStorm was back up by about 8:00 am MDT, Instagram was unavailable until about noon Pacific time Saturday, more than 15 hours after the incident began. Heroku reported 8 hours of downtime for some services, and ironically, Heroku is a cloud infrastructure provider that uses AWS data centers.

 

Let’s not do this again, please

We will be exploring ways to prevent this kind of outage in the future. LoadStorm utilizes several data centers in the EC2 cloud, but we had some critical dependencies on Virginia for core processing. Again, we weren’t alone, because this outage affected Netflix, which is famous for spreading its resources across multiple AWS availability zones. We spread ours too, and the strategy should allow both Netflix and LoadStorm to route around problems at a single data center. LoadStorm and Netflix have remained online through past AWS outages affecting a single availability zone, but this one got us all.
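
To illustrate why spreading across zones helps only if nothing critical is pinned to one place, here is a minimal Python sketch. It is not LoadStorm's or Netflix's actual code. The function name pick_zone, the zone names, and the health snapshot are all hypothetical, and a real system would get zone health from monitoring and not from a hard-coded dictionary.

```python
from typing import Dict, List, Optional


def pick_zone(preferred: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first zone in preference order that is reported healthy.

    Zones missing from the health map are treated as unhealthy.
    Returns None when no candidate zone is available.
    """
    for zone in preferred:
        if healthy.get(zone, False):
            return zone
    return None


# Hypothetical health snapshot: one zone has lost power.
health = {"us-east-1a": False, "us-east-1b": True, "us-west-1a": True}

# A component that may run in several zones can route around the failure.
load_generators = pick_zone(["us-east-1a", "us-east-1b", "us-west-1a"], health)

# A component pinned to a single zone has nowhere else to go.
core_processing = pick_zone(["us-east-1a"], health)
```

In this sketch the multi-zone component lands in a healthy zone, while the pinned one gets no zone at all, which is the kind of critical dependency described above.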

We apologize for any inconvenience this caused our customers. We will have a team meeting on Monday to seek engineering changes that will allow us to avoid this type of problem in the future.

Thanks for using LoadStorm as your cloud load testing tool.

 
