failures

Forgive me for stating the obvious, but web applications are a critical part of global business in 2011. I see no alternative other than more dependence by companies everywhere on web software and Internet infrastructure. In my opinion, all business trend data predicts greater overall web usage, more complex application architectures, and tremendous spikes in extreme traffic volumes.

Critical Applications, Yet They Aren’t Getting the Investment Needed

ComputerWorld last week made a definitive statement regarding the critical nature of web applications:

Those who are unprepared are vulnerable to service outages, customer dissatisfaction and trading losses – and often when it hurts the most. Successful businesses understand the need to assure service and application availability if they want to retain customers, deliver excellent service and take maximum advantage of the opportunity their market offers.
This is not a theoretical problem – just look at the recent challenges for the London 2012 Olympics and Ticketmaster. Just when everyone wants to do business with you, you’re not available.

The London Olympics site was overwhelmed by high demand for tickets and many buyers received the message, “We are experiencing high demand. You will be automatically directed to the page requested as soon as it becomes available. Thank you for your patience.”

That’s a failure even if the representatives of the site said it had not crashed. Performance failure…pure and simple for the whole world to see.

Examples of performance failure like this seem to occur weekly, if not daily, somewhere in the global business universe of websites.

Transformative Moment? When Global Retailers Fail!

Recently Target.com crashed under extreme user volume. They cut a deal with a designer line of knitwear (Missoni) and promoted a special online sale on the morning before the products were sold in stores. By 8:00 a.m. EDT, the site was crashing. The Boston Globe went so far as to say:

“…the Missoni mess could be a transformative moment in the relatively brief history of e-commerce. Retail analysts say it shows that even though online shopping has made major strides since Victoria Secret’s website famously faltered during a 1999 webcast, companies still may not always have the technological muscle to meet consumer demand for such frenzied promotions.”

When will retailers learn? When are marketing departments going to consider the technical ramifications of their campaigns and launches? When will the IT department escalate performance engineering to a high priority? When will we stop reading about sites crashing under heavy volumes of traffic?

Hopefully never! Because these stories are great examples of why you need LoadStorm.

Target Inc.’s website crashed yesterday due to a special promotion. Apparently, the discount retailer has cut an exclusive arrangement with an Italian luxury designer called Missoni, and it would seem that the online sale of Missoni knitwear generated enough buyers to bring the site down.

I sure would like to know how many concurrent users killed it. I wonder how many requests per second the Target site was handling with a response time under 5 seconds?

Can there really be more than a few hundred knitwear aficionados that would hold Missoni goods in such high esteem? What are the odds that those few hundred would all be anxiously awaiting the online sale and access the site simultaneously?

Perhaps it was 5,000 or 50,000. The result is the same – lost revenue, bad press, unhappy customers, and brand devaluation.
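For a rough sense of what those user counts mean in requests per second, Little's Law relates the two: concurrent users = throughput × (response time + think time). The sketch below is only a back-of-the-envelope illustration, not a measurement of Target's site. The function name `required_throughput` and the 25-second think time are assumptions made up for this example.

```python
def required_throughput(concurrent_users: int,
                        response_time_s: float,
                        think_time_s: float = 0.0) -> float:
    """Requests per second needed to serve a given number of concurrent users.

    Little's Law: N = X * (R + Z), so X = N / (R + Z), where
    N = concurrent users, X = throughput, R = response time, Z = think time.
    """
    cycle_s = response_time_s + think_time_s
    if cycle_s <= 0:
        raise ValueError("response time plus think time must be positive")
    return concurrent_users / cycle_s


# Hypothetical scenarios: the user counts echo the guesses in the text,
# and the 25-second think time between clicks is an assumed value.
for users in (5_000, 50_000):
    rps = required_throughput(users, response_time_s=5.0, think_time_s=25.0)
    print(f"{users} concurrent users -> about {rps:.0f} requests/second")
```

Shorter think times or faster responses push the required throughput up quickly, which is why a sale that concentrates shoppers into the same few minutes is so hard on a site.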

I found this interesting because it shows how even large companies running in their own “safe” data centers can experience massive performance failure. SaaS and cloud providers may get media attention for outages, but isn’t it somewhat hypocritical of Fortune 5000 CTOs to claim that their internal systems are “safe”? Come on, let’s be real. Systems have been experiencing poor performance for 60+ years, and hardware will continue to fail, and software will always have bugs, and architects will overlook weaknesses, and CEOs will annually cut budgets for performance engineering.

Performance testing is important. Web developers never want their sites or applications to crash under load. If they do, so what? Not all of the failures below are related to high traffic, but the failures are notable because of their impact on users and the far-reaching ripple effect.

From promising thousands of customers £7 top-of-the-line computers, to nearly causing an 800-plane pile-up, ‘inconvenience’ doesn’t even begin to describe some of the most famous application failures.

I was recently reading an article on Yahoo entitled, “MS: No repeat of Xbox Live holiday meltdown (hopefully)”. What immediately struck me as interesting was the fact that Microsoft is still relying on hope to avoid another black eye with their customers.

Google Apps (previously Apps for your Domain or AFYD) and Gmail have been down for some users over the past day. We have not been affected, but apparently some are having an extended outage with 502 errors.

I am curious about whether the affected users are “Premier” customers who have paid for the service. Also, the reports say that the Gmail web client is down, but they do not say whether the IMAP, POP, and SMTP services are working.

Yesterday, CNET reported that Facebook was down. It appears that this was an isolated and brief incident. According to Facebook, this was likely a hardware issue and per-user (one that affects only a particular user or small group of users). This would be consistent with what I have read about their architecture, that it is partitioned across thousands of servers.

Some prominent services have experienced problems in the past week or so.

Infoweek article by David Berlind

A weekend outage plagued Amazon’s S3 service with 8 hours of disrupted service, described as “elevated error rates”. Their Simple Queue Service (SQS) was also affected, but the EC2 service was not.

I expect that these kinds of issues with Amazon will disappear, and we remain enthusiastic users of Amazon’s Web Services. EC2 has been surprisingly reliable for us with no downtime in nearly a year of use.

Overall, it seems that Twitter has improved since the terrible downtime in the first half of this year. This particular failure affects only part of the service, but one that is no doubt frustrating for many users.

Here are a few notes about Twitter’s architecture from their blog:

  • Large single MySQL database with replication to slaves for read queries
  • Originally a Ruby on Rails application, although I think they have migrated pieces to C++

This is not a list that you want to be at the top of. 37 hours of downtime is a lot for any service that wants to keep its users. Their incredible growth has made them a poster child for scalability issues.

Reunion.com, Pownce, Bebo and hi5 round out the top 5 with over 12 hours of downtime in the first four months of the year.

Social Network Downtime by Royal Pingdom

LinkedIn was down yesterday, adding to the list of recent failures in web services. The company reported this to be related to a recent update and higher-than-normal usage. The recently added features are similar to those offered by Facebook and Twitter, which allow users to send info about what they are currently doing to their connections.

Disclosure: We like LinkedIn. For business networking in the US, I don’t think there is a better service.

Several hours of downtime with Amazon’s Simple Storage Service last Friday caused problems for other services that rely on Amazon’s infrastructure. Twitter, Tumblr, SmugMug and others that depend on S3 were down or having problems. This challenges the wisdom of using cloud computing, but it is important to distinguish between availability and data loss. In this case there was limited availability but no data loss.
