I found this interesting because it shows how even large companies running on their own “safe” data centers can experience massive performance failure. SaaS and cloud providers may get media attention for outages, but isn’t it somewhat hypocritical of Fortune 5000 CTOs to claim that their internal systems are “safe”? Come on, let’s be real. Systems have been experiencing poor performance for 60+ years, and hardware will continue to fail, and software will always have bugs, and architects will overlook weaknesses, and CEOs will annually cut budgets for performance engineering.

Twitter

If you’ve used Twitter much in the past year, then you aren’t surprised the web application has been failing. The Fail Whale is famous. Make that infamous. Twitter has suffered more actual downtime than usual through much of June. Some reports say that we should expect more outages through July.

The Fail Whale officially spoke to the media and blamed the downtime on the World Cup. Yeah…right. It’s been apparent for a long time to most of us that Twitter is a victim of its own success. Too many users, too many tweets, too rapid growth.

Outages began on June 11, and poor site performance continued for about 5 days. Twitter is working on improvements and has publicly admitted to internal network deficiencies. I hope they get their system tuned soon. It would be nice if they could stay ahead of the growth curve too, but I live in the real world and expect to see the Fail Whale periodically forever. Or at least until they get a real revenue model.

Intuit

Intuit is well known, to say the least. Their TurboTax Online runs hundreds of thousands of concurrent users around tax time. Did you know that they are still using the original C++ application with each user’s own web-enabled copy instantiated on the server? Yep, I know. Crazy.

I guess I’m not shocked that Intuit had an overnight outage that prevented SMB customers from processing QuickBooks Online, Quicken, TurboTax Online, and QuickBase through the night of June 15. No credit card payments, no taxes calculated, no books reconciled for an estimated 300,000 small businesses. It was reportedly caused by a power failure that occurred during routine maintenance.

NetSuite

Popular and aggressive SaaS ERP provider NetSuite went down on April 26 for several hours. Its apps were inaccessible to customers worldwide. According to NetSuite, the cloud applications were down for under an hour, but reports say customers had performance problems for at least 6 hours. NetSuite blamed a network issue for the outage. They publicly stated that they had no idea how many customers were affected. Duh. This is a NYSE company. My guess is that they just didn’t want to put the large number in print…no number does less credibility damage than hanging the truth out there.

Microsoft Live

OK, I’m a skeptic about any Microsoft software reliability. Since 1983 I’ve been fighting the blue screen of death. There are MS haters, but I’m not really in that camp. I have been an MS Certified Professional and my company has been an MS Certified Solution Provider. That said, I love my Mac.

So when Microsoft Online Services suffered poor performance and an outage on January 28, affecting offerings such as Microsoft Business Productivity Online Standard Suite (BPOS), I wasn’t shocked. An MS post stated it was a network infrastructure problem, but it didn’t say how long the poor performance lasted or how many customers were affected. You don’t think that Windows was involved, do you? Me neither.

A couple of weeks later, on February 16, MS’s Windows Live services were reported as down. Not shocked this time either. Keyword: Windows. What was also not surprising is that they blamed hardware. However, doesn’t it seem a bit trite to have the world’s largest software company describe the cause of a global online services outage as “due to a server failure”?

According to Microsoft, there was an issue with the Windows Live ID service and log-ins failed for some customers, which increased the load on remaining servers. Things were back to normal after about an hour, Microsoft said.

Slightly understated, don’t you think? I’m sure it had nothing to do with the instability of Windows.
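For anyone wondering how a single server failure can snowball the way Microsoft described, here is a rough sketch of load shifting onto the surviving servers. Every number in it (server count, request rate, per-server capacity) is made up for illustration, since Microsoft published none of these, and the model assumes perfectly even load balancing, which real systems rarely have.

```python
def load_per_server(total_rps: float, healthy_servers: int) -> float:
    """Spread the total request rate evenly across the servers still up."""
    if healthy_servers <= 0:
        raise ValueError("no healthy servers left")
    return total_rps / healthy_servers


def simulate_cascade(total_rps: float, servers: int,
                     capacity_rps: float, failed: int = 1) -> int:
    """Return how many servers survive after an initial failure.

    Hypothetical model: any time the per-server load exceeds capacity,
    one more server drops out and the load is redistributed again.
    """
    healthy = servers - failed
    while healthy > 0 and load_per_server(total_rps, healthy) > capacity_rps:
        healthy -= 1  # an overloaded server falls over
    return healthy


if __name__ == "__main__":
    # With headroom, losing one server of ten is survivable.
    print(simulate_cascade(total_rps=9000, servers=10, capacity_rps=1000))
    # Run just a little hotter and the same failure takes out the whole pool.
    print(simulate_cascade(total_rps=9500, servers=10, capacity_rps=1000))
```

The point of the toy model is that the margin matters more than the failure itself: a pool running close to capacity has nowhere to put the extra load when one box dies.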

The Planet

The Planet is one of the world’s largest Web hosting companies and server hosting providers. It serves more than 20,000 businesses and more than 15.7 million Websites and has more than 48,500 servers under its management.

“On May 2 at approximately 11:45 p.m. Central time, we experienced a network issue that affected connectivity within The Planet’s core network in Houston that may have prevented your servers from communicating to the Internet,” Jason Mathews, overnight technical support supervisor for The Planet wrote in one of the community forums. “Network connectivity service was fully restored in less than 85 minutes; however, your servers may have been impacted by this incident.”

“In our ongoing analysis of the occurrence, we determined that one of four border routers in Houston failed to properly maintain standard routing protocols,” Mathews wrote. Yep, you are down. Poor performance. Dang hardware.

A second outage occurred on Monday morning. “We believe the network issues this morning are unrelated to connectivity problems customers in [Houston] may have experienced around 12 a.m. CDT.” Around 9:30 a.m. CDT, The Planet noted that services had been fully restored and that analysis found a circuit between Dallas and Houston caused the Monday morning disruption. Yep, hardware again.

Sage

Performance problems hit Sage North America in June, including a 22-hour outage on June 1-2 that put its order-entry system, online CRM, and internal systems such as email out of commission, according to a report on TheProgressiveAccountant.com. Sage reported the cause was in its storage area network. Not sure if that means hardware.

EMC

EMC launched Atmos Online in May 2009 as a scalable online storage infrastructure built on its Atmos technology. On Feb. 17, it went down.

Users reported messages such as, “EMC Atmos Online Temporarily Down For Maintenance” and “No suitable nodes are available to serve your request.”

EMC stated that the outage was caused by maintenance issues, but did not elaborate. The day before, EMC announced a bunch of cool updates. Guess there was a breakdown in the implementation of those cool updates. Call it “maintenance” if you want to, but I suspect the budget for testing was cut on this project. I’ll even go so far as to say they probably didn’t performance test at all.
