Last year Twitter experienced plenty of downtime, which the company attributed to 'growing pains.' Since then, Twitter's development staff has published a steady stream of posts describing the issues and concepts behind their scaling efforts. Many users called for a push messaging system; others blamed Twitter's problems squarely on Ruby on Rails' supposed failure to scale. Twitter does not appear to want to replace Ruby on Rails any time soon, and it is not my place to settle that debate. I would, however, like to discuss some of the causes, consequences, and handling of the downtime related to Twitter's scaling.
What I appreciated most about Twitter's reaction to the downtime was its transparency. The team did not give away crucial or especially sensitive details of their architecture, but they were forthright in acknowledging fault, identifying contributing factors, and estimating which services would continue to be affected. They even took the time to respond to individual bloggers' questions. This is not typical, but I believe it was appropriate given that their sole service was suffering heavy downtime and climbing error rates.
Eliminating unhealthy features can save the core service. Scaling back the IM service in order to preserve the stability of the core system was a wise choice: the problem component can be removed and repaired while the primary service continues to serve clientele through alternate venues. Twitter users have their preferred ways of checking updates, and in many cases they may not settle for alternate methods of viewing statuses, but at least an emergency option remains available.
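This kind of graceful degradation is often implemented as a feature kill switch. The following is a minimal sketch of that idea in Python, not Twitter's actual code; the flag names, the delivery helpers, and the in-memory flag store are all assumptions for illustration.

    # Minimal feature kill-switch sketch (hypothetical; not Twitter's code).
    # Unhealthy features are disabled centrally so the core path stays up.

    DISABLED_FEATURES = {"im_delivery"}  # e.g. flipped by operators during an incident

    def feature_enabled(name: str) -> bool:
        """Return True unless the feature has been switched off."""
        return name not in DISABLED_FEATURES

    def deliver_update(update: str, followers: list[str]) -> None:
        # Core path: always attempt web/API delivery.
        for follower in followers:
            enqueue_web_delivery(follower, update)
        # Optional path: skip IM delivery entirely while it is unhealthy.
        if feature_enabled("im_delivery"):
            for follower in followers:
                enqueue_im_delivery(follower, update)

    def enqueue_web_delivery(follower: str, update: str) -> None:
        print(f"web: {follower} <- {update}")

    def enqueue_im_delivery(follower: str, update: str) -> None:
        print(f"im: {follower} <- {update}")

    deliver_update("status text", ["bob", "carol"])  # only web delivery runs

The point of the pattern is that the risky feature is removed in one place, rather than being guarded ad hoc throughout the codebase.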
Monitoring tools were added to Twitter's service only after large failures. It would have been wiser to have them in place beforehand; the time spent adding tools during an outage could instead have gone toward finding the errors. Nonetheless, it is never too late to add low-overhead monitoring in production and more detailed, high-clarity monitoring in testing. Developers also appreciate technical specifics about errors. Ruby on Rails developers seem particularly interested in the incidents at Twitter, since for some, Twitter's success or failure is a referendum on the scalability of Ruby on Rails.
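As a rough illustration of what low-overhead production monitoring might look like, here is a minimal sketch of an in-process error counter whose summary could feed a health endpoint or dashboard. The class, field names, and metrics are assumptions of mine, not anything Twitter has described.

    # Hypothetical low-overhead monitoring sketch: count errors in memory and
    # expose a cheap summary, instead of logging full detail in production.
    import threading
    import time
    from collections import Counter

    class ErrorMonitor:
        def __init__(self) -> None:
            self._lock = threading.Lock()
            self._counts = Counter()
            self._started = time.time()

        def record(self, kind: str) -> None:
            """Increment a counter; cheap enough to call on every failure."""
            with self._lock:
                self._counts[kind] += 1

        def snapshot(self) -> dict:
            """Summary suitable for a health endpoint or an ops dashboard."""
            with self._lock:
                uptime = time.time() - self._started
                total = sum(self._counts.values())
                return {
                    "uptime_seconds": round(uptime, 1),
                    "errors": dict(self._counts),
                    "errors_per_minute": round(total / max(uptime / 60, 1e-9), 2),
                }

    monitor = ErrorMonitor()
    monitor.record("timeout")
    monitor.record("db_error")
    print(monitor.snapshot())

During testing, the same hooks could be swapped for something far more verbose, such as full request traces, without touching the production path.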
Occasionally, developers will muse publicly about aspects of the design. The blogging community expects constant and instant gratification, so prolonged failures should be accompanied by in-depth explanations of the errors involved, as well as details of why the problem has not been corrected or keeps recurring. Such failures can be frequent when the need to scale is unanticipated; it is not easy to instantly scale out existing hardware and rebuild software to handle increased distribution. In fact, as Alex Payne explained, Twitter was not originally modeled as a messaging service.
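To make that architectural distinction concrete, the sketch below contrasts a pull-style design (scan followed authors' statuses at read time, as a content site might) with a push-style messaging design (fan each update out to per-follower queues at write time). It is a simplified assumption about the two models, not a description of Twitter's actual implementation.

    # Hypothetical contrast between a pull (content-site) model and a push
    # (messaging) model for delivering status updates; not Twitter's real design.
    from collections import defaultdict, deque

    STATUSES = defaultdict(list)    # author -> list of statuses
    FOLLOWERS = defaultdict(set)    # author -> set of followers
    INBOXES = defaultdict(deque)    # follower -> queued updates

    def post_pull_model(author: str, text: str) -> None:
        # Pull model: writes are cheap, but every read scans all followed authors.
        STATUSES[author].append(text)

    def timeline_pull_model(following: set) -> list:
        return [s for author in following for s in STATUSES[author]]

    def post_push_model(author: str, text: str) -> None:
        # Push model: the write fans out to each follower's queue,
        # so a read becomes a simple queue drain.
        for follower in FOLLOWERS[author]:
            INBOXES[follower].append(f"{author}: {text}")

    def timeline_push_model(user: str) -> list:
        return list(INBOXES[user])

    FOLLOWERS["alice"] = {"bob", "carol"}
    post_push_model("alice", "hello")
    print(timeline_push_model("bob"))  # ['alice: hello']

Moving from the first shape to the second while the service is live is exactly the kind of rebuild that cannot happen instantly.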
If Twitter does not grow, then someone else will.
http://blog.twitter.com/2008/05/i-have-this-graph-up-on-my-screen-all.html
http://blog.twitter.com/2008/05/too-much-jabber.html
http://blog.twitter.com/2008/05/not-true.html
http://dev.twitter.com/2008/05/twittering-about-architecture.html
http://dev.twitter.com/2008/05/youve-got-qs-weve-got-as.html
http://dev.twitter.com/2008/02/if-you-build-it-they-will-riff.html
http://dev.twitter.com/2008/02/solving-case-of-missing-updates.html
http://www.sitepoint.com/blogs/2008/06/06/did-rails-sink-twitter/
http://www.hueniverse.com/hueniverse/2008/03/on-scaling-a-mi.html