Data is pretty much meaningless without something to compare it to. I could tell you that a certain individual weighs 246 pounds after dieting, but it makes more sense to also say that they weighed 310 pounds before they started. The effectiveness of the dieting plan can only be shown by comparing before and after. Similarly, in performing our optimization experiments, it just makes sense to gather some “before” data with which to compare our post-optimization speeds.
Gathering our benchmark data isn’t quite as simple as putting our servers on a bathroom scale. There are many distinct measures of performance, each of which is affected differently by different optimizations. Furthermore, performance metrics have a habit of fluctuating based on a wide variety of uncontrollable conditions. To that end, we decided early on that if we wanted a solid baseline for comparison, we needed to gather all the “before” data we could.
Baseline Scope
Our original plan was to test our top ten pages, each on three different browsers, across a total of seven different locations around the world, five times each. After writing a small bash script to automate the process of collecting data (a rough sketch of that kind of automation appears below), we ran into a bit of a wall: our public API key for webpagetest.org only allows us to run up to 200 tests per day. After some discussion about exactly how long these benchmarks would take if we stuck with our original plan, we decided to pare down the number of benchmark tests to the following:
- Our top 3 pages
- 3 browsers – Chrome, Firefox, IE9
- 6 locations – Amsterdam, Netherlands; Dulles, VA; San Jose, CA; Sao Paulo, Brazil; Sydney, Australia; Tokyo, Japan
- 6 tests each
- Total: 324 test results
The benchmarks were originally set at 3 tests each; we ran them twice to get more accurate averages. We also switched one of the locations from St. Petersburg to Amsterdam due to repeated failed tests from that location.
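For the curious, the automation itself doesn’t have to be fancy. The snippet below is not our actual bash script, just a minimal Python sketch of the same idea: loop over every page, browser, and location combination and queue each batch of runs through WebPageTest’s public runtest.php endpoint. The API key, page URLs, and location identifiers here are placeholders, not our real values.

```python
# Minimal sketch of the kind of automation described above.
# Not our actual bash script; the API key, page URLs, and location IDs
# below are placeholders, so adjust them for your own WebPageTest setup.
import itertools
import requests  # third-party HTTP client (pip install requests)

WPT_API_KEY = "YOUR_API_KEY"                      # placeholder
PAGES = ["https://example.com/page1",             # placeholders for the top 3 pages
         "https://example.com/page2",
         "https://example.com/page3"]
BROWSERS = ["Chrome", "Firefox", "IE9"]
LOCATIONS = ["Amsterdam", "Dulles", "SanJose",    # placeholder location IDs; query
             "SaoPaulo", "Sydney", "Tokyo"]       # getLocations.php for the real ones

def submit_test(url, location, browser, runs=3):
    """Queue one WebPageTest job and return its test ID."""
    resp = requests.get("https://www.webpagetest.org/runtest.php", params={
        "url": url,
        "location": f"{location}:{browser}",  # WebPageTest encodes the browser in the location string
        "runs": runs,                         # 3 runs per batch; we ran each batch twice
        "f": "json",                          # ask for a JSON response
        "k": WPT_API_KEY,
    })
    resp.raise_for_status()
    return resp.json()["data"]["testId"]

if __name__ == "__main__":
    for url, location, browser in itertools.product(PAGES, LOCATIONS, BROWSERS):
        print(url, location, browser, submit_test(url, location, browser))
```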
The Problems
We ran into all kinds of weird problems while benchmarking, some of which we can explain and many of which we simply cannot.
Failed Tests
Some of our tests simply failed outright, with no explanation as to why. The Chrome server in Sydney was a huge offender here, often failing in the background and stating “Test completed, but no results were returned”. The Firefox server in Tokyo also had strange failures, stopping the test before it even began and giving us results such as “Load Time: 48 ms, Number of Requests: 0”.
We simply excluded the tests with no results from the benchmark, then ran additional tests that did produce results and included those in our data.
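If you script your own benchmarks, it helps to build that filter in from the start. Here is a rough sketch of the check we applied by hand; the result fields and sample numbers are hypothetical stand-ins, so map them to whatever your WebPageTest export actually calls them.

```python
# Hedged sketch: drop runs that "completed" but returned no usable data.
# The dict keys and numbers here are hypothetical; adapt them to your export.

def is_valid_run(run):
    """A run only counts if it actually loaded something."""
    # A sub-100 ms "load" with zero requests means the test never really ran.
    return run.get("requests", 0) > 0 and run.get("loadTime", 0) > 100

results = [
    {"location": "Sydney", "browser": "Chrome", "loadTime": 0, "requests": 0},    # failed silently
    {"location": "Tokyo", "browser": "Firefox", "loadTime": 48, "requests": 0},   # stopped before starting
    {"location": "Dulles", "browser": "IE9", "loadTime": 4125, "requests": 63},   # hypothetical real run
]

valid = [r for r in results if is_valid_run(r)]
print(f"kept {len(valid)} of {len(results)} runs")
```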
Number of Requests and Bytes In
You’d think that performing the exact same test on the exact same server on the exact same page would give the exact same results as the previous one. Unfortunately, you would be (and we were) completely wrong. It’s understandable to have fluctuations with things like Time To First Byte (TTFB) and Load Time, but number of requests and number of bytes downloaded should remain constant between tests on the same page, right?
Wrong.
Often the number of requests reported by webpagetest.org would vary strangely, especially between browsers. Fortunately, we discovered why: failed requests were being included in this number. Pulling up the detailed CSV for tests with high numbers of requests, we found that many requests were coming back with a status code of “-2”. Wait, what? Negative two? How does that work? I searched for hours to find an answer, but the closest I could find was that “-1” is a webpagetest-specific code indicating a generic failed request. That’s…really strange, but okay.
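If you want to hunt these down yourself, the per-request CSV export makes it straightforward to count them. The sketch below assumes the export has a column named “Response Code”; the actual header may differ, so check your file first.

```python
# Hedged sketch: count failed requests in a WebPageTest per-request CSV export.
# The "Response Code" column name is an assumption; check your export's header row.
import csv
from collections import Counter

def count_failed_requests(csv_path):
    """Tally requests whose status code is negative (webpagetest failure codes)."""
    failures = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            code = int(row.get("Response Code", 0) or 0)
            if code < 0:  # negative codes like -1 and -2 mark failed requests
                failures[code] += 1
    return failures

print(count_failed_requests("requests_detail.csv"))  # prints something like Counter({-2: ...})
```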
Funny story: Every test in Chrome and Firefox had these failed request errors. IE9 was the only browser, across all the locations we tested, to not have any failed requests. Go figure.
What struck me as even more odd was that the number of bytes in wasn’t consistent either, though these variations weren’t nearly as large as the ones in the number of requests. As it turns out, part of this seems to be Google Analytics’ fault: a JavaScript file from Google only downloaded some of the time, and it’s unclear why.
Outliers
Outright failure wasn’t the only problem we found. Occasionally a test would take a full extra second on DNS lookup, often doubling our overall load time. Even without the DNS problem, load times fluctuated by several seconds between otherwise similar tests for no apparent reason. While I was initially confused by these variations, our performance expert, Roger, says this is perfectly normal for web applications. Variations in the back end, such as concurrent processes on our server or user load, can cause unexpected slowdowns, both on our machines and on the performance testing servers.
We decided to remove some of the worst offenders by classifying them as outliers and excluding them from our calculations. We defined an outlier as any result falling more than two standard deviations from the mean of its group of comparable tests. Statistically, that cutoff keeps approximately 95% of the data while throwing out the performance anomalies.
This calculation eliminated a total of six results that were abnormally high.
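For anyone who wants to apply the same cutoff, the calculation only takes a few lines. This is a minimal sketch using Python’s statistics module rather than our actual spreadsheet work, with made-up load times for illustration.

```python
# Minimal sketch of our outlier rule: drop anything more than two
# standard deviations from the mean of its group of comparable tests.
from statistics import mean, stdev

def remove_outliers(load_times, num_sd=2):
    """Return only the values within num_sd standard deviations of the mean."""
    if len(load_times) < 2:
        return list(load_times)
    mu, sd = mean(load_times), stdev(load_times)
    return [t for t in load_times if abs(t - mu) <= num_sd * sd]

# Hypothetical load times (ms) for one page/browser/location group:
times = [3100, 3250, 2980, 3400, 3120, 9800]  # the 9800 ms run gets dropped
print(remove_outliers(times))
```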
So what have we learned?
- Do lots of tests; errors occur more frequently than you might expect.
- Due to uncontrollable variations in the server’s back end and/or network latency, load times may vary by up to 5 seconds.
- Plan for benchmarking to take longer than you think it will, so you have time to sort out the gremlins.
- IE9 never failed a single request. I’m chalking this up to the webpagetest servers rather than the browser itself, because I simply can’t believe what I’m seeing.
We are excited to present the actual data from our benchmarking. In an upcoming blog post this week, we will publish a detailed analysis of the performance results from all of the tests described above. Please stay tuned.