This post is a continuation of my Scaling Magento testing experiment (see part one).

Testing Phase

After getting the first testing environment set up, it was already time to test. Luckily for me, the VUser activity script had been finished before this project was even proposed, so running an identical load test was as simple as copying and editing our previous load test.
Here is the activity script used in both tests. In this case, x refers to the xth VUser in the load test.

  1. VUser x hits the homepage
  2. VUser x logs in as “wil”, a previously generated customer account
  3. VUser x ends up on wil’s account page
  4. VUser x clicks on a certain category (i.e. furniture, electronics, apparel)
  5. VUser x clicks a certain product (y out of 25)
  6. VUser x adds that product to their cart
  7. VUser x views that cart

This scenario is specifically used to stress test the WebPerfLab’s Magento store.
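
The real script was built from recorded playbacks in our load-testing tool, but purely as an illustration, here is a minimal sketch of that seven-step flow written as a Locust user class. The store host, login form fields, and category/product URL patterns are assumptions for illustration, not the actual WebPerfLab values.

```python
# Minimal sketch of the seven-step VUser flow, expressed with Locust.
# URLs, form fields, and credentials below are placeholders, not the real store's.
import random
from locust import HttpUser, task, between

CATEGORIES = ["furniture", "electronics", "apparel"]  # step 4 choices
PRODUCTS_PER_CATEGORY = 25                            # step 5: product y out of 25

class StoreVUser(HttpUser):
    wait_time = between(1, 3)  # think time between iterations

    @task
    def shop(self):
        # 1. VUser hits the homepage
        self.client.get("/")

        # 2-3. VUser logs in as "wil" and lands on wil's account page
        self.client.post("/customer/account/loginPost/", {
            "login[username]": "wil@example.com",        # placeholder credentials
            "login[password]": "hypothetical-password",
        })
        self.client.get("/customer/account/")

        # 4. VUser clicks a certain category
        category = random.choice(CATEGORIES)
        self.client.get(f"/{category}.html")

        # 5. VUser clicks a certain product (y out of 25)
        y = random.randint(1, PRODUCTS_PER_CATEGORY)
        self.client.get(f"/{category}/product-{y}.html",
                        name=f"/{category}/product-[y].html")  # group stats per category

        # 6. VUser adds that product to their cart
        self.client.post("/checkout/cart/add/", {"product": y, "qty": 1})

        # 7. VUser views that cart
        self.client.get("/checkout/cart/")
```

Running this with a ramping user count against the store's host would approximate the same scenario, though the actual tests used the recorded playbacks described above.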

Test Number 1

Testing environment #1 performed decently for being a single large, virtualized server. A chart of the load test results is available below:

Figure 3. Test #1 Load Test results (click to expand)

The single server held up until it reached the 12-minute mark, at which point we began seeing 35-second timeout errors. Once that happened, we considered the test to have reached a “failure” state.
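
To make that pass/fail criterion concrete, here is a small sketch of how the check could be expressed, assuming per-minute peak response times (in seconds) exported from the load-test report; the sample values below are hypothetical, not the actual Test #1 numbers.

```python
# Find the minute at which a test reaches the "failure" state:
# any peak response time at or above the 35-second timeout threshold.
TIMEOUT_THRESHOLD_S = 35.0

def first_failure_minute(peak_response_times_s):
    """Return the 1-based minute of the first timeout-level peak, or None if none."""
    for minute, peak in enumerate(peak_response_times_s, start=1):
        if peak >= TIMEOUT_THRESHOLD_S:
            return minute
    return None

# Hypothetical per-minute peaks; the real values came from the Test #1 report.
peaks = [1.2, 1.5, 2.1, 3.0, 4.2, 5.8, 7.5, 9.9, 13.4, 18.0, 26.5, 35.2]
print(first_failure_minute(peaks))  # -> 12, i.e. the 12-minute mark
```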

Test Number 2

Once test environment #2 was all set up, testing began again. Keep in mind that the URLs are exactly the same, so we could use the exact same playbacks for both tests.

Figure 4. Test #2 Load Test results (click to expand)

While the results do look unstable, notice the actual data values. Throughput and requests per second are much higher, error rates are lower in the beginning, and the highest peak response time is only about 17 seconds!

Analysis

In order to determine the differences between the baseline and the scalability benchmark, further detailed analysis was needed. The first step was a side-by-side comparison of the response time distribution:

Figure 5. Response time distribution for Test #1 (left) and Test #2 (right) (click to expand)

Upon closer inspection, the response times are looking good for Test #2. Take into account the difference in the response time scales of the two charts: Test #2’s scale is half that of Test #1’s. So we are definitely seeing an improvement in both average and peak response times. After the response time analysis, I gathered more sets of data and looked for differences in percentage changes between the two tests.

Figure 6. Simple Analysis of the Two Load Tests (click to enlarge)

I collected data from every minute of each test, then took the average, maximum, and standard deviation of those data sets. Since Test #1 is our baseline and Test #2 is our scalability goal, we expected changes as described in my “Ideal Values” table. For example, we want our response times and error percentage to be low, our throughput and requests/second to be high, and our standard deviation to stay as close to zero as possible.
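
As a rough sketch of that comparison, assuming the per-minute samples have already been exported from the two test reports (the numbers below are placeholders, not the actual results), the per-metric summary statistics and percentage changes could be computed like this:

```python
# Sketch of the per-metric comparison between the baseline (Test #1) and the
# scaled environment (Test #2). The sample values are placeholders only.
from statistics import mean, pstdev

def summarize(samples):
    """Average, maximum, and standard deviation of one metric's per-minute samples."""
    return {"avg": mean(samples), "max": max(samples), "stdev": pstdev(samples)}

def pct_change(baseline, new):
    """Percentage change from the Test #1 value to the Test #2 value."""
    return (new - baseline) / baseline * 100.0

# Hypothetical per-minute response times (seconds) for each test.
test1_response_s = [2.1, 3.4, 5.0, 8.2, 12.9, 20.4, 35.1]
test2_response_s = [1.8, 2.2, 2.9, 3.5, 5.1, 9.7, 16.8]

s1, s2 = summarize(test1_response_s), summarize(test2_response_s)
for stat in ("avg", "max", "stdev"):
    print(f"{stat}: {pct_change(s1[stat], s2[stat]):+.1f}%")  # negative = improvement here
```

The same summarize/pct_change pass would be repeated for throughput, requests per second, and error rate, with the sign of a “good” change flipping for the higher-is-better metrics.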

In most cases, the metrics met those expectations. One exception was the maximum error rate, which increased from Test #1 to Test #2 by a whopping 144%! This could be attributed to the sudden spike in error rate near the end of Test #2, despite otherwise good performance metrics.

Conclusion

Going back to the statement of work, we considered peak timeouts above 35 seconds a “failure” state. Luckily, despite the increase in errors, Test #2 never went above a 17-second peak response time.

In general, our average response time decreased. It’s great news that the average user essentially had her wait time cut in half, but consider how it relates to cost. Are these performance improvements good enough to justify another $10 an hour for hardware usage? At this point you might calculate the percentage increase or decrease per dollar spent, and then determine whether all that effort was worthwhile.
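
One way to make that call concrete is to divide the percentage improvement by the extra hourly spend. The $10/hour figure comes from the text; the response-time values below are placeholders standing in for the roughly halved average wait time:

```python
# Improvement-per-dollar sketch: percent change in a metric divided by the
# additional hourly hardware cost. Response-time values are placeholders.
EXTRA_COST_PER_HOUR = 10.00  # extra spend for environment #2

def improvement_per_dollar(baseline, new, extra_cost):
    """Percent improvement (for a lower-is-better metric) per additional dollar per hour."""
    pct_improvement = (baseline - new) / baseline * 100.0
    return pct_improvement / extra_cost

# Hypothetical average response times showing a wait time cut roughly in half.
avg_response_test1_s = 9.0
avg_response_test2_s = 4.5
print(f"{improvement_per_dollar(avg_response_test1_s, avg_response_test2_s, EXTRA_COST_PER_HOUR):.1f}% per extra $/hour")
```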
