The new e-commerce experiments are here! The goal of these experiments is to expose how well these e-commerce platforms perform in their most basic setup when running on an Amazon m3.large server. To test the scalability of each platform, we will run three hour-long tests to ensure reproducible results. This time around we are utilizing quick, preliminary testing to establish a rough baseline for the amount of users each platform can handle. In addition, the Web Performance Lab has modified the criteria that will be used to determine performance failure.

WooCommerce Experiment Methodology

The first platform we will be re-working in this series of experiments was WooCommerce. After initializing the WooCommerce store, a set of four scripts was created to simulate user activity. In our preliminary load test, traffic was ramped up from 1 to 1000 VUsers over the period of an hour, using a single script that completed the new account registration process in the WooCommerce store. Parameterizing this script with user data allowed us to create new accounts that we could use to log in with during later tests, while at the same time obtaining an estimate of how many users the system could handle. Error rates spiked to an unacceptable 14.9% at just under 200 concurrent VUsers, and the test was stopped leaving us with 147 new accounts created in the WooCommerce system.

The second preliminary load test was run using all four scripts and weighting the traffic accordingly to mimic typical user transactions.

Screen shot 2014-05-27 at 12.03.05 PM

Because we already had a rough idea of where the test would fail, we only ran the test for 15 minutes, scaling up linearly to 500 VUsers. This time error rates remained relatively low until around 220 concurrent users, where they spiked to about 12%. Based on these results, we decided to run the three scalability tests scaling linearly from 1 – 250 concurrent VUsers. Each test would hold the peak load concurrency of 250 VUsers for the last 10 minutes of the hour-long test.

Criteria for Failure

The tough question and the crux of why we’re re-doing these experiments really is determining what scalability means.

As performance engineers, we understand that all of the metrics matter and are tightly coupled. However, after a lot of research and discussion, the Web Performance Lab agreed upon a set of 3 main criteria to use to determine a point of failure in our scalability tests, based on if any of the three criteria were met. The following three criteria were used to determine the system’s scalability, considering the impact of the performance scenarios:

Performance error rate greater than 5 %
Average Response Time for requests exceeding 1 second
Average Page Completion Time exceeding 10 seconds

All three tests had very similar tests results, varying less than 1% in the number of overall errors and requests sent between two of the tests. Due to this uniformity, we chose to analyze the median of the three tests. These are the results of the test, based on our new criteria for failure.

Data Analysis & Results

Performance Metric 1: Error Rates greater than 5 %

It wasn’t until the around the 35 minute mark in all three tests, error rates begin to increase, jumping up to 5.05% at 38 minutes in our median test run. The two types of errors we experienced included request read timeouts and internal server errors (500 errors). An internal server error means that the web site’s server experienced a problem, but could not give specific details on what the exact problem is. Request read timeouts indicate that the server took longer than the time limit LoadStorm’s parameters gave it to respond, so the request was cancelled. There were no specific requests in particular yielding these errors, which is good because it means there weren’t any particularly troublesome requests being made in the load test.


Performance Metric 2: Average Response Time exceeding 1 seconds for a request

It’s important to note that this is the average response time for all requests made in a minute interval, not an entire page. Taking a closer look at the performance metrics of all three tests, we can see several key indicators of performance limitations:

In the beginning of the test, the average response time for requests was very low, but it appears to increase in an exponential pattern as the concurrent users increase linearly. We can notice throughput and requests per second scaling proportionally with the linearly increasing VUsers. Interestingly, there appears to be a strong correlation between the point we notice those two metrics plateau, and the point where average response time spikes. This happens just before the 30 minute mark, which is indicative of scalability limits.

Performance Metric 3: Average Page Completion Time exceeding 10 seconds

The last key metric to study is the average page completion time. We chose to focus on the homepage completion for our testing purposes, as it’s the page most commonly hit. We realize we could’ve chosen a smaller time required for a page to complete, but we decided to be generous in our benchmarking. I expected to see page completion time begin to drop off near or slightly before the 38 minute mark, along with throughput and increase in error rate. Surprisingly, our criteria for failure was met at the 29 minute mark, at 147 concurrent VUsers, with average page completion time pushing 15 seconds. This means that the average user would have to wait over 10 seconds just for the home page to load, which is completely unacceptable.

Screen shot 2014-05-27 at 11.10.33 AM

Conclusions – Where does our system fail?

Based on our criteria for failure, we’ve determined the system to be scalable up to 147 concurrent VUsers. Our system’s bottlenecks appear to be on the server side, and our limitation is average page completion speed, a crucial metric of performance scalability.
Screen shot 2014-05-27 at 11.56.20 AM
To score our e-commerce platform, we took four different metrics that reflect volume and weighted them evenly. We compared the actual results with what we considered to be a reasonable goal for each metric for scalability.

Determining a set of criteria to base our scalability on was a challenge. As a young performance engineer, analyzing the results and studying how all of the different pieces correlate has been like solving a puzzle. Based on our original e-commerce experiments, I expected a slightly more scalable system. The metrics we used are absolutely essential as a predictive tool for benchmarking and identifying the real performance limitations in the WooCommerce platform.

What did you think of our experiment? Give us feedback on what you liked and on what you’d like to see done differently below!