A Little Cache Goes a Long Way

Drupal is the target of load testing for this series of articles. If this is your first time reading any of the articles in this series, please review the introduction and summary, Load Testing Drupal, for context on what we are doing.

Our first test plan focuses on anonymous users coming to a Drupal site and browsing content. We believe this covers the majority of Drupal traffic in the real world, and reflecting actual user traffic patterns is a critical component of effective web load testing. This plan is a good place to put a stake in the ground.

As with all of our Drupal load testing, we will use a LAMP stack. The test results presented in this article come from tests run on an Amazon EC2 Small instance. Our Drupal application has 1,000 users and 10,000 pages.

In the beginning, we didn’t change any of the default settings of Drupal, Apache, or MySQL. We left everything as it comes out of the box. We will make some simple adjustments later.

We set a benchmark of a 5-second average response time as the upper limit of acceptable performance. Anything above 5 seconds was considered a failure.

Using LoadStorm, we created a test plan with two scenarios. The first specifies which pages each virtual user visits and in what order. The second has LoadStorm randomly select pages on the site.

In the first scenario, we visited 5 different stories and returned to the home page between stories. There were 10 steps total and a random pause of 10-60 seconds between each step to simulate realistic user behavior (reading the page content before clicking again).
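
To make that scenario concrete, here is a minimal sketch of the same traffic pattern as a standalone PHP script using cURL. The base URL and node paths are hypothetical placeholders; in our tests the steps were defined in LoadStorm itself rather than in code.

```php
<?php
// Rough sketch of the scripted scenario: visit 5 stories, returning to the
// home page between stories, with a random 10-60 second pause between steps.
$baseUrl = 'http://example.com';   // hypothetical placeholder
$storyPaths = array('/node/101', '/node/102', '/node/103', '/node/104', '/node/105');

$steps = array();
foreach ($storyPaths as $path) {
    $steps[] = $path;   // visit a story
    $steps[] = '/';     // return to the home page
}

foreach ($steps as $step) {
    $start = microtime(true);
    $ch = curl_init($baseUrl . $step);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);
    printf("%-12s %.3f s\n", $step, microtime(true) - $start);

    // Simulate the user reading the page before clicking again.
    sleep(rand(10, 60));
}
```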

We ran three iterations of the test plan in order to confirm accuracy and eliminate statistical anomalies.
Each load test iteration ran for 30 minutes starting at 10 users and ramping up to 100 users. Here is a screenshot of the load test parameters.
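
As a rough sketch of how that ramp translates into target concurrency, assuming a linear ramp (our assumption for this sketch; LoadStorm manages the actual schedule):

```php
<?php
// Sketch of the 30-minute ramp: start at 10 virtual users and grow to 100.
// A linear ramp is assumed here for illustration only.
function targetUsers($minute, $duration = 30, $start = 10, $end = 100) {
    $minute = min(max($minute, 0), $duration);
    return (int) round($start + ($end - $start) * $minute / $duration);
}

foreach (array(0, 10, 20, 30) as $m) {
    printf("minute %2d: %3d users\n", $m, targetUsers($m));
}
```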

The average results were disappointing, to say the least. We were only able to reach an average of 67 users before our average response time started to go above 5 seconds, our benchmark for failure.

Here are the graphs that LoadStorm produced during one of the test runs.


The above graph gives us a couple of data points that are of interest. At just over 14 minutes into the test, you will notice in the top graph that both the Requests per second and the Throughput start to vacillate, going up and down in a somewhat irregular pattern. This indicates that a limit of some kind has been reached. In a well-performing system you should see both of those values continue to climb as the users increase. Obviously, at some point every system will level off, and that point is the maximum number of users the system can handle. So even if you did not have the bottom graph, you would know that something wasn't quite right. Without the bottom graph, however, you wouldn't know what your users were experiencing.

At the same time, in the second screenshot, you see the Response time, average start to climb rapidly, and not much later errors start to appear. Some systems will just slow down and not necessarily throw errors; the pages will simply take a very long time to load. In our case we both slow down and throw errors, and all of the errors are 408 Request Timeout.

In this first scenario we averaged 1.63 Requests/sec and 34 KB/sec Throughput.

So… in view of our dismal results out of the box, we knew some changes had to be made. We are all about small changes for big improvements. We also knew that adding more memory or another processor was not an option; we wanted to get THIS configuration running as well as possible. One of the difficulties with performance engineering is that there are literally dozens of things that can be changed and millions of different combinations. Because of that, we start with small changes and see where we can get some big performance increases.

Turn On Caching

The first thing we did was change Drupal's configuration to enable the Normal page cache. We left the minimum cache lifetime at 'none' and page compression enabled. We then ran the same test again, and the results were impressive. We were able to run all 100 users with no issues, and our Response time, average stayed well below 0.5 seconds. In fact, it was 0.034 seconds.
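
For anyone following along on their own install, these settings live on Drupal's Performance page. As a minimal sketch, assuming Drupal 6's variable overrides in settings.php (variable names differ in later Drupal versions), the same configuration can be pinned in code:

```php
<?php
// Sketch: the Performance settings used above, as settings.php overrides.
// Assumes Drupal 6 variable names (cache, cache_lifetime, page_compression).
$conf['cache'] = 1;              // Caching mode: Normal (CACHE_NORMAL)
$conf['cache_lifetime'] = 0;     // Minimum cache lifetime: none
$conf['page_compression'] = 1;   // Page compression: enabled
```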

Here are a couple of screen shots from one of the test runs.


In comparison to our earlier tests we saw some significant performance gains.

  • Requests/second went from 1.63 to 2.7
  • Throughput went from 34 KB/sec to 55 KB/sec

So we were feeling pretty good about ourselves at this point. Our performance had greatly improved, and we were not even coming close to our 5-second threshold. Being the masochists we are, we decided to turn up the pressure. We ran some larger tests of 200 and 500 users and were pleasantly surprised by the results. That one small change to the Drupal configuration really did pay some nice dividends.

Below are the results for 500 users since 200 users didn’t make the box sweat at all.

Even at 500 users we are not doing too badly. We notice a little bit of leveling off in the Throughput and Requests per second about 48 minutes into the test, which indicates we might be reaching a limit, but our Response time, average is still very good. We are still below 1 second, although we are starting to see some larger spikes.

At this time we started to wonder if our test was flawed. By turning caching on, were we skewing the results because all the data was in cache and we were not hitting anything but the same 10 pages over and over? That is when we came up with our second scenario. We used the Click Random Link feature of LoadStorm so that we were no longer hitting the same pages repeatedly, which guaranteed that we were not always hitting what was in cache.
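
To illustrate roughly what the Click Random Link behavior does, here is a minimal sketch in PHP: fetch a page, collect its site-internal links, and follow one at random after a think-time pause. The base URL is a hypothetical placeholder, and LoadStorm's actual implementation is its own; this only mimics the idea.

```php
<?php
// Sketch of "click a random link": follow random internal links so the test
// is not limited to the same handful of (cached) pages.
$baseUrl = 'http://example.com';   // hypothetical placeholder
$current = '/';

for ($i = 0; $i < 20; $i++) {
    $html = file_get_contents($baseUrl . $current);

    $dom = new DOMDocument();
    @$dom->loadHTML($html);        // suppress warnings from imperfect markup

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '' && $href[0] === '/') {   // keep site-internal links only
            $links[] = $href;
        }
    }
    if (empty($links)) {
        break;
    }

    $current = $links[array_rand($links)];
    sleep(rand(10, 60));           // think time before the next click
}
```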

Using our new scenario, we ran our original 100 users and again were pleasantly surprised by the results. Below is a screenshot of the results.

We can see that our performance is almost as good as when we scripted the 10 steps each user was to take. The only exception is that our Response time, peak has much more variance and greater spikes. Our Requests per second and Throughput are almost the same whether we hit the same pages over and over again or hit random pages. This indicates that Drupal is able to reuse considerable portions of each page for each request. A good design move by Drupal.

At this point we want to hurt this server, so we throw 1,000 users at it and see if we can figure out where the breaking point is with Drupal caching turned on. Once again we are pleased with the results, both because the server did pretty well for a while and because we finally made it give up and cry for mercy.

Below is a screenshot of the 1,000-user results.

The server did pretty well for a little over 45 minutes. At about 926 users we were averaging 26 Requests per second at 765 KB/sec, and our Response time, average was 4.5 seconds. Then the bottom fell out. After that the server was simply overwhelmed, and it started to throw errors and time out requests.

For sizing purposes, if you have a mostly anonymous user base accessing a Drupal instance on an Amazon EC2 Small server, you can expect, with good confidence, to service at least 800 concurrent users.

In load and performance tuning, a small change can make a huge difference. In our case, to accommodate a high volume of anonymous users, we just had to enable caching in the Drupal application. But… will that same small change help when we add authenticated users? Or will we be right back where we started, scratching our heads and wondering why our performance is poor? Find out when we publish part two of our case study.
