How to Analyze the Results of a Large-Scale Load Test

Running a large-scale load test involves multiple challenges. The first challenge is to create the right testing environment that can realistically simulate thousands of concurrent users.

Simulating such a load realistically requires dozens of dedicated servers on a high-bandwidth network.

Any compromise in the testing environment can make the results look excellent and give false confidence, when in fact the environment was simply not strong enough to provide a realistic simulation.

To describe the way I approach a test report, I will use an example of a test executed a few weeks ago by Tescom Singapore using BlazeMeter. The test capacity was 12,000 simultaneous users.

The large-scale test used 45 dedicated servers scattered all over the world, together simulating a load of 12,000 concurrent users.

With BlazeMeter, each dedicated server has a dual CPU, 1.7 to 7GB of memory and a bandwidth connection of 100 to 300Mbps.

Tescom created a load script that simulated several groups of users, each executing a different business process. For example: 10% of the users log in and view some pages, 30% search and read articles, and the rest do general browsing. All groups operate in parallel, which gives the most realistic simulation of traffic on the website under test.
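To make the traffic mix concrete, here is a minimal Python sketch that splits a target load across the three groups described above. It is not the actual Tescom script: the group names and the `users_per_group` helper are hypothetical, and only the 10% / 30% / remainder split and the 12,000-user total come from the test.

```python
# Traffic mix from the test, in whole percent. The group names are
# hypothetical labels chosen for this sketch.
USER_GROUPS = {
    "login_and_view": 10,
    "search_and_read": 30,
    "general_browsing": 60,  # "the rest"
}

def users_per_group(total_users, mix=USER_GROUPS):
    """Split total_users across groups by percentage.

    The last group receives whatever is left, so the counts always
    add up to total_users even when the percentages do not divide evenly.
    """
    names = list(mix)
    counts = {}
    assigned = 0
    for name in names[:-1]:
        counts[name] = total_users * mix[name] // 100
        assigned += counts[name]
    counts[names[-1]] = total_users - assigned
    return counts

print(users_per_group(12_000))
```

For 12,000 users this gives 1,200 users logging in, 3,600 searching and 7,200 browsing.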

The script included a gradual load scenario. This is always good practice: because the load increases slowly, it is much easier to see the point at which a problem first appears.

The test results were quite surprising and also educational. It is sufficient to have a quick look at the reports to see that the graphs yield very interesting conclusions.

I will use this article to describe these conclusions and illustrate a suggested approach while analyzing such reports.

Step One - Look at the Overall Average Report

My first step is to look at the Response Time vs. Users report for all transactions involved in the test. By "all" I mean the average response time of all requests, including pages, images, CSS and JS files, etc.

Large Scale Load Test

After having a quick look at the report, it became obvious that there was a problem. From the report it seems that at about 07:42:00 GMT the response time began to change and started to increase. Up until that time, the average response time was about the same. Up to almost 2,000 concurrent users, the response time was steady at a level of about 600ms.

Going from Idle to Sensitive

Looking at the above report several points become obvious:

  • Average response time while the website under test is still not sensitive to load. We call it Idle Time. This is the average response time when only a few users are visiting the website under test and up to the point the website under test begins to be sensitive to the load. In this case it's about 600ms.
  • The point where the website under test becomes sensitive to load. We call it the Load Sensitivity Point. From that point the response time started to increase as the load increased.
  • The absolute time of the Load Sensitivity Point. In this case, it was 07:42:00 GMT. This point enables one to identify the number of users that were accessing the website under test at the point where the website under test became sensitive to load.
  • The number of users accessing the website under test at the Load Sensitivity Point. In this case it was almost 2,000.
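The points above can be read off a graph by eye, but they can also be computed from exported samples. The sketch below is one possible approach, not BlazeMeter's method: it takes the median of the first few samples as the Idle Time and reports the first sample whose response time stays above that baseline by a chosen margin. The function name, the parameters and the sample values are all assumptions made for illustration; the numbers are not the real test data.

```python
from statistics import median

def find_load_sensitivity_point(samples, baseline_window=3,
                                tolerance=1.2, sustain=2):
    """Locate the Load Sensitivity Point in a response-time series.

    samples: list of (time_label, users, avg_response_ms), ordered by time.
    Returns (idle_ms, sample). idle_ms is the median response time of the
    first baseline_window samples. sample is the first entry after the
    baseline window that starts a run of `sustain` consecutive samples
    above idle_ms * tolerance, or None if there is no such run.
    """
    idle_ms = median(s[2] for s in samples[:baseline_window])
    threshold = idle_ms * tolerance
    for i in range(baseline_window, len(samples) - sustain + 1):
        window = samples[i:i + sustain]
        if all(s[2] > threshold for s in window):
            return idle_ms, samples[i]
    return idle_ms, None

# Illustrative values only, shaped like the report described above.
samples = [
    ("07:30", 800, 590),
    ("07:34", 1200, 610),
    ("07:38", 1600, 600),
    ("07:42", 2000, 780),
    ("07:46", 2400, 950),
    ("07:50", 2800, 1300),
]

idle_ms, point = find_load_sensitivity_point(samples)
print(idle_ms, point)
```

Requiring several consecutive samples above the threshold keeps a single slow request from being mistaken for the start of the trend.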

Step Two - Look for a Bandwidth Bottleneck

A problem does not necessarily result from a bandwidth bottleneck. However, a bandwidth bottleneck is very easy to find, so it is worth checking first. For this we need to look at the throughput report.

Hits Vs Users Response Time Vs Throughput

Looking at the reports, there is an obvious bottleneck that is most likely related to bandwidth. In a normal test (a test without bottlenecks), throughput keeps increasing and reaches its limit only when the test reaches its full capacity. In this case, throughput should have continued to increase until the full 12,000 users were accessing the website under test. The full load capacity was reached at 09:26:00 GMT, while the bandwidth consumption reached its limit at 07:40:00 GMT. The probable reason is a bottleneck.
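The same comparison can be sketched in a few lines of Python: find the first sample at which throughput gets close to its peak and compare that time with the time the full load was reached. The `throughput_plateau_start` helper, the 95% margin and the sample values are assumptions for illustration. Only the 07:40 and 09:26 timestamps come from the report.

```python
def throughput_plateau_start(samples, near_peak=0.95):
    """Return the time label of the first sample near peak throughput.

    samples: non-empty list of (time_label, throughput), ordered by time.
    A sample counts as "near peak" when it is at least near_peak times
    the highest throughput seen in the whole test.
    """
    peak = max(value for _, value in samples)
    for label, value in samples:
        if value >= near_peak * peak:
            return label
    return None

# Illustrative throughput in GB per minute; not the real test data.
throughput = [
    ("07:20", 0.40),
    ("07:30", 0.90),
    ("07:40", 1.38),
    ("08:00", 1.40),
    ("09:26", 1.39),
]
full_load_time = "09:26"

plateau = throughput_plateau_start(throughput)
# Zero-padded HH:MM labels from the same day compare correctly as strings.
print(plateau, plateau < full_load_time)
```

A plateau that starts well before the full load is reached is the signature described above.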

Identifying Bandwidth related Bottlenecks

Looking at the above report several points become obvious:

  • The potential throughput limitation. In this case it's close to 1.4GB per minute, which works out to ~187Mbps. This is only a potential bottleneck. It should still be verified.
  • The point in time when the bandwidth reached its limit. This point will help us to identify the number of users that were accessing the website under test at that point. It's no coincidence that this is the Load Sensitivity Point we mentioned earlier.
  • The flat line itself. Throughput stops growing even though the number of users keeps increasing, which shows that the test hit an actual bandwidth limit.
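The conversion from 1.4GB per minute to ~187Mbps is simple arithmetic. The short sketch below assumes decimal units (1 GB = 1,000 MB) and 8 bits per byte; the function name is made up for this example.

```python
def gb_per_minute_to_mbps(gb_per_minute):
    """Convert gigabytes per minute to megabits per second.

    Assumes decimal units: 1 GB = 1000 MB, 1 byte = 8 bits.
    """
    megabits_per_minute = gb_per_minute * 1000 * 8
    return megabits_per_minute / 60

print(round(gb_per_minute_to_mbps(1.4)))
```

That is, 1.4 x 1,000 x 8 = 11,200 megabits per minute, and 11,200 / 60 is roughly 187 megabits per second.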

Although it is obvious there is a bottleneck, it is not obvious that the bottleneck is related to bandwidth. There is an easy two-step process to determine if the bottleneck is related to bandwidth.

  1. Test with a browser from an external location during the load. With a bottleneck related to bandwidth, the perceived behavior should resemble the one in the report (i.e. very high response time).
  2. Test with a browser within the same LAN as the website under test. If the results are better, the connection between the LAN and the WAN has probably reached its limit. In other words, there is a bandwidth bottleneck. If the result is the same as in the report, the bottleneck is probably a different one and not related to bandwidth.
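The two checks above amount to a small decision table, sketched here in Python. The function name and the `slow_threshold_ms` parameter are hypothetical, and the "inconclusive" branch is an assumption of this sketch for the case the steps above do not cover, where the external check fails to reproduce the slowdown.

```python
def classify_bottleneck(external_ms, lan_ms, slow_threshold_ms):
    """Interpret response times measured under load from two locations.

    external_ms: response time seen by a browser outside the network.
    lan_ms: response time seen by a browser on the same LAN as the site.
    slow_threshold_ms: response time you consider unacceptable.
    """
    external_slow = external_ms >= slow_threshold_ms
    lan_slow = lan_ms >= slow_threshold_ms
    if external_slow and not lan_slow:
        # Slow from outside, fast from inside: the LAN-to-WAN link is suspect.
        return "likely bandwidth"
    if external_slow and lan_slow:
        # Slow everywhere: look at the servers or the application instead.
        return "likely not bandwidth"
    # Assumption: without a slow external result, nothing can be concluded.
    return "inconclusive"

print(classify_bottleneck(external_ms=9000, lan_ms=650, slow_threshold_ms=3000))
```

The example values are invented; in practice you would plug in the times you measure with the two browsers.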

During this test, Tescom tested with a browser within the LAN and the response time was very good. Testing from an external location presented poor results which is the same as what appeared in the real time report. This confirmed that the bottleneck is related to bandwidth.

Step Three - Looking for Errors

Reported errors can teach us a lot about the performance of the website under test. The most educational errors are 5XX errors, because they tell us about the state of the system. Usually, however, the website under test stops responding before it generates any such errors. When that happens we see many timeout or disconnection errors, and that is what happened in this test. At a certain point the website under test stopped responding altogether. Apparently it crashed at about 9,500 users.

Cloud Testing - Learning from HTTP Errors

At 08:58:00 GMT, numerous connection timeout and socket errors appeared because the website under test was no longer responding. At that time about 9,500 users were accessing the website.
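When the raw results are available, the moment errors start can also be found by counting errors per minute. The following is a minimal sketch that assumes each result record has been reduced to a (minute, error type) pair. The record layout, the function name, the `min_errors` cutoff and the sample records are all made up for illustration.

```python
from collections import Counter

def first_error_burst(records, min_errors=3):
    """Find the earliest minute with at least min_errors failed requests.

    records: iterable of (minute_label, error_type), where error_type is
    None for a successful request. Minute labels must sort chronologically
    as strings, e.g. zero-padded "HH:MM" within one day.
    Returns (minute_label, Counter of error types) or None.
    """
    per_minute = {}
    for minute, error in records:
        if error is not None:
            per_minute.setdefault(minute, Counter())[error] += 1
    for minute in sorted(per_minute):
        if sum(per_minute[minute].values()) >= min_errors:
            return minute, per_minute[minute]
    return None

# Illustrative records only; not taken from the real test.
records = [
    ("08:56", None),
    ("08:57", "connection timed out"),
    ("08:57", None),
    ("08:58", "connection timed out"),
    ("08:58", "connection timed out"),
    ("08:58", "socket error"),
    ("08:59", "socket error"),
]

print(first_error_burst(records))
```

The cutoff exists so that an isolated error earlier in the test is not reported as the start of the failure.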

Step Four - Correlating the Load Report with the User Experience Report

It is very important to correlate the load report with the user experience report. BlazeMeter automates two different systems to get comprehensive reports. The first system is for the load. Based on JMeter, BlazeMeter launches numerous servers that generate a load according to a load script. In parallel, BlazeMeter uses a different system based on Selenium to automate the launch of real browsers during the load period to measure render times and other KPIs as they are perceived by a real browser. The two systems are not connected but work in parallel to complement one another.

User Experience Report

Looking at the report, one can see that the user experience correlates with the load report. The render times measured by the browsers launched during the test are much higher under heavy load than they were when the load was only beginning to increase (point B).

Clicking on any point on the graph presents a waterfall breakdown of the browser request/response. It was obvious that some parts of the response were already missing at that point (point A). After a few more requests, the website under test stopped responding altogether.

More information can be gained by reviewing the user experience report, for example what causes the high render time: the connection time to the server, the wait time, or the response generation time. Clarifying these points helps to identify and then fix the performance-related problems. Once a problem is fixed, don't forget to test again to verify that it was actually fixed.

Going a Bit Further

The load report presents KPI measurements as perceived by the load engines. The user experience report presents measurements as perceived by real browsers. The last report presents measurements related to page speed optimization as defined by Google's PageSpeed.

Reading, understanding and implementing the conclusions from this report can help to optimize the website under test performance and also increase the ranking provided by search engines.

Pagespeed Optimization

Conclusion

Performance is usually a complex issue. This is even more relevant for large scale load tests. The origin of a performance related problem is not always clear. If you are lucky, you will see the problem with one glance of the report. Usually you will need to isolate problems by testing numerous load scenarios, then try to fix any problem found and test again.

BlazeMeter is a tool that can provide realistic simulation and load testing. It completely hides the complexity and the challenges related to creating such tests. The hardware, the software, the configuration, bandwidth and resource management required to generate a realistic and scalable test are all managed for the user. The reports are generated automatically in real time and provide measurement on every aspect related to performance. The ease of use and the short time to test enable the user to execute tests time and time again, identifying problems, fixing them and testing again to validate they were fixed.
