# How to Achieve System Observability with BlazeMeter: The Guide

March 10, 2022 | Performance Testing | By Todd Hall

With the ever-increasing use of microservices and the accelerating rate of change in modern platforms, it is becoming more challenging to monitor applications and quickly identify the cause of performance degradation. As a result, there has been growing interest in the topic of system observability. Much has been written and discussed, but what does it all really mean? In this post, we will cover observability from the perspective of BlazeMeter performance testing. We will also take a common-sense look at where to find the most meaningful metrics to enrich your understanding of the health and responsiveness of your application.

## Table of Contents

1. What is System Observability? Is it the Same as Monitoring?
2. Have We Achieved System Performance Observability?
3. Achieving Deeper System Observability
4. Alternatives to System Observability
5. Conclusion and Next Steps

## What is System Observability? Is it the Same as Monitoring?

System observability means the ability to understand a given situation and explain it. In order to "observe" something, we need enough information to discern what is going on. That information is collected through monitoring. In other words, monitoring is the first step toward observability.

Let's think this through with an example.

A magician stands in front of you. She shows you that there is nothing up either of her sleeves. Then she takes off her top hat. The hat is tall, black, and deep. She presents it to you for your inspection. The magician takes a large red handkerchief, waves it over the overturned hat, and says a few choice words. Suddenly, a rabbit pops up out of the hat.

You were certainly present in the situation, but did you really monitor it so you could gain observability and understand what happened?

The key here is that if you had truly monitored the entire trick, you might have noticed that the hat had a false compartment at the top, and that the rabbit was hiding there until it was pulled out for everyone to view. If you had all that information and the tools to interpret it, you would have been able to observe the trick.

Without this, you were simply left wondering how the magician fooled you!

### System Performance Observability

System observability is achieved when we can explain why the performance of a system is what it is. To achieve observability, we need metrics to analyze. Being able to generate predictable and repeatable traffic against a system is a great way to collect these metrics. They help us observe the performance capabilities of an application under test so we can determine whether its performance is acceptable.

### Getting Started with Performance Observability with BlazeMeter

BlazeMeter is a great tool for generating performance loads against a system under test, so we can collect the metrics that help us understand and explain performance.
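To make "predictable and repeatable traffic" concrete, here is a minimal Python sketch of a closed-loop load generator. It is not a representation of how BlazeMeter works internally; it only illustrates the kind of synthetic transactions and metrics involved. The endpoint URL, user count, and request count are placeholder assumptions.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party: pip install requests

# Placeholder target; substitute an endpoint on your own system under test.
ENDPOINT = "https://example.com/api/health"
CONCURRENT_USERS = 10
REQUESTS_PER_USER = 20

def run_user(_):
    """One virtual user issuing a fixed number of synthetic transactions."""
    timings = []
    for _ in range(REQUESTS_PER_USER):
        start = time.perf_counter()
        try:
            status = requests.get(ENDPOINT, timeout=10).status_code
        except requests.RequestException:
            status = 599  # treat transport failures (e.g. timeouts) as errors
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings.append((status, elapsed_ms))
    return timings

with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
    results = [t for user in pool.map(run_user, range(CONCURRENT_USERS)) for t in user]

latencies = sorted(ms for _, ms in results)
errors = sum(1 for code, _ in results if code >= 400)
print(f"samples={len(results)}  errors={errors}")
print(f"avg={statistics.mean(latencies):.1f} ms  "
      f"p90={statistics.quantiles(latencies, n=10)[-1]:.1f} ms  "
      f"max={latencies[-1]:.1f} ms")
```

Even this toy driver surfaces the same core summary figures a BlazeMeter report presents: sample count, error count, and average, percentile, and maximum response times.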
What we really want to find out is this: "Does the application respond within the acceptance criteria, for both functional and performance tests?"

By defining functional tests, you define how the application should respond. Taking this to the next level with performance tests, you gain an understanding of whether the application will continue to function as expected while under a load of your choosing.

All BlazeMeter components generate synthetic transactions against defined endpoints and record the results. These testing and reporting capabilities, including the ability to compare different runs of load tests against each other, are where the rubber meets the road from an observability perspective.

The primary metrics highlighted on the Summary page of a given report are:

- Test Start / End / Duration
- Maximum Concurrent Users
- Hits per Second (Average Throughput)
- Error %
- Response Times: Overall Average and 90th Percentile
- Average Network Bandwidth (Kilobytes per Second)

These metrics provide a bird's-eye view of system performance.

As mentioned, metrics are the first step toward achieving observability. To gain a more detailed understanding of system performance, the graphical Timeline report provides much more information. This report shows how the metrics for specific calls behave over time. Keep in mind that you can observe response time averages, minimums, maximums, and percentile statistics here as well.

If you want to do deeper analysis, the tabular form of this data can be downloaded from the Request Stats page of the report. In that data, notice how the average response time is typically much lower than the 90th or 95th percentile. This indicates a wide spread in responsiveness.

Too often we look only at hits per second and average response times, and fail to see that the same transaction performs differently across users. Be sure to take full advantage of the observability that BlazeMeter's reporting capabilities give you.

## Have We Achieved System Performance Observability?

Performance testing metrics give us visibility into the system. However, if the test is not meeting the goals that have been established, that is an opportunity to dig deeper into the data and reach a more comprehensive understanding of each component's behavior.

We will divide this quest into three parts:

1. Ensuring BlazeMeter isn't the bottleneck
2. Speed bump analysis
3. Third-party services

### 1. Ensure BlazeMeter Is Not the Bottleneck

The first step to understanding a system is to make sure you are really inspecting your system, and not an external application. If your load engines are overwhelmed, they may not be able to drive the test efficiently, so make sure your BlazeMeter engines are not skewing your test results.

Ensure your load engines run at no more than 75% CPU or memory usage, as checked in the sketch below. If they exceed that, understand what is driving it. You may have to lower the targeted number of users per engine, or alter the number of labels managed by a given test. Here are links to good information about calibration:

- Calibrating a JMeter Test
- Calibrating a Taurus Test
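As a rough illustration of that 75% rule of thumb, the following Python sketch uses the third-party psutil library to sample CPU and memory on a load generator. It is a local approximation only; BlazeMeter surfaces engine health in its own reports.

```python
import psutil  # third-party: pip install psutil

CPU_LIMIT = 75.0  # percent, per the guidance above
MEM_LIMIT = 75.0

cpu = psutil.cpu_percent(interval=1)   # CPU use over a one-second sample
mem = psutil.virtual_memory().percent  # share of physical memory in use

print(f"cpu={cpu:.0f}%  mem={mem:.0f}%")
if cpu > CPU_LIMIT or mem > MEM_LIMIT:
    print("Engine overloaded: lower users per engine or split labels across tests.")
```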
### 2. Speed Bump Analysis

If your test results show issues, we recommend spending the time to understand them. You really want to know what, if anything, is holding up the processing of requests in this precious test environment. It is better to learn about the issues here than in production, when real users and real business transactions are stalled or lost.

Review the metrics reported by BlazeMeter on the Timeline and Request Statistics pages of the report. Which are the biggest drivers? As mentioned above, don't just look at averages; look at 90th percentile and maximum values as well.

#### Understanding Errors to Achieve System Observability

Error reporting in BlazeMeter is, by design, at a summary level. It provides overall counts and types of response codes, but it does not provide the details of individual requests and responses. To understand the errors more clearly, you should know how to dig deeper for those details.

For example, suppose we have a run with nearly 20% errors. On the Errors page of the report, you have options to group errors by Label, Response Code, or Assertion Name. You can look at the errors from any of these perspectives, but you can only go so deep: you can drill down and see, say, errors with 304 and 403 response codes, but you still cannot tell exactly what was called.

To see the actual calls and responses, go to your logs and look inside the artifacts.zip file for the error.jtl file:

1. Download the artifacts.zip file, unzip it, and confirm it contains the error.jtl file.
2. Use JMeter to read the error.jtl file: open JMeter and create a View Results Tree listener.
3. In the listener's "Write results to file / Read from file" section, provide the path and filename of the file, or click Browse to search for it.
4. Once JMeter has read the file, the red labels show which calls resulted in an error.
5. Select a sample to confirm the request that was made and to see exactly what was returned in the response.

This should give you the information about a given request and why it failed. A scripted alternative is sketched below.
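If you prefer to triage errors from the command line instead of a View Results Tree, a short script can summarize error.jtl. The sketch below assumes the file is in JMeter's XML format (JTL files can also be CSV); lb, rc, and rm are JMeter's standard attributes for label, response code, and response message.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Parse the error.jtl extracted from artifacts.zip (XML-format JTL assumed).
tree = ET.parse("error.jtl")
failures = Counter()

# Top-level children are <httpSample> or <sample> elements, one per result.
for sample in tree.getroot():
    label = sample.get("lb", "unknown")  # sampler label
    code = sample.get("rc", "???")       # HTTP response code
    message = sample.get("rm", "")       # response message
    failures[(label, code, message)] += 1

# Print the most frequent failure signatures first.
for (label, code, message), count in failures.most_common():
    print(f"{count:5d}  {code}  {label}  {message}")
```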
### 3. Third-Party Services

When you are dealing with a third-party service, you are typically calling an endpoint over which you have little control. Put yourself in a good position to work with your service provider by being able to nail down the frequency and performance of your calls. Provide the right information so that whoever supports the service has a chance of figuring out what, if anything, is wrong. Don't just call them up and say "it does not work." Explain what was attempted and what the response was.

Be ready with the following information:

- The time of the test, including the beginning, the end, and the timezone, so the provider is looking at exactly the timeframe you are interested in.
- Any specific windows of time within the test when performance deviated from an acceptable range.
- The number of calls during the test, as well as their characteristics. Are they a mix of typical calls, or are they hardcoded to a single option or just a few?
- The response time metrics generated by BlazeMeter, both for the total test and for identified trouble spots. Be able to share:
  - Samples
  - Average response time
  - Average hits/sec
  - 90th percentile response time
  - Minimum and maximum
  - Average bandwidth
  - Error percentage

Are all requests having issues, or do the issues come in a certain cycle, or at random? Knowing the information at this level should shed a strong light on where the issue may be coming from. It will also provide insights into the systems you do have access to.

Mock Services from BlazeMeter also provide an elegant way to test against unavailable services. You can virtualize the parts of the system that are not under test, or not available (e.g., still in development), and get discrete insight into the quality and performance of what you are testing. Mock Services realistically simulate the real-world behavior of a service, so you can exercise your app under both good and difficult conditions: happy paths as well as negative responses (slow response times, incomplete content, unexpected errors, or even chaotic behavior). For more information, see the Mocks tab under the product section of BlazeMeter.com.
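To illustrate the idea of service virtualization, here is a toy Python stand-in for an unavailable dependency that mixes happy-path responses with negative ones. It is only a sketch of the concept; BlazeMeter Mock Services provide this as a managed, far more capable feature.

```python
import json
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class MockDependency(BaseHTTPRequestHandler):
    """Simulates a service that is still in development or unavailable."""

    def do_GET(self):
        roll = random.random()
        if roll < 0.1:           # negative path: unexpected server error
            self.send_response(500)
            self.end_headers()
            return
        if roll < 0.3:           # degraded path: slow response
            time.sleep(2)
        self.send_response(200)  # happy path: normal JSON payload
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"status": "ok"}).encode())

# Point the part of the system under test at this endpoint instead of the
# real third-party service.
HTTPServer(("localhost", 8080), MockDependency).serve_forever()
```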
## Achieving Deeper System Observability

You can only go as deep into an analysis as you have metrics to support the research. To dig deeper and understand why something may be taking longer than anticipated, we need more advanced metrics. These can come from APM tools and other metric sources, and they can all be exported into BlazeMeter reports.

### APM Metrics

You can install an APM tool within your environment to monitor the detailed health of:

- Web servers
- Application servers
- Database servers

Further still, you can install a network monitoring tool to watch the traffic between components.

Using these types of tools typically requires a significant investment, both financially and in time, before you realize the benefits. The investment is ongoing, because the application will continue to grow and change over time (and produce even more data that needs to be analyzed).

### Key APM Metrics

When dealing with APM data, the biggest drivers of poor performance typically come from the following:

- Servers
  - CPU usage over 80%: if the server or process is that busy, it is a challenge to take on the next request.
  - Memory paging: if memory is overused and chunks of it have to be committed to disk, this is a real speed bump.
  - Excessive disk usage.
- Java or .NET processes
  - Garbage collection: excessive major garbage collection is a stop-the-world event.
  - Excessive CPU usage by the process itself.
- Databases
  - Locking / blocked threads.
  - Excessive deadlocking.
  - Table scans (a lack of effective indexes).

### Current APM Integrations

To see which APM integrations are currently supported, go to guide.blazemeter.com and, using the navigation pane, look under Integrations & Plugins > APM Integrations.

## Alternatives to System Observability

If you are using an APM that does not have an out-of-the-box integration with BlazeMeter, you can use a BlazeMeter API to import time-series data into a BlazeMeter report. If you do not have an APM at all, you can leverage data from other sources, such as Splunk, or even native operating system commands that monitor your application environment, and import that data with the same API.

### Importing External Metrics into BlazeMeter Reports for Deeper Analysis

Let's say you are trying to analyze data from multiple sources, and you do not have one of the APMs that BlazeMeter supports out of the box. You may want to import metrics from other sources into your BlazeMeter report so you can document and analyze everything in one place. You can conceivably import any number of metrics via an API that is described here.

Likely candidates for this type of activity are information from logs and data from native system monitoring tools.

### Log Data

Splunk and Datadog are popular tools for monitoring system and application logs for events as well as uptime. By looking at logs we can identify:

- Startup and shutdown.
- Event information. Major garbage collection is an example of an event that can affect application performance. Periodic batch jobs can affect it as well: jobs like "accounting end of day" or "purge abandoned shopping carts" may raise the probability of locking or contention for resources.

### Native System Monitoring

Operating system performance metrics can be obtained with native commands, and the resulting samples can be imported as described above (see the sketch after this list). Typical commands for this purpose:

- Windows system performance: perfmon
- Unix performance monitoring: nmon, vmstat, iostat
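As an example of turning native monitoring output into importable time-series data, the sketch below runs vmstat on Linux and converts its CPU idle column into timestamped samples. The exact BlazeMeter import endpoint and payload are not reproduced here; consult the API documentation referenced above.

```python
import subprocess
import time

INTERVAL, COUNT = 1, 5  # one sample per second, five samples

# Run vmstat (Linux) and capture its plain-text output.
raw = subprocess.run(
    ["vmstat", str(INTERVAL), str(COUNT)], capture_output=True, text=True
).stdout
lines = raw.strip().splitlines()
columns = lines[1].split()        # second header line holds the column names
idle_index = columns.index("id")  # "id" = CPU idle percentage

# Approximate a timestamp for each row; vmstat does not emit one itself.
start = time.time() - INTERVAL * COUNT
samples = [
    {
        "timestamp": int((start + i * INTERVAL) * 1000),  # epoch millis
        "cpu_busy_percent": 100 - int(row.split()[idle_index]),
    }
    for i, row in enumerate(lines[2:])
]

# These dictionaries are the kind of time-series points you would then post
# to the BlazeMeter metrics-import API (endpoint omitted; see the docs above).
for sample in samples:
    print(sample)
```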
## Conclusion and Next Steps

When it comes to observability, be mindful of where you get your data and how you interpret it. You can only go as deep as you have metrics to support an analysis. Be aware that you have options to dig deeper with APM and other data sources. And when you are dealing with a system you have no visibility into, make sure you give your service provider the right information to make clear what you are expecting and what you are getting in return.

**Todd Hall, Customer Success Manager, BlazeMeter**

Todd brings over 39 years of extensive IT management experience across multiple disciplines, including database administration, architecture, and program management. Todd uses his knowledge to provide extraordinary support to BlazeMeter and BlazeRunner customers.