Monitoring with Percentiles
What is the best metric in performance monitoring – averages or percentiles? Statistically speaking, there are many methods to determine how good an overall experience your application is providing. Averages are widely used: they are easy to understand and calculate. However, they can be misleading.
This blog is about percentiles. Percentiles are a new feature in version 7.0 of the ADF Performance Monitor. I will explain what percentiles are and how they can be used to better understand the performance of your ADF application. Percentiles, compared with averages, tell us how consistent our application response times are. Percentiles make good approximations and can be used for trend analysis, SLA monitoring, and day-to-day performance evaluation and troubleshooting.
How averages can be misleading
Averages can lead us to the wrong conclusions. For example: let's assume the average monthly salary of a worker in a certain country is around 2,000 US dollars (which seems not too bad). When we look closer, however, we find that the majority in this country – 9 out of 10 people – are migrant laborers who earn only around 1,000 US dollars, while the remaining 1 out of 10 (local inhabitants) earns around 11,000 US dollars monthly (this is oversimplified, but you get the idea). If you do the calculation, the average is indeed around 2,000, yet we can all see that this does not represent a realistic 'average' salary. The same applies to statistically monitoring application performance and monitoring SLAs: very high values pull the average up dramatically. In practice, most applications have a few very heavy outliers that influence the averages far too much.
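The salary example above is easy to reproduce in a few lines of Python: the mean lands at 2,000, while the median – the 50th percentile – reveals the far more representative 1,000.

```python
from statistics import mean, median

# 9 migrant workers earning 1,000 and 1 local inhabitant earning 11,000
salaries = [1000] * 9 + [11000]

print(mean(salaries))    # 2000 - the 'misleading' average
print(median(salaries))  # 1000 - what a typical worker actually earns
```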
When you want to know how your application is performing from a high-level perspective, it is useful to understand the concept of percentiles. A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the response time below which 90% of the HTTP response time values lie is called the 90th-percentile response time. In the screenshot below this is 3.0 seconds (so 90 percent of the requests are processed in 3.0 seconds or less):
To obtain the 90th-percentile response time value for a certain click action, sort all response time values of the requests initiated by that click action in increasing order, and take the first 90% of this set. The maximum value in that subset is the 90th-percentile value of the click action's requests.
Suppose for a click action 10 HTTP response time values are available: 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10 seconds. After sorting, if we take the first 90 percent of the values as a separate set, we get: 1, 2, 3, 4, 5, 6, 7, 8 and 9. Here 9 is the maximum value and hence the 90th-percentile value of that click action.
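As a sketch, the 'sort and take the first 90%' procedure described above can be written in a few lines of Python (the function name is mine, not part of the monitor):

```python
def percentile(values, pct):
    """Sort ascending, keep the first pct% of the values,
    and return the maximum of that subset."""
    ordered = sorted(values)
    keep = int(len(ordered) * pct / 100)  # how many values to keep
    return ordered[max(keep, 1) - 1]

response_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # seconds
print(percentile(response_times, 90))  # -> 9
print(percentile(response_times, 50))  # -> 5
```

Note that library implementations (for example `numpy.percentile`) interpolate between values and may return slightly different numbers; the sketch above follows the simple definition used in this blog.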
Of course, we want as many of our HTTP requests to have a very fast response time; so, in an ideal world the 50th, 95th, 99th and even the 100th percentile would be as fast as possible.
Percentiles in the ADF Performance Monitor
Look at the percentiles chart (bottom right) in a month overview of June 2018:
The ADF Performance Monitor shows the average response time in blue, and the 50th, 90th, and 95th percentiles plotted in black, grey, and light grey:
On the x-axis are the day numbers of June 2018, and on the y-axis the HTTP response time in seconds.
We can see the following patterns:
- The 50th percentile of the response time is roughly 1 second (for a certain click action in a web page – for example the SaveEmployeesButton in an HR demo application). This means that 50% of the HTTP requests are processed in 1 second or less.
- The 90th percentile is around 2.75 seconds (90% is processed within 2.75 seconds)
- The 95th percentile tops out at about 3.25 seconds (95% is processed within 3.25 seconds)
- The average response time is around 2.0 seconds (blue line). It peaks on Tuesdays (5, 12, 19 and 26 June) at about 2.5 seconds
- The average response time during the weekends (1.6 seconds) is lower than on weekdays (2.0 seconds).
- On Tuesdays the average response time peaks, while the 50th, 90th and 95th percentiles remain fairly constant.
What does this tell us?
- There are probably a few very slow requests (outliers) that heavily influence the average. In this case it turned out that end-users were running many very slow reports on Tuesdays. Tuesdays were 'reporting days' that distorted the average response time.
- It all depends on your SLA and on how well your ADF application must perform. If having many HTTP requests that respond between 2.0 and 3.25 seconds is acceptable for your application or SLA, then you are probably doing well. In that case there is not much to do except analyze the exceptionally slow requests (the 5% of HTTP requests that take longer than 3.25 seconds) and figure out whether you can make them faster.
- If you need most of your HTTP requests to complete in less than 2.0 seconds, then you know you have a lot of optimization work to do, since many requests take longer than 2.0 seconds.
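To make the SLA discussion concrete, a quick way to check how your measurements relate to a target is to compute the fraction of requests that exceed the SLA threshold (a hypothetical helper, not part of the ADF Performance Monitor, applied here to the 10 example values from earlier):

```python
def fraction_over_sla(response_times, sla_seconds):
    """Fraction of requests slower than the SLA threshold."""
    slow = [t for t in response_times if t > sla_seconds]
    return len(slow) / len(response_times)

response_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # seconds
print(fraction_over_sla(response_times, 2.0))   # 0.8 -> 80% miss a 2-second SLA
print(fraction_over_sla(response_times, 3.25))  # 0.7 -> 70% miss a 3.25-second SLA
```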
Month Overview – Active Users and Sessions
The ADF Performance Monitor also has a new chart of active end-users and HTTP sessions – very useful for evaluating the number of end-users and sessions that are active on one managed server, or on all managed servers together. Later we can compare these values to all the other metrics in the ADF Performance Monitor, such as JVM metrics, SLA metrics, and time spent in layers, but now we also compare them to percentiles:
On the x-axis are the day numbers of June 2018, and on the y-axis the number of active sessions and end-users:
We can see the following patterns:
- Tuesdays are the busiest days, with the most end-users and sessions; we see peaks on 5, 12, 19 and 26 June 2018
- On the busiest day (19 June) there were more than 80 unique HTTP sessions active, and 70 unique end-users.
- On weekends there is very little end-user activity (around 10 unique end-users, around 15 sessions)
We can use percentiles for all kinds of performance evaluations, in particular for regression and trend analysis after new releases. Did we really improve the performance or not? Sometimes performance improves or degrades after a new release – it is very useful to have the visibility to see and recognize this. In the ADF Performance Monitor – especially in the month overviews – you can see this at a glance. If the performance improved, the 50th, 90th and 95th percentile lines should drop after you bring your performance improvements into production – indicating faster response times:
This is what the screenshot shows. A new release with supposed performance improvements was brought to production on the 17th of June. After that, in the remaining days of June, we see that the average response time and the 50th, 90th and 95th percentiles went down – indicating that the new release indeed improved the performance.
Week, Day, Hour Overviews
In the very same way as for the month, the ADF Performance Monitor has end-user/session and percentile overviews at the week, day, and hour level. Here is an example of what a Day overview looks like – with metrics from a local demo:
Percentiles, compared with averages, tell us how consistent our application response times are. When the average response time appears extremely high while individual measurements seem normal, percentiles are very useful for analyzing the performance without the influence of exceptionally slow requests. Percentiles are excellent for trend analysis, SLA monitoring, and day-to-day performance evaluation.