
Monitoring with Percentiles

09/07/2018

Which metric is best for performance monitoring: averages or percentiles? Statistically speaking, there are many ways to determine how good an overall experience your application is providing. Averages are widely used. They are easy to understand and calculate; however, they can be misleading.

This blog is about percentiles. Percentiles are part of the recent 7.0 version of the ADF Performance Monitor. I will explain what percentiles are and how they can be used to better understand the performance of your ADF application. Percentiles, when compared with averages, tell us how consistent our application response times are. Percentiles make good approximations and can be used for trend analysis, SLA monitoring, and day-to-day evaluation and troubleshooting of performance.

How averages can be misleading

We can draw the wrong conclusions from averages. For example, let’s assume the average monthly salary of a worker in a certain country is around 2,000 US dollars, which does not seem too bad. However, when looking closer we find that the majority in this country, 9 out of 10 people, are migrant workers who earn only around 1,000 US dollars a month, while 1 out of 10 (the local inhabitants) earns around 11,000 US dollars a month. This is oversimplified, but you get the idea. If you do the calculation, (9 × 1,000 + 1 × 11,000) / 10, you will see that the average is indeed 2,000, but it is clear that this does not represent a realistic ‘average’ salary. The same applies to statistically monitoring application performance and SLAs: very high values have a disproportionate influence on the average. In reality, most applications have a few very heavy outliers that skew the averages far too much.
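The salary example can be checked with a few lines of Python. This is only a minimal sketch using the numbers from the paragraph above; the variable names are chosen for illustration and have nothing to do with the ADF Performance Monitor itself.

```python
from statistics import mean, median

# Monthly salaries in US dollars from the example above:
# 9 migrant workers earning 1,000 and 1 local inhabitant earning 11,000.
salaries = [1000] * 9 + [11000]

average = mean(salaries)    # pulled upwards by the single high salary
typical = median(salaries)  # the 50th percentile: what the middle earner makes

print("average:", average)
print("median (50th percentile):", typical)
```

The average comes out at 2,000 while the median is 1,000, which is what 9 out of 10 people actually earn.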

Percentiles explained

When you want to know how your application is performing from a high-level perspective, it is useful to understand the concept of percentiles. A percentile is a measure used in statistics that indicates the value below which a given percentage of the observations in a group fall. For example, the response time of an HTTP request below which 90% of the response time values lie is called the 90th percentile response time. In the screenshot below this is 3.0 seconds, so 90 percent of the requests are processed in 3.0 seconds or less:

 

To obtain the 90th percentile response time for a certain click action, sort the response times of all the requests initiated by that click action in increasing order. Take the first 90% of this sorted set. The maximum response time in that subset is the 90th percentile value for the click action's requests.

Suppose 10 HTTP response time values are available for a click action: 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10 seconds. After sorting, if I take the first 90 percent of the values as a separate set, I get: 1, 2, 3, 4, 5, 6, 7, 8 and 9. Here 9 is the maximum value, and hence 9 seconds is the 90th percentile value of that click action.
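The procedure described above (sort, take the first 90%, use the maximum of that subset) is commonly called the nearest-rank method. Here is a minimal Python sketch of it. The function name `percentile_nearest_rank` is made up for this illustration, and this is not necessarily how the ADF Performance Monitor calculates its percentiles internally.

```python
import math

def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be greater than 0 and at most 100")
    ordered = sorted(values)
    # 1-based position of the last value inside the first pct% of the sorted set.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# The ten response times (in seconds) from the example above.
response_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
p90 = percentile_nearest_rank(response_times, 90)
print("90th percentile:", p90)
```

For the ten values in the example this gives 9 seconds, the same result as the manual calculation.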

Of course, we want as many of our HTTP requests as possible to have a very fast response time; in an ideal world the 50th, 95th, 99th and even the 100th percentile would all be as low as possible.

Percentiles in the ADF Performance Monitor

Look at the percentiles chart (bottom right) in the month overview of June 2018:

The ADF Performance Monitor shows the average response time in blue, and the 50th, 90th, and 95th percentile plotted in black, grey and light grey:

The x-axis shows the day numbers of June 2018, and the y-axis the HTTP response time in seconds.

We can see the following patterns:

What does this tell us?

Month Overview – Active Users and Sessions

The ADF Performance Monitor also has a new chart on active end-users and HTTP sessions. It is very useful for evaluating the number of end-users and sessions that are active on a managed server, or on all managed servers together. Later we can compare these values to all the other metrics in the ADF Performance Monitor, such as JVM metrics, SLA metrics and time spent in layers, and now also to percentiles:

The x-axis shows the day numbers of June 2018, and the y-axis the number of active sessions and end-users:

We can see the following patterns:

Trend Analysis

We can use percentiles for all kinds of performance evaluations, in particular for regression and trend analysis after new releases. Did we really improve the performance or not? Sometimes performance improves or degrades after a new release, and it is useful to have the visibility to see and recognize this. In the ADF Performance Monitor, especially in the month overviews, you can see this at a glance. If the performance did improve, the 50th, 90th and 95th percentile lines should go down after you bring your performance improvements into production, indicating faster response times:

This is the case in the screenshot shown. A new release with intended performance improvements was brought to production on the 17th of June. In the remaining days of June, the average response time and the 50th, 90th and 95th percentiles all went down, indicating that the new release did indeed improve performance.
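The same before/after comparison can be sketched in Python. Note that the response times below are made-up illustrative values, not measurements from the screenshot, and that the helper names `percentile` and `summarize` are hypothetical; the sketch assumes the nearest-rank definition of a percentile described earlier.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile: the maximum of the first pct% of the sorted values."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

def summarize(times):
    """Average plus the three percentiles plotted in the month overview."""
    return {
        "avg": mean(times),
        "p50": percentile(times, 50),
        "p90": percentile(times, 90),
        "p95": percentile(times, 95),
    }

# Made-up response times in seconds, for illustration only.
before_release = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.2, 2.5, 3.0, 9.0]
after_release = [0.3, 0.4, 0.4, 0.5, 0.6, 0.7, 0.9, 1.5, 2.0, 8.5]

before = summarize(before_release)
after = summarize(after_release)

# The release counts as an improvement only if every percentile line went down.
improved = all(after[key] < before[key] for key in ("p50", "p90", "p95"))

for key in ("avg", "p50", "p90", "p95"):
    print(f"{key}: {before[key]:.2f}s before, {after[key]:.2f}s after")
print("improved:", improved)
```

Checking the percentiles and not just the average helps here: a single extreme outlier before or after the release can move the average without saying much about what most end-users experience.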

Week, Day, Hour Overviews

In the same way as for the month, the ADF Performance Monitor has end-user/session and percentile charts in the week, day and hour overviews. Here is an example of what a day overview looks like, with metrics from a local demo:

Conclusion

Percentiles, when compared with averages, tell us how consistent our application response times are. When the average response time appears to be extremely high while individual data sets seem normal, percentiles are very useful for analyzing performance without the influence of exceptionally slow requests. Percentiles are excellent for trend analysis, SLA monitoring and day-to-day performance evaluation.
