For my work as a performance engineer at Hazelcast, I need to deal with quite a lot of statistics. Unfortunately statistics has never been my strongest topic; I really wish I had paid more attention in high school. Some basic understanding is critical if you want to do a proper analysis of test data, and in this post I'll be zooming in on the problem of combining percentiles with means.

For this post I'll be focusing on latency tests: executing requests at a fixed rate and recording the response times. For latency-based tests, mean latency and standard deviation are not very useful; you want to look at percentiles, e.g. p99 or p99.9. In this post we'll focus on the p99.

The test data

For this test we assume the following dataset: there are 200 requests (2 per second) for 100 seconds and the following latency distribution:

190 measurements are 1ms
10 measurements are 400ms: these are equally spread over the 200 requests

p99

The 99th percentile (p99) is 400ms, since 99% of all measured values are equal to or smaller than 400ms. This is easy to see: sort all items from small to large and jump to the 198th measurement (99% of 200), which is 400ms, since measurements 191 to 200 are 400ms.
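The same calculation can be sketched in a few lines of Python. This is a minimal sketch that assumes the nearest-rank definition of a percentile (other definitions interpolate between values) and assumes that "equally spread" means every 20th request is slow; the `percentile` helper and the variable names are made up for illustration.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the value at 1-based rank ceil(n * p / 100)."""
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * p / 100)
    return ordered[max(rank, 1) - 1]

# 200 latencies in ms. Assumption: every 20th request is the slow one (400ms),
# which gives 190 measurements of 1ms and 10 measurements of 400ms.
latencies = [400 if (i + 1) % 20 == 0 else 1 for i in range(200)]

print(percentile(latencies, 99))  # 400
```

The rank for p99 is ceil(200 * 99 / 100) = 198, and the 198th sorted value falls in the block of ten 400ms measurements.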

Click on the following link if you don't trust me.

p99 per 1 second interval

Having a p99 is very useful, but it won't help you if there are fluctuations (e.g. GC pauses) during the test and you want to understand when they happened. For this you need to know how percentiles behave over time. In that case it is useful to determine the latency distribution per interval, e.g. per 1 second, instead of only for the whole 100 seconds.

If we split our test into intervals of 1 second, we get:

10 intervals with 2 measurements; 1ms + 400ms.
90 intervals with 2 measurements; 1ms + 1ms.

If we calculate the p99 for each of these intervals, then we get:

10 intervals with a p99 of 400ms
90 intervals with a p99 of 1ms
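The per-interval numbers above can be reproduced with a small Python sketch. It assumes the nearest-rank percentile definition, assumes every 20th request is the slow one, and groups requests into intervals purely by their position (2 requests per second); all names are illustrative.

```python
import math
from collections import Counter

def percentile(values, p):
    """Nearest-rank percentile: the value at 1-based rank ceil(n * p / 100)."""
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * p / 100)
    return ordered[max(rank, 1) - 1]

REQUESTS_PER_SECOND = 2

# Assumption: every 20th request is slow (400ms), the rest take 1ms.
latencies = [400 if (i + 1) % 20 == 0 else 1 for i in range(200)]

# Cut the run into 100 intervals of 1 second (2 requests each).
intervals = [
    latencies[i:i + REQUESTS_PER_SECOND]
    for i in range(0, len(latencies), REQUESTS_PER_SECOND)
]
interval_p99 = [percentile(chunk, 99) for chunk in intervals]

print(Counter(interval_p99))  # Counter({1: 90, 400: 10})

# The index of an interval is its offset in seconds from the start of the test.
spike_seconds = [second for second, value in enumerate(interval_p99) if value == 400]
print(spike_seconds[:3])  # [9, 19, 29]
```

With only 2 measurements per interval, the nearest-rank p99 of an interval is simply its largest value, which is why a single slow request pushes that interval's p99 to 400ms.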

Each of these intervals is tied to a certain time, so if the first bad interval happens 24 seconds after the test starts, you can determine the absolute time and check, for example, the GC log.

Where it goes wrong: combining mean with percentiles

When you look at an interval p99 plot, it feels kind of natural to determine the overall p99 by taking the mean of these interval p99 values. In this case the mean p99 would be:

(90*1+10*400)/100=40.9ms

This is an order of magnitude less than the actual p99 of 400ms!

The correct way to determine the p99 based on these intervals is to merge the latency distributions of all intervals into a single latency distribution, and then determine the p99 of that.
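To see both approaches side by side, here is a minimal Python sketch that computes the mean of the interval p99 values and the p99 of the merged distribution. It assumes the nearest-rank percentile definition and that every 20th request is the slow one; the helper and variable names are illustrative only.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the value at 1-based rank ceil(n * p / 100)."""
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * p / 100)
    return ordered[max(rank, 1) - 1]

# Assumption: every 20th request is slow (400ms), the rest take 1ms.
latencies = [400 if (i + 1) % 20 == 0 else 1 for i in range(200)]

# 100 intervals of 1 second, 2 requests each.
intervals = [latencies[i:i + 2] for i in range(0, len(latencies), 2)]
interval_p99 = [percentile(chunk, 99) for chunk in intervals]

# Wrong: average the per-interval p99 values.
mean_of_p99 = sum(interval_p99) / len(interval_p99)

# Right: merge all interval measurements back into one distribution first.
merged = [value for chunk in intervals for value in chunk]
merged_p99 = percentile(merged, 99)

print(mean_of_p99)  # 40.9
print(merged_p99)   # 400
```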

In the above example we combined the mean with interval p99 values, but the same mistake is just as easy to make if, for example, you have multiple machines generating load and you want to combine the data of these machines. If you determine the p99 using the mean of the p99 of each of these machines, you get the same calculation error.
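The multi-machine case can be sketched in Python as well. This is a hypothetical scenario, not data from a real test: the 200 requests from the example are assumed to be split over two load generators, and machine B happens to see all 10 slow requests. Each machine keeps a simple histogram (latency value to count), and merging means adding the counts. The function and variable names are made up for illustration.

```python
import math
from collections import Counter

def percentile_from_histogram(histogram, p):
    """Nearest-rank percentile computed from a {value: count} histogram."""
    total = sum(histogram.values())
    rank = max(math.ceil(total * p / 100), 1)
    seen = 0
    for value in sorted(histogram):
        seen += histogram[value]
        if seen >= rank:
            return value
    raise ValueError("empty histogram")

# Hypothetical split of the 200 requests over two load generators.
machine_a = Counter({1: 100})          # 100 fast requests
machine_b = Counter({1: 90, 400: 10})  # 90 fast requests, 10 slow ones

p99_a = percentile_from_histogram(machine_a, 99)
p99_b = percentile_from_histogram(machine_b, 99)

# Wrong: average the per-machine p99 values.
mean_of_p99 = (p99_a + p99_b) / 2

# Right: merge the histograms by adding counts, then take the p99.
merged = machine_a + machine_b
merged_p99 = percentile_from_histogram(merged, 99)

print(p99_a, p99_b)  # 1 400
print(mean_of_p99)   # 200.5
print(merged_p99)    # 400
```

Keeping a mergeable histogram per machine (or per interval) instead of a single precomputed percentile is what makes the correct calculation possible afterwards.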

I hope that after reading this post, you start to become worried as soon as you see percentiles being combined with mean.

Gil Tene

Gil Tene has an excellent presentation that any self-respecting performance engineer should have watched. In it he zooms in on typical problems including coordinated omission, percentiles and all kinds of statistics errors, including combining the mean with percentiles. Gil has also written the excellent HdrHistogram tool to track, for example, latency.

Simple thoughts & complex stuff

Friday, 4 August 2017

Percentiles and mean
