18 Jan 2015

Averages are not good enough (F#)

Let’s (no pun intended) look at a set of response times of a web service.

let responseTimes = [ 18.0; 21.0; 41.0; 42.0; 48.0; 50.0; 55.0; 90.0; ]

People like having a single number to summarize a piece of data. The average is the most popular candidate. The average is calculated by dividing the sum of the input elements by the number of input elements.

let average input =
	let length = input |> Seq.length
	match length with
		| 0 -> raise <| new ArgumentException("Input sequence is empty")
		| _ -> (input |> Seq.sum) / float length 	

// Average = 45.625
// (18 + 21 + 41 + 42 + 48 + 50 + 55 + 90) / 8

The average is a measure of centre which is fragile to outliers; one or two odd irregular values might skew the outcome. The median on the other hand is always representative of the centre, not just when the data distribution is symmetric. The median is determined by sorting the input elements and picking the one in the middle.

let (|Even|Odd|) x =
	if x % 2 = 0 then Even
	else Odd

let median input = 
	let sortedInput = input |> Seq.sort
	let length = input |> Seq.length
	match length with
		| 0 -> raise <| new ArgumentException("Input sequence is empty")
		| 1 -> input |> Seq.nth 0
		| _ -> match length with
			| Even -> ( let first = sortedInput |> Seq.nth (length / 2 - 1)
						let second = sortedInput |> Seq.nth (length / 2)
						(first + second) / float 2)                             
			| Odd -> sortedInput |> Seq.nth ((length - 1) / 2)
				
// Median = 45.0
// 18 21 41 42 48 50 55 90
//          __ __

Both measures are terrible at showing how the data is distributed though. The average and median response time might look fair, but maybe there are a few outliers which are giving a few customers a bad time.

Instead of reducing our input down to a single number, it might be better to extract a table that displays the frequency of various outcomes in the input.

let frequencyDistribution input =
	input |> 
	Seq.groupBy (fun x -> 
		match x with 
		| x when x < 30.0 -> [ 0, 30 ]
		| x when x < 70.0 -> [ 30, 70 ]
		| x when x < 90.0 -> [ 70, 90 ] 
		| _ -> [ 90, System.Int32.MaxValue ] ) |>
	Seq.map (fun (x, y) -> x, y |> Seq.length) 

// Frequency Distribution = seq [
// ([(0, 30)], 2); 
// ([(30, 70)], 5); 
// ([(90, 2147483647)], 1)]

Now this is more useful; we can clearly see that the data is not distributed equally and there are a few outliers in our response times we need to investigate further.

This table takes up quite a bit of ink though. What if we want to get rid of the table, but maintain a feel for the distribution of the data?

The standard deviation measures the amount of variation from the average. A low standard deviation means that the data points are very close to the mean. A high standard deviation indicates that the data points are spread out over a large range of values.
It is calculated by taking the square root of the average of the squared differences of the values from their average value.

let standardDeviation input  =
	let avg = input |> Seq.average
	let x = input |> Seq.map(fun x -> System.Math.Pow(float x - avg, float 2)) |> Seq.sum
	let y = input |> Seq.length |> float
	let variance = x / y
	System.Math.Sqrt variance
	
// Average = 45.625; Standard Deviation = 20.87425148

The standard deviation is even more useful when you put the average at the centre of a graph, lay out the input elements according their deviation of the average and see a bell graph drawn. This means that we can use the empirical 68-95-99.7 rule to get a feel of how the data is distributed.

In statistics, the so-called 68–95–99.7 rule is a shorthand used to remember the percentage of values that lie within in a band around the mean in a normal distribution with a width of one, two and three standard deviations, respectively; more accurately, 68.27%, 95.45% and 99.73% of the values lie within one, two and three standard deviations of the mean, respectively.

For our set of response times, this means that 68.27% of the response times lies within the 24.8 and 66.4 range, 95.45% lies within the 4 and 87.2 range, while 99.73% lies within the -16.8 and 108 range.

When we calculate the standard deviation, we can put one extra number next to the average and derive from just two numbers how the data is distributed.

In conclusion, the mean and the median hide outliers. Looking at the frequency distribution gives you a more complete picture. If we insist on having less data to look at, the standard deviation and the 68–95–99.7 rule can compress the same complete picture into just two numbers.

code

Jef Claes

Averages are not good enough (F#)