This is the third post in our ‘eat your greens’ series – a back to basics look at some of the core concepts of statistics and analytics that, in our experience, are frequently misunderstood or misapplied. In this post we’ll look in more depth at the concept of the standard deviation.
I’m often struck by the fact that people seem to have an intrinsic sense of what a mean value is. The mean is what most people refer to as the ‘average’. As children, we are taught that you calculate an arithmetic average by summing all the values in a group and dividing the result by the total number of individuals in the group.
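As a quick illustration, that childhood recipe looks like this in a couple of lines of Python (the numbers are invented purely for the example):

```python
values = [4, 8, 15, 16, 23, 42]  # a made-up sample of observations

# Sum all the values, then divide by how many values there are
mean = sum(values) / len(values)
print(mean)  # 18.0
```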
Yet if an alien arrived on earth and asked, “what is ‘a mean’?”, most people would struggle to explain it. They might say “it’s the most typical value in a distribution of numbers”. To which the alien might reply “but surely that’s called a ‘mode’?”. Others might say “it’s the most central value in a distribution”. Again, the alien might say “I thought that was more like a ‘median’?”.
The mean might best be described as the ‘gravitational centre of a distribution’. The problem is that although most people have a good instinctive grasp of what a mean value is, that same instinct immediately evaporates when it comes to understanding what a standard deviation is. This is strange, as the standard deviation is just another kind of mean, but one which focuses on how much the individual observations vary, on average, from the actual mean in a data sample.
Because the standard deviation estimates the average ‘spread’ of values, in statistics it’s called a ‘measure of dispersion’. And it’s a pretty fundamental tool in data analysis, not least because if I state that the mean amount a large group of diners spend in a restaurant is £34, it doesn’t really tell you anything. It could be that 90% of people spend around £10 and the remaining 10% spend over £200. Or it could be that the minimum spend is £30 and the maximum is £38. In the first scenario the standard deviation could be around £74 and in the second, a mere £4. So, by telling us something about the average spread of values, the standard deviation gives us additional useful information about the sample of data that the mean is calculated from.
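To make that concrete, here is a minimal Python sketch of the calculation: take each observation’s deviation from the mean, square it, average the squares and take the square root. The two samples below are invented purely to echo the restaurant scenarios above, and I’ve used the population standard deviation (pstdev) for simplicity; for sample data you would often divide by n − 1 instead (stdev):

```python
from statistics import mean, pstdev

# Two invented samples of restaurant bills in £, both with a mean of £34
skewed = [10] * 9 + [250]                  # most diners spend ~£10, one spends a lot
tight = [30, 30, 30, 30, 38, 38, 38, 38]   # everyone spends between £30 and £38

print(mean(skewed), pstdev(skewed))  # 34, 72.0 -> same mean, huge spread
print(mean(tight), pstdev(tight))    # 34, 4.0  -> same mean, small spread
```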
So why don’t statisticians just call it the ‘mean deviation’? The reason is that there is another, even more important, aspect of this statistical measure. But to understand this property, we need to start thinking of the standard deviation as a unit of measurement. In other words, we need to start thinking in terms of how many standard deviations above or below the mean a value is.
If, for example, the mean is 30 and the standard deviation is 4, then a value of 34 is one standard deviation above the mean. Likewise, a value of 22 is two standard deviations below the mean. Why is this useful? Because the standard deviation as a unit of measurement can tell us, in a given distribution of data, how often we should expect to see values as extreme as, say, 22 or 34, or for that matter 44 or 12. In essence, it can tell us the probability that any value of a given magnitude will occur randomly in a particular distribution.
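In code, converting a raw value into this unit of measurement (usually called a z-score) is a one-line affair. Here’s a small sketch using the figures from the paragraph above:

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

print(z_score(34, 30, 4))  # 1.0  -> one standard deviation above the mean
print(z_score(22, 30, 4))  # -2.0 -> two standard deviations below the mean
```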
When I say ‘particular distribution’, I’m actually referring to a normal distribution. Also called a Gaussian distribution, the normal distribution looks like a symmetrical, bell-shaped curve where the mean, median and mode all occur exactly in the centre. A key property of normally-distributed data such as IQ scores, height and blood pressure measurements is that, even though the data values themselves might be in completely different units and cover a completely different range, the proportion of cases that fall within plus and minus one standard deviation of the mean is almost exactly 68%. Moreover, around 95% of the data values fall within two standard deviations of the mean (or 1.96 standard deviations to be exact).
This means that if a particular case has a value equal to two standard deviations above the mean, we can say that only around 2.5% of cases are likely to have values as extreme as this. In fact, we can calculate how often any value is likely to occur based on the number of standard deviations above or below the mean it lies. This is what makes the standard deviation ‘standard’: it is a tool that allows us to slice distributions like a ‘probability knife’.
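You can check these percentages yourself using the cumulative distribution function of the normal distribution. The sketch below assumes SciPy is available (scipy.stats.norm); the same figures can be derived with nothing but math.erf if you’d rather avoid the dependency:

```python
from scipy.stats import norm

# Proportion of a normal distribution within +/-1 and +/-1.96 standard deviations
print(norm.cdf(1) - norm.cdf(-1))        # ~0.683 -> about 68%
print(norm.cdf(1.96) - norm.cdf(-1.96))  # ~0.950 -> about 95%

# Chance of a value at least two standard deviations above the mean
print(1 - norm.cdf(2))                   # ~0.023 -> roughly 2.5% in the upper tail

# Back to the restaurant: with mean £34 and sd £4, how unusual is a £42 bill?
print(1 - norm.cdf(42, loc=34, scale=4)) # same ~2.3%, because £42 is z = 2
```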
Using this same principle, statisticians have developed a whole portfolio of different distributions that, in a similar way, can be used to estimate the likelihood of excessive insurance claims, child mortality, component part failures or passing a driving test. For those learning about data analysis, knowing how measures like the standard deviation can be applied to estimate likelihood is the gateway to understanding how inferential statistics in general works.