The power of the central limit theorem
Throughout the last couple of articles, I have explained and illustrated that understanding the random sampling distribution (RSD) of a statistic is key to understanding the entire basis of inferential statistics, which is just a fancy way of saying “avoiding career-terminating decisions.” This month I’ll show you how the central limit theorem is your best friend, statistically speaking.
As I have mentioned before, there are four characteristics that we need to know (and often test) about any data set: shape, spread, location, and behavior over time. In the article, “The Omnipotence of Random Sampling Distribution,” I showed you the RSDs for some of the statistics we use to measure shape, spread, and location. (The behavior over time is what control charts monitor.) In my last article, “(Sample) Size Matters,” we made an assumption about the shape of the population we were testing and said it was normally distributed. Although there is a strong theoretical basis for the normal distribution showing up, it certainly is not the only distribution you will see in the real world. So what happens if it is some other distribution?
Let’s say we are interested in the population in figure 1:
Figure 1: Slightly skewed population
The skewness and kurtosis numbers are tested against those expected from a sample size of 450,000 and are highlighted in yellow if the probability of getting that statistic is less than 5 percent, thus rejecting the normal as a good approximation. This population is moderately skewed and leptokurtic, so it is certainly not normal, and we wouldn’t want to approximate it as normal, say for purposes of a capability study, because we would get the wrong results, as illustrated in figure 2. As we see there, the UPL and LPL are where you would say the process natural tolerance is if you assumed normality, but they would clearly give you incorrect probabilities:
Figure 2: Bogus normal limits on the skewed distribution
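If you want to poke at this yourself, here is a minimal Python sketch (assuming NumPy and SciPy) of the same idea. The actual figure 1 data aren’t given, so a gamma distribution stands in as a hypothetical skewed, leptokurtic population:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical stand-in for the skewed population in figure 1:
# a gamma distribution (shape chosen only for illustration).
population = rng.gamma(shape=20.0, scale=1.0, size=450_000)

# Shape tests: with 450,000 points, even mild non-normality gets flagged.
print(stats.skewtest(population))      # rejects normal skewness
print(stats.kurtosistest(population))  # rejects normal kurtosis

# Normal-theory natural tolerance limits (mean +/- 3 sigma) vs. the
# fraction of the population actually beyond them.
mean, sd = population.mean(), population.std(ddof=1)
upl, lpl = mean + 3 * sd, mean - 3 * sd
print((population > upl).mean())  # well above the ~0.135% a normal model predicts
print((population < lpl).mean())  # essentially nothing below the lower limit
```

The tail fractions come out lopsided relative to what a normal model promises, which is exactly the problem figure 2 illustrates.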
Now there is nothing wrong with a non-normal distribution—it is the nature of this process and might even be preferred to a normal distribution. But what if we were interested in testing to see if a new vendor might be able to shift that average up by five points? To calculate the sample size, we need to know the RSD of the means for this distribution, as I described in “(Sample) Size Matters.”
Obviously, the mean of all possible means of size n is going to be the mean of the individuals—it’s just the same numbers rearranged. But what shape will the distribution of means be? It turns out that as the sample size increases, the distribution of the means gets more and more normal. So with non-normal distributions, the sample size needed to detect the change in the average we are looking for also has to be large enough that the RSD is reasonably approximated by the normal distribution. (If you can’t get that many, you can always use a nonparametric test like the sign test for location, at the cost of some power to detect the shift; see the sketch below.)
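Here is a minimal sketch of that fallback sign test in Python (SciPy’s binomtest does the work); the sample and the claimed median are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical small sample from a skewed process; we test whether the
# median differs from a claimed value, with no normality assumption.
sample = rng.gamma(shape=20.0, scale=1.0, size=12)
claimed_median = 19.7  # the gamma(20) median is about 19.67

n_above = int(np.sum(sample > claimed_median))
n_used = int(np.sum(sample != claimed_median))  # drop exact ties

# Under the null, each observation has a 50/50 chance of landing above
# the median, so the count above is binomial(n, 0.5).
print(stats.binomtest(n_above, n=n_used, p=0.5).pvalue)
```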
How big does the sample size used to calculate the average need to be for the RSD of the means to be normal? The statisticians’ favorite answer… it depends.
For moderately non-normal distributions, five, 10, or 15 samples should be just fine. For the most extreme distributions, you may need more.
In our example above, figure 3 shows what happens to the skewness and kurtosis of the RSD of the means as the sample size goes up:
Figure 3: What happens to the skewness and kurtosis of the RSD of means as the sample size goes up
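You can reproduce the pattern in figure 3 by brute force. This sketch again uses the gamma stand-in, since the original population isn’t given:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
population = rng.gamma(shape=20.0, scale=1.0, size=450_000)  # skewed stand-in

# Brute-force the RSD of the means: draw many samples of size n, average
# each one, then measure the shape of those averages.
for n in (2, 5, 10, 15, 30):
    means = rng.choice(population, size=(50_000, n)).mean(axis=1)
    print(n, round(stats.skew(means), 3), round(stats.kurtosis(means), 3))
# Skewness of the means shrinks like 1/sqrt(n); excess kurtosis like 1/n,
# which is why the kurtosis problem disappears first.
```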
As you can see, pretty quickly the kurtosis is eliminated as a problem for using the normal approximation. Skewness lingers on in ever-decreasing amounts. However, we are still failing the statistical tests, because at such a massive sample size even a trivial amount of leftover skewness gets flagged. Is the normal approximation a reasonable (if not exact) model for this RSD? Let’s look at normality tests on a random sample of 1,000 from this RSD, as shown in figure 4:
Figure 4: Normality tests on a random sample of 1,000
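A rough equivalent of that check in Python, again with the gamma stand-in:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
population = rng.gamma(shape=20.0, scale=1.0, size=450_000)

# 1,000 draws from the RSD of the means at n = 15.
means = rng.choice(population, size=(1_000, 15)).mean(axis=1)

# Whether these pass depends on the draw and on how skewed the stand-in
# population is, but at n = 15 they typically do pass at alpha = 0.05.
print(stats.skewtest(means))
print(stats.kurtosistest(means))
print(stats.normaltest(means))  # omnibus test combining both
```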
So with means of 15 samples, even 1,000 data points pass as normal (though of course, the RSD really is still a teensy bit skewed).
Figure 5 shows a histogram at n = 15 for those 1,000:
Figure 5: Random sampling distribution of the mean of sample size 15 for the skewed population with the normal distribution superimposed
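To draw a figure-5-style picture yourself (assuming Matplotlib is available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(5)
population = rng.gamma(shape=20.0, scale=1.0, size=450_000)
means = rng.choice(population, size=(1_000, 15)).mean(axis=1)

# Histogram of the simulated RSD of the means with a normal curve
# (matched to the sample mean and standard deviation) superimposed.
plt.hist(means, bins=30, density=True, alpha=0.6)
x = np.linspace(means.min(), means.max(), 200)
plt.plot(x, stats.norm.pdf(x, loc=means.mean(), scale=means.std(ddof=1)))
plt.xlabel('mean of 15 samples')
plt.show()
```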
The upshot is that if I am making a decision based on assuming the RSDs of the means are normal, I probably am not far off.
What about the most extreme example?
Here in figure 6 is a population that is exponentially distributed:
Figure 6: Exponentially distributed population
(You run into the exponential distribution fairly frequently in real life. For example, a serious injury rate might follow a Poisson distribution, and if so, the time between serious injuries would be exponential.)
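A quick simulation makes that connection concrete; the one-injury-per-10-days rate is hypothetical, chosen to mirror the population in figure 6:

```python
import numpy as np

rng = np.random.default_rng(6)

# If serious injuries arrive at a constant average rate (a Poisson
# process), the gaps between them are exponential. The rate here is
# hypothetical: one injury per 10 days, mirroring figure 6.
gaps = rng.exponential(scale=10.0, size=1_050_000)

print(gaps.mean())       # close to 10
print(gaps.std(ddof=1))  # also close to 10: mean equals sd for an exponential
print(gaps.max())        # a long right tail, on the order of the 153 in the text
```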
You can’t see it, but with a sample of 1.05 million, the high is 153, so that is pretty skewed. The average is 10, as is the standard deviation. (Weird fact: An exponential distribution with a lower bound of zero has a standard deviation equal to the mean.) Obviously, assuming this is a normal distribution would be a bad idea (see figure 7):
Figure 7: Exponentially distributed population totally bogusly approximated by the normal distribution
It would be a bad idea because I would have a fairly high proportion of negative times between injuries, which I guess means that injuries are occurring before they happen. This means either I have invented a dangerous time machine, or I screwed up the assumption of normality.
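You can put a number on “fairly high” directly:

```python
from scipy import stats

# Under a Normal(mean=10, sd=10) model, the probability of an impossible
# negative time between injuries:
print(stats.norm.cdf(0, loc=10, scale=10))  # about 0.159, i.e., roughly 16%
```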
To get an RSD of the means that is normal, we are going to have to take more than five samples, I bet.
In figure 8, we see a similar pattern for the RSD of the means as we did before:
Figure 8: A similar pattern as before for the RSD of the means
I even made an animation (in figure 9) to watch the changes in the shape of the RSD as the sample size increases. (OK, I am so totally a stats geek.)
Figure 9: The effect of increasing sample size on the RSD of the means from an exponential distribution
Let’s do a similar exercise as before. Let’s take 100 random samples from the RSD of the means for different sample sizes and test for normality. (If we take 1,000 like before, nothing passes the skewness test.) We get the following results for distribution shape (see figure 10):
Figure 10: Results for the distribution shape
Yellow boxes fail at α = 0.05, which is the Type I error I advise for distribution shape testing. So for these random samples, we pass the normality tests when we calculate the mean from 10 samples. That is probably too small a sample size to be reliable, though, so I would take means from 20 or more samples in real life. Usually by a sample size of 30, pretty much any distribution’s RSD of the means looks normal enough to be helpful.
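Here is one way to run that experiment. With only 100 means the shape estimates are noisy, so the exact n at which the tests first pass will bounce around from run to run (and from the article’s figure 10):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# For each candidate n, draw 100 means of n exponential values and run
# the shape tests. With only 100 means the estimates are noisy, so the
# n at which the tests first pass varies from run to run.
for n in (2, 5, 10, 20, 30):
    means = rng.exponential(scale=10.0, size=(100, n)).mean(axis=1)
    print(n,
          round(stats.skewtest(means).pvalue, 3),
          round(stats.kurtosistest(means).pvalue, 3))  # "fails" where p < 0.05
```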
Here in figure 11 is the distribution of the population and the RSD of the means with n = 20:
Figure 11: The exponentially distributed population and its RSD of the means of 20 samples (to scale)
If I were to take more than 100 random samples from those RSDs, we would still see remnants of the skewness, which would result in failing the tests for normality, first on the RSDs of the smaller sample sizes and then on the larger ones. But as George Box famously (famous among statisticians, anyway) said, “Essentially, all models are wrong, but some are useful.” If an RSD of the means is normal enough to pass the skewness and kurtosis tests with 100 samples, it is probably close enough to be useful in making decisions.
For example, if I am interested in finding out if my project has significantly decreased the average time between serious injuries, I can use the relationship of the RSD of the means back to the actual population average to find out.
Because the averages of n = 20 fall on a nice (almost) normal distribution, all I have to do is take a single sample of 20 (presuming 20 is a large enough sample size to notice the effect size that I am looking for) and test that mean against the original mean using a t-test, even though the individuals are distributed as an exponential.
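In Python that test is one line; the “new vendor” sample below is simulated with a hypothetical shifted mean of 12 just to have something to test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Hypothetical check: has the mean time between injuries moved off the
# historical 10 days? The individuals are exponential, but the mean of
# 20 of them is close enough to normal for a t-test to be sensible.
new_sample = rng.exponential(scale=12.0, size=20)  # pretend the true mean shifted to 12

print(stats.ttest_1samp(new_sample, popmean=10.0))
```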
Check this out. The final prediction of the central limit theorem is that the standard deviation of the RSD of the means is related to the standard deviation of the individuals like this:

σ(RSD of means) = σ(individuals) / √n
The standard deviation of the population was 10, and our means came from samples of n = 20, so we would expect to see a standard deviation of the RSD of the mean like so:

σ(RSD of means) = 10 / √20 ≈ 2.236
And we see 2.2107 with 15,000 sampled means. Not too far off for a population that started off as exponential. Larger sample sizes, of course, end up being closer to the theoretical value.
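One last sketch to close the loop, rebuilding the RSD of the means from scratch and comparing it to σ/√n:

```python
import numpy as np

rng = np.random.default_rng(9)

# Empirical check of sigma(means) = sigma(individuals) / sqrt(n) at n = 20.
n = 20
means = rng.exponential(scale=10.0, size=(15_000, n)).mean(axis=1)

print(means.std(ddof=1))  # lands near the 2.2107 reported in the text
print(10.0 / np.sqrt(n))  # theoretical value: about 2.2361
```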
This is massively useful in real life, since we do often encounter non-normal distributions, but we still have to make decisions based on samples from them. Of course, if the population could be normally distributed, we first check to see if that is a reasonable approximation. But even if it is not, the power of the central limit theorem is there to help us make reasonable decisions.