For the Soul
Harvest time—vegetables fresh from the local farmers’ market going right into a homemade soup, eaten with fresh bread. Yum! This month, I’d like to talk about a recipe for a Greek alphabet soup, which you had better know about if you plan on doing an experiment. This may be the first recipe that saves your bacon rather than using it up.
What in the world am I talking about? Alpha, beta, delta, and sigma—that’s right: sample sizes. In my experience, this is a very frequently misunderstood topic that can lead to very bad decisions.
Let me give you an example.
Once upon a time, I was standing in line in a fast-food joint near a major client. A couple of engineers who worked for the business walked up and said, “Hey, you’re one of those consultants, right?”
It’s never a good idea to answer this question “Yes,” but throwing caution to the wind, I said, “Maybe.”
It turned out these two were in charge of making a change to their process that would yield some nice cost savings for the business. The concern was that the change might also affect the strength of the product.
“So we set up the new process, pulled ten sample units, and then tested the strength. We did a t-test and found no significant effect. So this means that we can safely make the change, right?”
“Well, maybe, maybe not.”
After grousing about how consultants ought to be ahead of the lawyers when the revolution comes, I got a word in edgewise.
“Listen, it depends on your sample size calculations. How did you calculate that ten units were needed?”
When one of the engineers wiggled 10 fingers at me, I knew that I had been right in not making any assumptions.
So rather than getting anything to eat, I sat down with them to do some work (a common occurrence at this customer).
“So,” I queried, “do you have historical evidence that the process is normally distributed?”
They said they did, so we could stick with a one-sample parametric t-test. (This test is pretty robust to departures from normality, but if we violated the assumptions we could always use the nonparametric sign test for location at the cost of more samples.)
“OK,” I said, “What did you choose for your alpha error?”
“Ah ha! I know this one,” said the one on the left. “We chose 0.05. We want to run only a 5 percent risk of concluding there is a change in strength when there really isn’t.”
“Good,” I said, “And what did you choose for beta?”
Blank stares all around.
“OK,” quoth I, “What amount of change in the average strength would you want to be able to detect if it is there?”
Back into a comfortable space, they said that if the shift in average was more than 500 psi they would want to know about it—any less wouldn’t be big enough to care about.
“All right, we will use 500 psi as our delta, also known as effect size. Now if the shift in average is exactly 500 psi, what risk do you want to take that you miss it?”
After the requisite, “0 percent” and “OK, then use an infinite sample size” conversation, they settled on a beta of 5 percent.
“Right. Now, what is your historical standard deviation?”
Again they knew that one, so sigma (the statistical sigma, not the “sigma” index) was 1500 psi.
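The “infinite sample size” quip a moment earlier is literal: holding the other ingredients fixed, the required sample size grows without bound as the beta risk you will accept shrinks toward zero. Here is a minimal sketch of that trade-off, assuming Python with the statsmodels package is available (my choice of tool, not the engineers’); any power-analysis software will show the same pattern:

```python
# The smaller the beta risk you accept, the more samples you need;
# as beta goes to zero (power goes to one), the required n grows without bound.
from statsmodels.stats.power import TTestPower

effect_size = 500.0 / 1500.0  # delta / sigma from the story
for power in (0.80, 0.90, 0.95, 0.99):
    n = TTestPower().solve_power(effect_size=effect_size,
                                 alpha=0.05, power=power)
    print(f"power {power:.0%} -> about {n:.0f} samples")
```

Each step up in power costs noticeably more samples, which is why “0 percent beta” is not on the menu.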
“All right, let’s sum up.” And I wrote the following on a napkin:
α (Type I error) = 0.05
β (Type II error) = 0.05
Δ (effect size) = 500 psi
σ (standard deviation) = 1500 psi
“So if that is what you want, and since we can’t assume that the standard deviation will remain at 1500 psi with the new process, you needed to run…38 samples, not 10.”
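For readers who want to reproduce this kind of napkin math, here is a minimal sketch using Python’s statsmodels (an assumption on my part; any power-analysis tool will do). Bear in mind that the exact sample size a package reports depends on the test chosen, one- versus two-sided alternatives, and rounding, so it may not match a napkin figure exactly:

```python
# Napkin ingredients -> required sample size for a one-sample t-test.
from statsmodels.stats.power import TTestPower

alpha = 0.05    # Type I risk
power = 0.95    # 1 - beta, with beta = 0.05
delta = 500.0   # smallest shift worth detecting, psi
sigma = 1500.0  # historical standard deviation, psi

# Power routines work in standardized effect size: delta / sigma.
n = TTestPower().solve_power(effect_size=delta / sigma,
                             alpha=alpha, power=power,
                             alternative="two-sided")
print(f"required sample size: about {n:.0f}")
```

Whatever the software reports, the point stands: the answer comes from alpha, beta, delta, and sigma, not from counting fingers.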
At that point, you might have expected them to be shocked or chagrined, but no, both of them got a very insufferable look of superiority.
“Ahh, but you have forgotten that we did the t-test and found no effect, so we didn’t need to run that many after all, did we?” Then they high-fived each other for their cleverness in getting away with fewer samples than the consultant said.
I hate it when people start a sentence with “actually,” but there was no choice in that case.
“Actually, that is the problem. With a sample size of 10, if there were a shift of 500 psi, you would only detect it about 12.9 percent of the time. Basically, if there really were a shift of that amount, you would be able to detect it more frequently by flipping a coin than by doing the testing you did.”
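That detection rate is just the power of the test they actually ran, and it can be checked directly. A sketch with statsmodels (again an assumed tool; the exact percentage depends on the software’s conventions, so it may differ somewhat from the figure quoted above):

```python
# Power of the engineers' actual test: n = 10, two-sided alpha = 0.05,
# for a true shift of 500 psi when sigma is 1500 psi.
from statsmodels.stats.power import TTestPower

power = TTestPower().power(effect_size=500.0 / 1500.0, nobs=10, alpha=0.05)
print(f"chance of detecting a 500 psi shift: {power:.1%}")
```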
Ooooo, they didn’t like hearing that. Cold sweat popped out on their foreheads.
“Wha…what do you mean?”
I told them that when you do a statistical hypothesis test, you can make two kinds of errors. A Type I error means incorrectly concluding that there was a change when there wasn’t. A Type II error means missing a change that really is there. If you set up an experiment without understanding beta error, you are likely to settle on a sample size that has a large chance of missing a real effect of the size you care about detecting.
“You can’t prove that there is no change, you can only say that, as far as we can tell, there is no change above that due to chance given the sample size (and the other assumptions).” (Aspiring consultants, you must learn how to speak in italics and parenthetically.) “So, while you did not make a Type I error, if you did affect the process average by delta, you made a beta error. Or,” I paused, “you might have correctly concluded that there was no change. One or the other.”
On another napkin (the store manager was beginning to get annoyed, especially considering I hadn’t bought a sandwich yet) I drew this:
Decision \ Truth           No change                    Change of at least Δ
Accept null (no change)    Correct (1 – α, confidence)  Type II error (β)
Reject null (change)       Type I error (α)             Correct (1 – β, power)
“See, once you have made a decision to accept the null hypothesis that there is no change, you are obviously either right or wrong. The problem is that if there really were a change in the average, you didn’t run a good chance of detecting it because your sample size was too small.”
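One way to make that accept/reject table concrete is to simulate it. The sketch below assumes Python with numpy and scipy: it draws many samples of 10 from a process whose mean really has shifted by 500 psi and counts how often a t-test flags the change. The percentages are simulation estimates, not exact values:

```python
# Monte Carlo check of the 2x2 table: draw many n=10 experiments from a
# process whose mean truly shifted by 500 psi, and count how often a
# two-sided one-sample t-test (alpha = 0.05) actually flags the change.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, shift, sigma, alpha, trials = 10, 500.0, 1500.0, 0.05, 20_000

samples = rng.normal(loc=shift, scale=sigma, size=(trials, n))
# H0: the process mean is still 0 (no change from the historical mean).
result = stats.ttest_1samp(samples, popmean=0.0, axis=1)
power_hat = float(np.mean(result.pvalue < alpha))

print(f"shift detected in {power_hat:.1%} of simulated experiments")
print(f"missed (Type II error) in {1 - power_hat:.1%}")
```

Rerunning with `loc=0.0` instead estimates the Type I corner of the table, which should land near alpha.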
They then asked how big a shift would have had to be for their ten samples to give a 95 percent chance of detecting it.
“About 6250 psi.”
“But a shift that big would throw us out of specification!”
“Well, we don’t know whether there was a shift or not, but had there been a shift of 6250 psi, we would still have had about a 5 percent chance of missing it. Smaller shifts, of course, would be increasingly harder to detect.”
Now you are reading this and thinking that I am going to tell you we then ran the larger sample size and found a big shift in the average. Truth is, it wouldn’t even matter if they found no statistical difference. With a sample size of 10, they could have been right to conclude that there was no change in the average. Even so, they were making a change to the process that could have generated huge amounts of scrap and put the company at high risk if they were wrong, all based on a very slim chance of detecting the size of shift they cared about, had it actually been there.
What is the moral of the story? Eating homemade soup is always better than fast food—your clients don’t interrupt your digestion. Or maybe: Never write an article while hungry.
But also: before you even start collecting data, you need to have done the sample size calculations, using the Greek alphabet as ingredients. Accepting the null hypothesis doesn’t in any way prove that there is no change; it only means that you didn’t generate data significantly different from what you would have seen given sampling error and the assumptions of the test. There are sample size calculations for whatever test you are running, and you need to understand alpha, beta, delta, and sigma for all of them.
If you don’t do your sample size calculations, you run the risk of false confidence that the process is unchanged, or of missing the fact that a solution you tried actually worked and then wasting time looking for another one.
That could be worse than forgetting to add garlic to your soup.
But I could be wrong.