Seeking and finding correlations — co-occurrences between two or more factors — is a key tool in the root cause analysis arsenal. It can lead you to possible causal factors to investigate. But many novice problem-solvers make the mistake of assuming that correlation is causation, that the mere fact of a temporal relationship between factors and symptoms is sufficient to establish the cause.
But correlation is not causation. As an example, consider this: It is a well-known fact that 99.9 percent of all crimes are committed by people who were wearing shoes at the time. But does this mean that, if we banned the sale or use of all shoes, the crime rate would go down?
It is also a fact that people in bathing suits show up on the beach at the same time that flowers in the garden begin to bud. But do the bikinis cause the budding? No.
Both of these correlations are true — these factors do co-occur — but neither is valid as a cause. Something is missing, and what’s missing is the “mechanism of the cause,” the logic that describes how the purported cause causes the observed effect.
One area in which we humans are particularly tempted to make the leap from correlation to causation involves time. Event A occurs, then Event B occurs at the same time or right after it. In situations like this, it is all too easy to go from “A happened just before B happened” to “A must have caused B.” For example, in a recent power blackout in New York City, over 100 people called to confess: “I’m sorry, it was me. The second I turned that darned hairdryer on, the whole city went black. It says it only takes 750 watts, but that can’t be accurate.”
This is clearly silly. But the close correlation between the two actions is so seductive that we tend to jump to cause. It rained on the two days we had loose tablets? It must have been the rain that caused the looseness (even though it also rained on five other days when the tablets were fine). Technician Harry was present on every shift when we produced the defective controllers? Harry must have been involved, somehow (even though he was also present on dozens of days when the controllers were fine). Correlations in location can be tempting, but co-occurrences in time seem to trump our sense of logic quite regularly.
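The tablet example above can be checked with simple counting. The following is a minimal sketch in Python, not a prescribed method: the function name `rate_when_present` and the record layout are illustrative assumptions, and only the day counts (two rainy days with loose tablets, five rainy days without) come from the example itself.

```python
def rate_when_present(days, factor, symptom):
    """Fraction of days with the factor on which the symptom also appeared."""
    with_factor = [d for d in days if d[factor]]
    if not with_factor:
        return None  # the factor never occurred, so there is nothing to compare
    return sum(1 for d in with_factor if d[symptom]) / len(with_factor)


# Counts taken from the tablet example: rain on both loose-tablet days,
# and rain on five other days when the tablets were fine.
days = (
    [{"rain": True, "loose": True}] * 2
    + [{"rain": True, "loose": False}] * 5
)

rate = rate_when_present(days, "rain", "loose")
print(f"Loose tablets on {rate:.0%} of rainy days")  # prints 29%
```

Rain was present on every loose-tablet day, yet tablets were loose on only two of the seven rainy days. Rain alone cannot be the cause; if it plays a part at all, some further condition or mechanism has to explain why those two days differed from the other five.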
We see this quite often in our root cause analysis work with clients. People look for factors that occur at the same time, or to the same extent, as the deviation: “Does this have anything to do with environmental conditions?” “Since the problem tends to occur on second shift, does it have anything to do with crewing and differential skill levels?” “It seems to us that when we run Product A and then run Product B on the same line right afterward, we have high bio-burden on the run of Product B.”
By itself, this is fine. The problem is that many people stop here and just say, “It was the environment … It was operator error … It was some kind of cross-contamination issue.”
So when you hear these claims of correlation, understand that you are close — you have found a pattern that might have some explanatory power. But also understand that you still have one step to go — you need to clarify the mechanism of the cause.
In the shoes case, we might speculate that perhaps some people wear shoes that are too tight, and that the tightness makes them angry, and that anger manifests itself as crime. How probable is this? Not very. But it’s clear enough to be testable. In the power outage case, we can simply look at the logic of the situation: I’ve used this same hairdryer at the same time every day for more than two years without causing a blackout — how did today’s incident cause the blackout when the other incidents did not?
We hear this all the time — “There must be something about the new supplier of Compound K, because the day we switched to that, our yields went down.” And that’s as far as it goes. But it needs to go further. In a recent case involving a bio-pharma process that dropped in yield by 22 percent, almost overnight, just such a supplier change was present. “It’s the new soy oil,” they declared. “It must be.” Sure. The decrease in production did correspond in time to the change in suppliers. But that’s not enough.
What is it about that new soy oil that’s causing the drop in production? The key fact in that case related to when in the process the production dropped. Did it drop during the growth phase, when the bugs are multiplying? Did it happen in the production phase, when those fully grown bugs are now producing antibodies? In this case, it was the growth phase that had dropped; once past that, the process continued to produce as much protein as expected, as a percentage, for that smaller number of organisms.
When we pushed the discussion further and asked, “What is it about the soy oil that aids the growth of organisms?” the answer was “proteins: the higher the level of protein in the soy oil, the greater the production.” Armed with this logic, we checked the protein levels of the old soy oil against the new soy oil. The new oil had over 20 percent less protein, which explained how the change in suppliers had caused the problem.
So the next time you hear someone positing a correlation as a cause, ask about the mechanism: “I can see that these two things co-occur, but how does this change cause the specific deviation we are seeing?” If you can answer that, you have a real cause you can test.
For more information, please visit www.kepner-tregoe.com.