Data Modeling’s Great Debate

I find that as industries more readily seek correlations and attempt to build models, the arguments are more frequent and even more heated.

Alan Nicol, Executive Member, AlanNicolSolutions

Jan 27, 2012

Though the practice of building models from data continues to grow, an old debate still rages. Some would use models based on strong correlation and forgo the often-difficult quest to establish cause, while others insist that models not based on cause and effect are invalid.

I am encouraged by the long-lasting trend by which businesses and design teams are using models based on data to predict outcomes and control processes with ever-greater confidence and frequency. I’m a strong believer in the power of modeling data to better prepare or influence our businesses.

I admit, though, that during my first few encounters with the argument over whether models may be based solely on correlation or whether cause absolutely must be identified, I convinced myself that the argument would die as industries became more proficient with data modeling, and that I would no longer need to convince people that correlation isn’t enough. My crystal ball was wrong.

Instead, I find that as industries more readily seek correlations and attempt to build models, the arguments are more frequent and even more heated. I don’t have an explanation for why, but I’ve had the debate so many times that I do feel I can share some concise insight to help our readers the next time they are party to it.

First, however, let me clarify what I mean by the debate between correlation and cause. The classic example, which may be overused but is still a clear place to start, is the urban-legend model showing that when ice cream sales are high, so are shark attacks on beachgoers. This is correlation: a statistically significant, demonstrated relationship between one data factor and another.

If A goes up and B goes up at the same time, then they correlate. Likewise, if A goes up and B goes down at the same time, the two are “negatively” correlated, but correlated nonetheless. However, correlation does not necessarily mean that the two have any influence on each other. Ice cream sales do not influence shark attacks, nor do shark attacks influence ice cream sales. They are correlated, but there is no causality between them.

In this case the two share a common cause, but it is not part of the model. For ice cream and sharks, the cause is hot weather: when it is hot outside, more people buy ice cream. Similarly, more people are at the beach, and sharks coincidentally feed closer to shore.
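A small simulation makes the point concrete. The sketch below is my own illustration, not real beach data: the temperature range, the coefficients, and the noise levels are all made-up assumptions. Temperature drives both series, neither series touches the other, and yet the two correlate until we account for temperature.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical daily data: temperature is the hidden common cause.
temperature = [rng.uniform(15, 35) for _ in range(1000)]
# Each series depends on temperature plus its own noise, never on the other.
ice_cream_sales = [20 * t + rng.gauss(0, 60) for t in temperature]
shark_attacks = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

# Correlation as the naive modeler sees it.
r_raw = pearson(ice_cream_sales, shark_attacks)

# Correlation after the temperature effect is removed from both series.
r_partial = pearson(residuals(ice_cream_sales, temperature),
                    residuals(shark_attacks, temperature))

print(f"ice cream vs. sharks:             r = {r_raw:.2f}")
print(f"same, with temperature removed:   r = {r_partial:.2f}")
```

Under these assumptions the raw correlation is clearly positive, while the correlation left over after removing temperature is close to zero. That leftover number is what tells us ice cream was only standing in for the weather.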

The debate to which I refer occurs when someone decides that they want to use ice cream sales to predict shark attacks. If the correlation is strong and reliable, then even if it doesn’t model the cause it should still be good enough to predict a rise or fall in shark attacks. A stickler for data modeling precision would insist, however, that making predictions based on correlation only, without understanding cause, is too prone to error and just poor practice.

I recently read an article that described how scientists who were proactively trying to get ahead of the H1N1 flu outbreak clearly must have run into this very dilemma. They were trying to predict how the virus might move through the United States and where they needed to distribute the limited supply of vaccine to best mitigate the spread of the flu.

The scientists used the Where’s George project data set to predict the movement and spread of the virus. The Where’s George project attempts to model the movement of people within the U.S. by tracking the location of millions of one-dollar bills as they migrate across the country. It’s actually a very clever idea. After all, in order for a dollar bill to move from one place to another it must be physically carried by people and traded between them.

The scientists used the Where’s George model of money movement to model the movement of the H1N1 flu virus. I say they must have run into the debate between correlation and cause. These are scientists who should not be timid about data or data modeling, yet they chose to use a model that had no relationship to the flu itself to estimate the spread of the flu. Surely the discussion must have taken place.

The results of their gamble turned out rather predictably when I think about it in retrospect. The model predicted the movement of the flu in the U.S. very well. However, the scientists underestimated the rate at which it would spread as it moved. Naturally, a dollar bill does not replicate itself as it moves, but the flu most certainly does. Just how many people a carrier might infect was anyone’s guess, and the Where’s George model itself could offer no help in estimating that part of the scientists’ problem.

We see that even data-savvy, data-loving scientists will hazard predictions based on unrelated factors. It happens a great deal in business as well. I’ve even had the debate with engineers developing systems and products. So what are the reasons for accepting correlation and taking our chances, and just what chances would we be taking that establishing cause can remove?

Reasons Why People Accept Correlation

There are really only two reasons why people choose to make predictions or manage process control based on correlation, without building a model based on genuine cause-and-effect factors. The first is time and money. The second is a lack of understanding.

Sometimes, the cause of a phenomenon is not understood. This happens in the medical industry, where the cause of certain diseases or illnesses may not be attributable to any measurable data source. In business or engineering, it tends to be more a matter of willingness to invest the time and money necessary to discover or model the cause. Market data is particularly difficult. Modeling human behavior is especially elusive.

Clearly the scientists in the example above had this problem. They were pressed for time and needed a reasonably assured way of predicting the movement of a disease immediately. They didn’t have the luxury of time to build a model of the disease itself. They had to accept the risks of using a model that was not actually related to the disease. I can’t blame them. I thought their solution was rather insightful.

Otherwise, people choose to make decisions based solely on correlation because they do not understand the risks, and so they do not look deeper for cause-and-effect relationships. If this is the case, then we owe it to our peers, our leaders, the business we work for, and ourselves to educate these decision-makers and make them aware of the mistake they may be making. Share with them the following.

Reasons to Seek Cause and Effect

The reasons to seek cause and effect are more like reasons not to rely on correlation alone. Here is the short list:

Without a cause-effect link, the relationship between two outcomes or factors may not be stable; it may change unpredictably.
Cause-effect relationships can reliably anticipate the quantitative response.
The influence of a cause can be quantified: a model that accounts for 99.7 percent of all influence on the outcome can be highly reliable.

Let me explain each of these a little further. We can refer to the examples already given above.

Correlation alone will not explain a relationship between the predictor (ice cream sales) and the outcome (shark attacks). As far as the model knows, the relationship is purely coincidental. Therefore, we do not have any confidence that the two will continue to be correlated in the future. The coincidental relationship may or may not be stable.

For example, suppose that the cost of ice cream suddenly drops, or a new process suddenly makes it even better, or a certain teen celebrity declares that all of his talent comes from eating ice cream. Any of these things could increase ice cream sales, but they would have zero effect on shark attacks. Someone using ice cream to predict shark attacks would raise a false alarm.
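The false alarm can be sketched in a few lines. Everything here is assumed for illustration: the coefficients, the noise, and the hypothetical sales_boost parameter that stands in for a price drop or a celebrity endorsement. We fit shark attacks against ice cream sales on ordinary days, then feed the fitted line a period when sales rise for reasons that have nothing to do with the weather.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable


def fit_line(xs, ys):
    """Least-squares slope and intercept for ys as a function of xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def simulate(days, sales_boost=0.0):
    """Temperature drives both series; sales_boost is extra ice cream
    demand that is unrelated to weather (hypothetical promotion)."""
    sales, attacks = [], []
    for _ in range(days):
        temp = rng.uniform(15, 35)
        sales.append(20 * temp + sales_boost + rng.gauss(0, 60))
        attacks.append(0.3 * temp + rng.gauss(0, 1.5))
    return sales, attacks


# Build the correlation-only model on ordinary history.
hist_sales, hist_attacks = simulate(1000)
slope, intercept = fit_line(hist_sales, hist_attacks)

# Same weather, but ice cream sales jump for an unrelated reason.
promo_sales, promo_attacks = simulate(1000, sales_boost=300)
predicted = [intercept + slope * s for s in promo_sales]

mean_hist = sum(hist_attacks) / len(hist_attacks)
mean_predicted = sum(predicted) / len(predicted)
mean_actual = sum(promo_attacks) / len(promo_attacks)

print(f"historical attacks per day: {mean_hist:.1f}")
print(f"predicted during promotion: {mean_predicted:.1f}")
print(f"actual during promotion:    {mean_actual:.1f}")
```

In this toy setup the prediction climbs with sales while the actual attack rate stays where it was, because nothing that drives shark attacks has changed.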

When we do have a cause-and-effect relationship between a predictor and an outcome, then we can often establish a predictable, quantitative influence that the one has over the other. Suppose that the scientists trying to mitigate the spread of the flu actually had a model for the rate at which the flu spreads as it moves across the country. They would have had much better foreknowledge of the amount of vaccine needed in strategic areas.

As it was, they had little to work with and they underestimated. In production processes and engineering, we can often establish the relationship and contribution a factor can have on an outcome. This allows us not just to say, “When A goes up B will also go up.” It allows us to say, “When A goes up by X, B will go up by Y.” When we have such a model, we have a very powerful tool at our disposal. Without it, we don’t really know if B will even respond when A changes. We certainly don’t understand how to manipulate B if A is only correlated and is not a cause or contributing cause.
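Here is a minimal sketch of what “When A goes up by X, B will go up by Y” looks like in practice. The five data points are invented for illustration, and the reading of the slope as a lever only holds under the assumption that A really is a cause of B, for example a machine setting and the measurement it controls.

```python
# Hypothetical measurements from a process where A is known to drive B.
a_values = [1.0, 2.0, 3.0, 4.0, 5.0]
b_values = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(a_values)
mean_a = sum(a_values) / n
mean_b = sum(b_values) / n

# Least-squares slope: the expected change in B per unit change in A.
slope = (sum((a - mean_a) * (b - mean_b) for a, b in zip(a_values, b_values))
         / sum((a - mean_a) ** 2 for a in a_values))
intercept = mean_b - slope * mean_a

# "When A goes up by X, B will go up by Y."
x_change = 3.0
expected_change = slope * x_change

print(f"B changes by about {slope:.2f} for each unit of A")
print(f"Raising A by {x_change} should raise B by about {expected_change:.2f}")
```

The same arithmetic run on ice cream sales and shark attacks would also produce a slope, but that slope would describe a pattern in past data, not a lever we could pull.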

Finally, the statistical significance of a cause can be established as it relates to its ability to predict or control the outcome. Said another way, if we go beyond just correlation, and establish cause, then we can statistically determine how much control one or more causes have on the outcome. Let’s look at shark attacks.

Instead of using ice cream, we decide to use temperature. We can compare the rise and fall of temperature to the number of shark attacks. Suppose we determine that 50 percent of the changes in shark attacks can be explained by changes in temperature. We know our model probably needs some improvement. Suppose that when we add in the day of the week (weekend vs. weekday) and the month (perhaps to account for shark migration patterns), we find that our predictor factors collectively account for 87 percent of the behavior of the outcome. We now have a much better model, and it is much less likely to suddenly change behavior in the future because we know we have most of the controlling factors identified.
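The “percent explained” figure is what statisticians call R-squared, and a rough sketch shows how it grows as real contributing factors are added. The data below is simulated under my own assumptions (the effect sizes, a weekend flag, and a hypothetical in_season flag standing in for month), so the numbers it prints will not match the 50 and 87 percent in my example; only the pattern is the point.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(rows, ys):
    """Fit ordinary least squares with an intercept; return R-squared,
    the share of the variation in ys that the predictors explain."""
    xs = [[1.0] + list(row) for row in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    beta = solve(xtx, xty)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - sum(b * v for b, v in zip(beta, x))) ** 2
                 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


rng = random.Random(2012)  # fixed seed so the run is repeatable
rows_temp, rows_full, attacks = [], [], []
for _ in range(2000):
    temp = rng.uniform(15, 35)
    weekend = 1.0 if rng.random() < 2 / 7 else 0.0
    in_season = 1.0 if rng.random() < 0.25 else 0.0  # stand-in for month
    # Assumed effect sizes; all three factors contribute, plus noise.
    attacks.append(0.3 * temp + 3.0 * weekend + 2.5 * in_season
                   + rng.gauss(0, 1))
    rows_temp.append((temp,))
    rows_full.append((temp, weekend, in_season))

r2_temp = r_squared(rows_temp, attacks)
r2_full = r_squared(rows_full, attacks)

print(f"temperature only:               R^2 = {r2_temp:.2f}")
print(f"temperature + weekend + season: R^2 = {r2_full:.2f}")
```

The gap between the two numbers is the part of the outcome that temperature alone leaves unexplained, and it is exactly the part that can surprise us later if we stop at the first model.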

Some things are easier to model than others. Chemical reactions and mechanical relationships of machinery are very predictable and stable once the causing and contributing factors are all accounted for. Human behavior is rather inexplicable, and a model that explains 60 percent of the outcome of human behavior is really good.

No model will be perfect. All models have error. The key is that when our models include a cause-effect relationship, we know what that error is. I must caution us, however. All models are based on historical behavior. The future is outside of that inference space and, therefore, there is no guarantee that our models will always predict the outcome. Modeling the effect of cause, though, is a much better way of hedging our bets.

Bottom Line

Sometimes we simply do not have the information, time, or money to try to establish the cause of our outcome and to model it. There may be a few instances where the cost of a mistake based on correlation is less than the cost of trying to model a true cause-and-effect relationship. In these instances, it may be necessary to accept the correlation and take our chances. This should be an exception, not a rule, and it should not be an excuse to be too lazy to build the right model.

Whenever possible, seek to establish the genuine cause or causes of your outcome and build a reliable model on them. It is a much better modeling practice, and your model will be more reliable, will be a better predictor, and will be more useful than any model based solely on correlation. Do not let a lack of understanding of the risks drive the decision to accept and use a model based solely on correlation.

Stay wise, friends.

If you like what you just read, find more of Alan’s thoughts at www.bizwizwithin.com.