Deluded By Data

Why do we collect and analyze data? I believe that answering this question is critically important each and every time we step into a data analysis exercise. I say that because if we don’t keep our purpose foremost in mind while we are analyzing data, we stand a good chance of interpreting information that may not be real.

Analyzing process data is less dangerous than analyzing marketing data, sales data, employee satisfaction data, or other human behavior data, but there is still some risk of creating illusions with our data. Let me explain with a hypothetical example. We’ll examine sales data with the intent to determine requirements for a new product.

Suppose, in this case, we want to determine what colors to choose for a new electronic accessory. Maybe it’s an MP3 player or a cell phone or something similar. Let’s say that a local retail store happens to be very helpful to us in collecting such information, and the data for similar devices sold at that store within the last year shows that 50% of devices sold were black, 15% were red, 15% were blue, and 20% were white.

The easy and obvious thing to do is to write a product specification that mirrors the data: half of the new devices we produce will be black, one-fifth white, and the rest split between blue and red. No problem. I have observed, participated in, and led data analysis on a fair number of such data sets. Simply accepting such data at face value rarely (never, in my experience) reflects actual buyer behavior.

Just because 50% of sales were black devices doesn’t mean that 50% of buyers prefer black, or that 20% prefer white. I know, the data I gave says it does, right? I’m trying to clarify that it does not. Here is where critical thought becomes terrifically important. Let’s consider a few basic questions we might ask about the data.

Did the sales data include only our own brand of devices, or all similar devices by all competitors? Were all of the devices displayed on the same aisle or in the same area? Did the ratio change over time or was it consistent for the whole year?

Why are these questions important? Well, if the devices examined came from multiple manufacturers, do we know whether some of them were offered only in black? Did color affect the buying choice at all? Did color drive the buying choice away from potentially superior devices because some people didn’t want black? Were certain devices of a particular color displayed in one place while others were displayed somewhere else? Was one brand of devices, offered only in black, displayed in a high-traffic area while competitors were displayed elsewhere?

The point is that sales results do not reveal the spectrum of potential causes behind the decisions to buy. Therefore, the data may or may not reflect the information we really want to know.

Here is one phenomenon I witnessed while examining the sources of sales data, shadowing a sales representative who was checking on a display for a new product in a retail store. The store’s floor clerk opened a box of product and filled the space on the shelf, then opened a second box of product with a different appearance and finish and filled up the remaining space on the shelf. Other boxes of product with different appearances never made it onto the display.

The sales representative I shadowed raised a bit of a fuss and almost immediately had the department manager in the aisle with the floor clerk, fixing the display with a representative sample of each product appearance and finish. We left hoping that making a big deal of it would fix the problem, but the sales representative vowed to come back and check.

Imagine the data that would have come back at the end of the sales quarter if the display had been left as the floor clerk stocked it. We might have seen data showing that 75% of sales were finish A, 25% were finish B, and there were no sales at all for C or D: exactly the ratio of product on the display. Of course, the retail store would re-order biweekly or monthly based on what sold, so the sales of finishes A and B might perpetuate themselves perfectly according to what was on the shelf, with no regard to actual buyer preference.
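To see how quickly that feedback loop locks in, here is a toy simulation in Python. Everything in it is hypothetical: it assumes buyers are completely indifferent to finish, that they buy in proportion to what is on the shelf, and that the store restocks each period with exactly what sold.

```python
import random

# Hypothetical initial display: the clerk stocked only finishes A and B.
shelf = {"A": 75, "B": 25, "C": 0, "D": 0}

for quarter in range(1, 5):
    # Buyers pick in proportion to shelf share; they have no finish preference.
    stock = [finish for finish, count in shelf.items() for _ in range(count)]
    sales = {finish: 0 for finish in shelf}
    for _ in range(100):  # 100 purchases per quarter
        sales[random.choice(stock)] += 1
    shelf = dict(sales)  # restock exactly what sold
    print(f"quarter {quarter}: {sales}")

# Finishes C and D never sell, not because buyers dislike them, but because
# they were never displayed. The sales data faithfully reports the shelf,
# not the preference.
```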

In fact, in digging critically into sales data, I have observed exactly that phenomenon. The problem is that digging that critically into data is exhausting and expensive. The investment can quickly exceed the value of the data. With that concern in mind, many analysts choose to simply accept data at face value.

I’ve seen the results of that too. I’ve been the victim of those results. I expect that readers, too, have been handed product requirements that are a ridiculous combination of features and functions. When we challenge the requirements, we are referred to data that shows how competitive products have these features, so we must produce them too. That is the indicator of delusional data analysis.

Just because a product has a feature or function, it doesn’t mean that the feature or function in question is the reason the product sells. It’s the age-old challenge of distinguishing correlation from cause.

I’ve run into this challenge with many forms of behavioral data. An analysis of sales data once suggested that a business should stop producing a rarely purchased option. However, eliminating the option also eliminated sales of other products, because customers purchased them in combination. The expected savings failed to materialize because of the small dent in revenues that accompanied the change.

Analysis of customer satisfaction data and employee satisfaction data is very difficult to perform and easily indicates courses of action that may not really make a difference. For example, one department of employees may be noticeably less satisfied than others. Satisfaction with direct managers also appears to be low.

It would seem that replacing the managers of that department might be the proper course of action. However, the department happens to be the customer service call center. Replacing the manager of that department could very easily reduce morale instead of improving it. Call centers are notoriously discouraging departments for employees because of the nature of the work, and managers often get the blame for every form of angst, especially when it’s less expensive to deal with turnover than to invest in improving morale.

There are many, many ways of dealing with data that may or may not provide meaningful information. There are hundreds of data analysis methods. The trick is not finding appropriate methods, the trick is recognizing when we need to examine our data more critically.

By asking the simple question, “What do I need to know from this data?” and then following up with a critical examination question, “Does this data really tell me that, or is there a cause we can’t see?” we can protect ourselves from the delusions of data. It’s the same practice as examining the null hypothesis in statistics.

Let’s go back to the challenge we discussed above, picking colors for our new electronic device. What are we trying to determine? We want to determine what colors will maximize our sales. Does the data we have from the store tell us what colors drive customer decisions? It tells us what ratios of device sales occur in certain colors.

Compare the last question to the last statement. They are not the same thing. It’s subtle, but important. The ratio of colors currently purchased is not the same information as what color(s) will maximize sales. Would pink sell more than red and white combined? The data won’t tell us.

Because it can be a subtle challenge to determine whether the data we have will provide the information we need, I find it very helpful to write each one down in very clear and careful terms. I write very specifically what I want to know and then, separately, write down what the available data can or might prove (not what I want it to prove, but what it actually contains). When I write them down in clear terms, I can better see whether the data is capable of producing the information I need.

I compared this to the discipline of the null hypothesis. In statistics, if we want to determine that black devices will sell better than blue and red devices, we seek to disprove the opposite. The hypothesis is that black devices sell more than red and blue devices combined; the null hypothesis, therefore, is that black devices do not sell more than red and blue devices combined. Only when the data lets us reject the null hypothesis have we scientifically determined a statistical “fact.”
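For readers who want to see the mechanics, here is a minimal sketch of that test in Python using SciPy’s binomtest. The sales counts are hypothetical, scaled up from the percentages in the store example; white devices are left out because the hypothesis only compares black against red and blue combined.

```python
from scipy.stats import binomtest

black, red, blue = 500, 150, 150  # hypothetical unit sales
n = black + red + blue            # sales relevant to the hypothesis

# Null hypothesis: black does NOT sell more than red and blue combined,
# i.e., black's share of these sales is at most 50%.
# Alternative: black's share exceeds 50%.
result = binomtest(black, n, p=0.5, alternative="greater")

print(f"black share: {black / n:.1%}")  # 62.5%
print(f"p-value: {result.pvalue:.2e}")

# A small p-value lets us reject the null hypothesis. But note what it does
# NOT tell us: why black sold more. Display placement, availability, or
# brand could be the real cause, which is exactly the article's point.
```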

The problem with most of the data we get from sales, satisfaction surveys, or the like is that we can draw correlations between results and certain features or functions, but the data is usually too noisy to rule out that the results come from something else. Often, the results do come from things other than those we chose to examine. The interactions are simply too complex.

Save yourself the gross, but common, mistake of letting data fool you into making decisions that sabotage your intent. Don’t let data lead you to specifying “Frankenstein’s Monster” products. Be critical in determining whether the data you have actually can or will provide the information you seek. Be sure that the correlation you see really represents the cause, and is not just a coincidence or a side effect of the true cause.

Protect yourself and your organization from delusions created by expecting data to tell you what you want or need to know by critically examining if the data truly represents information concerning the question you have. Ask yourself specifically what the data contains and, separately, what you need to know. If the two are different, don’t let the data make a decision for you. Instead of assuming the data you have will guide you, seek the information you need.

Stay wise, friends.

If you like what you just read, find more of Alan’s thoughts at www.bizwizwithin.com
