Be scared of the myth of big data

Last night I attended a lecture by Yuval Noah Harari – historian and author of the popular book ‘Sapiens’. Harari’s thesis is that human society is built on shared myths, and that without these we wouldn’t be able to organise ourselves into groups of more than a couple of hundred people. These myths are things like religion, social caste, political ideologies, and money.

During questions a member of the audience asked Harari what he predicted the next great myth would be. He answered, “Data.”

Harari’s contention is that with the growth of big data we are moving towards deifying quantitative information. Just as money has become something in which we unanimously place our trust (and therefore grant great power to otherwise valueless slips of paper) so we will begin to place our faith in data.

I can see signs of this myth emerging already, and I think it goes something like this: “if we get enough data we will be able to predict the future.”

The problem is that we won’t. There are some things data cannot tell us; there are limits to its power. Bigger sample sizes can take us only so far, and there are certain frontiers that no sample size can help us cross. My fear is that if the data myth grows we will increasingly find ourselves basing decisions on statistical fallacies and, lulled into a false sense of security, end up with all of our eggs in a very unstable basket.

There are four reasons this myth is wrong:

The way we use statistical significance is logically flawed – so we cannot trust our results
Many social scientists use statistical significance tests, which answer the question “Given the hypothesis is true, what is the probability of observing these results?”. However, the question they actually want answered is “Given these results have been observed, what is the probability that the hypothesis is true?”. Though they sound similar, these questions are fundamentally different.

Ziliak & McCloskey (2008) liken this to the difference between asking “Given a person has been hanged, what is the probability they are dead?” (~100%) and “Given a person is dead, what is the probability they have been hanged?” (<1%). Although these questions sound similar, they have completely different answers, and we could be using our statistical significance tests to make mistakes as big as these.
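To put rough numbers on the gap between the two questions, here is a minimal sketch using Bayes’ rule. The figures are invented for illustration (a 10% share of true hypotheses, a test with 80% power and a 5% significance threshold); none of them come from Ziliak & McCloskey.

```python
def posterior(prior, p_data_given_h, p_data_given_not_h):
    """Return P(H | data) using Bayes' rule."""
    evidence = p_data_given_h * prior + p_data_given_not_h * (1 - prior)
    return p_data_given_h * prior / evidence


# All three numbers below are hypothetical, chosen only for illustration.
prior = 0.10   # assumed share of tested hypotheses that are actually true
power = 0.80   # P(significant result | hypothesis true)
alpha = 0.05   # P(significant result | hypothesis false)

# The question we want answered: P(hypothesis true | significant result).
p_true_given_significant = posterior(prior, power, alpha)

# The chance that a "significant" finding is false is not alpha.
p_false_given_significant = 1 - p_true_given_significant

print(f"P(significant | hypothesis false) = {alpha:.2f}")
print(f"P(hypothesis false | significant) = {p_false_given_significant:.2f}")
```

Under these assumed numbers, a “significant” result is wrong far more often than the 5% threshold suggests, because the answer depends on how plausible the hypothesis was to begin with.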

The laws of societies are not fixed – so we cannot predict the impact of our actions
We use data to estimate parameters about society and the economy, such as the relationship between inflation and unemployment, or between income inequality and crime. Although we can measure the parameters of these relationships at the moment, these parameters are not fixed. In fact they are highly prone to change whenever we alter something like technology or government policy.

So for example we cannot predict the impact of a new invention on society, because our prediction would be using parameters from the pre-invention world and not accounting for the invention’s impact on the deeper structures of society. This means that the times we most want to use data to predict the future – those times of significant change – are precisely those times when to do so would be utterly invalid.
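A toy sketch of the same point, with entirely made-up numbers: we fit a straight line to data from the “old world”, then use it to predict an outcome after the underlying relationship has shifted. The two relationships below are hypothetical and stand in for something like inflation against unemployment.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical relationship before the change: y = 10 - 2x
xs_before = [1.0, 2.0, 3.0, 4.0]
ys_before = [10 - 2 * x for x in xs_before]

slope, intercept = fit_line(xs_before, ys_before)

# Hypothetical relationship after the change: y = 4 - 0.5x
x_new = 3.0
actual = 4 - 0.5 * x_new
predicted = intercept + slope * x_new  # still uses the old parameters

print(f"predicted {predicted:.2f}, actual {actual:.2f}")
```

The fit to the old data is perfect, yet the prediction is still wrong, because collecting more pre-change data does nothing to reveal the post-change parameters.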

No amount of data can capture the complexity of human systems – so we cannot make predictions beyond very short time horizons
Non-linear systems suffer from what mathematicians call “sensitive dependence on initial conditions”, popularly known as the butterfly effect. In a linear system measurement error is not a big problem. As long as a measurement falls within reasonable bounds of error we can make predictions within similarly reasonable boundaries because we know how much the error can be magnified. In a non-linear system, however, measurement error, even if utterly minuscule, can completely dominate a prediction. This is because the feedback loops in such a system continually transform and magnify the error until the resulting behaviour of the model is totally divorced from that of reality.

Human systems are so complex that we cannot measure them exactly: there will always be some measurement error, no matter how much data we obtain. They are also extremely non-linear, which means that our predictions will quickly deviate from reality.
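Sensitive dependence is easy to see in a toy model. The sketch below uses the logistic map, a standard textbook example of a chaotic non-linear system; it is my choice of illustration and not a model of any human system. The starting value and the size of the “measurement error” are arbitrary assumptions.

```python
def logistic_map(x0, r=4.0, steps=60):
    """Iterate x -> r * x * (1 - x) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs


# "Reality" and our "measurement" differ by one part in ten billion.
true_path = logistic_map(0.2)
measured_path = logistic_map(0.2 + 1e-10)

# Gap between the two trajectories at each step.
gaps = [abs(a - b) for a, b in zip(true_path, measured_path)]

for step in (0, 10, 20, 30, 40, 50, 60):
    print(f"step {step:2d}: gap = {gaps[step]:.3e}")
```

The gap starts out negligible and grows roughly exponentially, until the two trajectories have nothing to do with each other. A more precise measurement buys only a few extra steps of useful prediction.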

We don’t know how to handle uncertainty – so we cannot forecast probabilities
Our forecasting models are built on probabilities. We manage risk by assigning probabilities to all possible outcomes, based on historical data. What we can’t do is manage uncertainty. Uncertainty is different to risk because it describes a situation where the possible range of outcomes and/or their probabilities are not known. If we don’t know the probability of an outcome, or we don’t even know what the outcome is, then we can’t build it into a model. And if our models only take a subset of possible outcomes, and assume that the probabilities of the past are unchanged in the future, then the probabilities they forecast will be wrong.

4 thoughts on “Be scared of the myth of big data”

  1. ollieorange2

    Statistical significance testing is not logically flawed; it was recently used to prove the existence of the Higgs Boson. *Some* Social Scientists misunderstand the logic behind it, but that’s not the same thing at all. There are some very prominent Education Researchers who want statistical significance testing banned in education research altogether, which is why half of the EEF reports don’t have statistical significance tests. This blog is very irresponsible. People will read it and believe it.
