We should treat data projects as we do scientific experiments

It is received wisdom amongst many ‘data gurus’ that correlation equals insight. And supporting them in their quest are plenty of analytics platforms that spot these correlations and hidden relationships, plus additional tools to blend data and surface yet more of them.

The reality is that most data interrelationships are not that simple. Many correlations are based on false assumptions or ignore key factors. And as data gets bigger, more of these opportunities for error are introduced. The result is stories like this one claiming many big data projects fail to deliver.

Take a simple example of customer profiling. Your data may say your audience is most likely to use a particular social network, so you should spend your advertising budget targeting them there. But the data doesn’t tell you everything. Why are they there? Maybe it is because that is the platform where they want to switch off from work decisions, in which case it may be the worst place to target them. A better-designed interrogation of your data could have told you that.

For inspiration on how to design data projects that can actually deliver, we should look to more complex, higher-stakes environments. Life science leads the way in the intelligent use of data: where the output is a drug or treatment that could either make millions or be a costly failure, there is an absolute imperative not just to get it right but to get it right as soon as possible.

What life sciences appreciate, and many other domains perhaps don’t, is that there is a difference between observing correlations and actually understanding what those seemingly simple observations are telling you.

The copious amount of research data used in drug discovery and development is approached in the same manner as a scientific experiment. People who understand the subject matter, and the currency of the information, design experiments that test how different decisions affect the outcome, then test whether there is a direct causal link.

A simplified example: initial experimentation may show a correlation between higher temperatures and the yield of a chemical process. A data analytics black box may find this pattern and tell you to raise the temperature. A data scientist with knowledge of chemical processes will first isolate other factors and interrogate further. It may be that the increased temperature is changing the activation energy of a catalyst, so it is the catalyst, not the temperature, that is governing yield. If so, adding more catalyst could be a far cheaper option than spending a lot of money augmenting pilot plants to raise reaction vessel temperatures.
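The catalyst example can be sketched in a few lines of Python. This is an illustrative simulation, not data from any real process: the variable names (`temperature`, `catalyst_activity`, `process_yield`) and the numbers are invented so that yield is driven only by the catalyst, while catalyst activity happens to rise with temperature. The naive correlation then points at temperature; holding temperature roughly fixed exposes the true driver.

```python
# Hypothetical sketch of the confounding described above: yield depends
# only on catalyst activity, but catalyst activity rises with temperature,
# so temperature and yield appear strongly correlated.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

temperature = rng.uniform(20, 80, n)                        # reactor temperature
catalyst_activity = 0.05 * temperature + rng.normal(0, 0.2, n)
process_yield = 10 * catalyst_activity + rng.normal(0, 1, n)

# Naive view: temperature looks like it drives yield.
naive_r = np.corrcoef(temperature, process_yield)[0, 1]

# Controlled view: within a narrow temperature band, catalyst activity
# still predicts yield, revealing the real governing factor.
band = (temperature > 45) & (temperature < 55)
controlled_r = np.corrcoef(catalyst_activity[band], process_yield[band])[0, 1]

print(f"temperature vs yield (naive):           r = {naive_r:.2f}")
print(f"catalyst vs yield (temperature fixed):  r = {controlled_r:.2f}")
```

The "hold one variable fixed and see what still predicts the outcome" step is exactly the kind of designed interrogation the article argues for; a black box reporting only the first correlation would recommend the expensive fix.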

Life science is used to such scientific approaches, and they are just as applicable to data and analytics projects. Having people who understand what the data means and what you are looking for, and who can therefore design experiments to find meaningful, proven insights rather than just spot correlations, will ensure you make the best and most informed decisions. A black box data analytics platform will only ever see a pattern unless it is part of a project set up by subject matter experts to understand the relationships they are looking for.

Data holds implicit knowledge, but if you don’t understand what the data represents then that knowledge, and its potential value, may be squandered.

As this work illustrates, when examining large data sets all sorts of weird and wonderful correlations will emerge. The bigger the data, the more room for misinterpretation and errors. The man on the street is expert enough to immediately dismiss the correlation between mozzarella consumption and engineering doctorates as a fluke in the data. But being able to identify whether a correlation between temperature and reaction rate is due to a causal relationship, a third factor affecting both measurements, or just a fluke, requires expert insight and careful design of your data project.
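A quick simulation makes the "bigger data, more flukes" point concrete. This is a hedged sketch with invented parameters: a short series (think ten annual data points, like the mozzarella chart) compared against thousands of completely unrelated random series. Some of them will correlate strongly by chance alone.

```python
# Illustration only: with enough candidate variables, strong correlations
# appear between entirely unrelated random series.
import numpy as np

rng = np.random.default_rng(42)
n_points, n_series = 10, 2000      # short series, many candidate variables

target = rng.normal(size=n_points)                   # e.g. one annual metric
candidates = rng.normal(size=(n_series, n_points))   # unrelated random series

# Correlate the target against every candidate.
corrs = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])

strong = np.abs(corrs) > 0.8
print(f"{strong.sum()} of {n_series} random series correlate with |r| > 0.8")
```

Every "hit" here is pure noise, which is why a correlation mined from a wide data set is a lead to investigate, not an insight in itself.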

For this reason, investigating data should be approached like planning a scientific experiment. Start with an educated hypothesis based upon experience, expertise or a promising potential lead. If you’re not an expert on the subject, get experts to advise whether it sounds feasible – if not dismiss it before spending money investigating it.

Next, identify what you want to investigate, excluding as many distracting variables as possible, so you are homing in on the information you need to make better business decisions.

Where correlations appear, they should be investigated systematically and scientifically, to understand what the correlation may actually mean. A subject matter expert will be able to make informed assumptions about the validity of the relationship, or further experimentation may be needed to isolate specific variables and test whether the correlation holds up.
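One simple, systematic "does it hold up?" check, sketched below under assumptions of my own (the `holds_up` helper and its threshold are hypothetical, not from the article), is to require that a correlation found in one batch of data replicates in a fresh batch before treating it as real.

```python
# Hedged sketch: accept a correlation as a lead only if it appears
# in two independent batches of data.
import numpy as np

rng = np.random.default_rng(7)

def holds_up(x_a, y_a, x_b, y_b, threshold=0.5):
    """Return True only if |r| exceeds the threshold in both batches."""
    r_a = np.corrcoef(x_a, y_a)[0, 1]
    r_b = np.corrcoef(x_b, y_b)[0, 1]
    return abs(r_a) > threshold and abs(r_b) > threshold

# A genuine relationship (y = 2x + noise) replicates across batches...
x1, x2 = rng.normal(size=100), rng.normal(size=100)
y1 = 2 * x1 + rng.normal(size=100)
y2 = 2 * x2 + rng.normal(size=100)
print("real effect holds up:", holds_up(x1, y1, x2, y2))

# ...while a chance correlation between tiny unrelated samples usually won't.
noise = rng.normal(size=(4, 8))
print("noise holds up:", holds_up(noise[0], noise[1], noise[2], noise[3]))
```

Replication on fresh data is the cheapest form of the "further experimentation" mentioned above; isolating variables in a designed experiment, as in the catalyst example, is the stronger follow-up.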

Only this way can you understand the implicit knowledge of data, and go beyond finding a pattern, to understanding whether the pattern is valid and what that pattern means. This will provide the reliable insights which are promised but rarely delivered by the data gurus. Meanwhile, those who make decisions based solely on correlations will only get it right a small percentage of the time.

Is your organisation willing to gamble on correlations alone?

Matt Jones

Matt has over 16 years' experience of working in Research and Development groups within the ...


© Copyright 2017 Tessella
All rights reserved