
Patterns and confirmations

The shape in the cloud

It might have been a long time since you last did it, but we have all, at one point, looked up at the clouds and tried to make out familiar shapes in the cloud formations. You realise that these shapes are a combination of chance and your imagination. A similar kind of pattern-seeing can occur when we look at data.

One way of seeing patterns in data is known as the clustering illusion, which commonly occurs when we look at scatterplots or maps: we see some observations that are close together and assume that some underlying factor explains their proximity, while in reality the distribution is random. This illusion results from our distorted view of randomness: we expect random points to be spread far more evenly than they actually are. A real-life example can be found in Clarke (1946). During a certain period of World War II, a total of 537 German bombs fell on an area of 144 square kilometers in south London. People assumed there was some kind of pattern or clustering. Clarke divided the area into zones of a quarter of a square kilometer each (576 zones in total) and counted the number of bombs that had fallen on each zone. About 40% of the zones had not received a single bomb, 37% had one, 16% had two, 6% had three, 1% had four, and a single zone had seven bombs. He then compared these observed counts with the counts expected under a Poisson distribution and showed that there was no indication of clustering: the observed pattern is entirely consistent with chance.
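To make the Poisson comparison concrete, here is a minimal sketch of the kind of calculation involved. It uses only the figures quoted above (537 bombs, 144 square kilometers, quarter square kilometer zones); the variable names and the "5+" bucket are illustrative assumptions, not Clarke's own notation.

```python
import math

# Figures quoted in the text above.
BOMBS = 537
AREA_KM2 = 144
ZONE_KM2 = 0.25

zones = int(AREA_KM2 / ZONE_KM2)   # number of quarter square kilometer zones
lam = BOMBS / zones                # average number of bombs per zone


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Expected share of zones with 0..4 bombs if bombs fall purely at random.
expected_share = {k: poisson_pmf(k, lam) for k in range(5)}
# Everything else is lumped into one bucket (an assumption for display only).
expected_share["5+"] = 1.0 - sum(expected_share.values())

for k, share in expected_share.items():
    print(f"{k} bombs: {share:6.1%} of zones, about {share * zones:5.1f} zones")
```

If bombs were aimed at specific targets, we would expect many more empty zones and many more heavily hit zones than this random model predicts. The shares printed here can be compared directly with the observed percentages quoted above.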

Figure 1: Blue dots represent bombs, and the number in each cell is the number of bombs in that cell. This is only 1/9th of how Clarke’s original grid must have looked.

People also tend to see patterns because of so-called spurious relationships. Here there actually is a statistically significant relationship between two variables, but there is still no direct causal relationship in either direction. A well-known example is the positive relationship between the amount of ice cream sold and the number of deaths by drowning. Eating ice cream does not increase anybody’s probability of drowning, so it is not the cause, nor is ice cream typically eaten at funerals of drowning victims, so it is not the result either. The reason for this relationship is of course that both drowning and eating ice cream are much more common in warm weather.
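The ice cream example can be imitated with a small simulation. The sketch below uses made-up numbers: the temperature range, the effect sizes and the noise levels are all assumptions chosen for illustration, not real sales or mortality data. Both variables depend on temperature and on nothing else, yet they end up strongly correlated with each other; once temperature is controlled for (here with a first-order partial correlation), the relationship largely disappears.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def partial_corr(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    rxy, rxz, ryz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical daily data: temperature drives both variables independently.
temperature = [rng.uniform(0, 30) for _ in range(2000)]
ice_cream = [10 * t + rng.gauss(0, 50) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

r_raw = pearson(ice_cream, drownings)
r_partial = partial_corr(ice_cream, drownings, temperature)

print(f"ice cream vs drownings:               r = {r_raw:.2f}")
print(f"same, controlling for temperature:    r = {r_partial:.2f}")
```

The raw correlation is clearly positive even though neither variable appears in the formula for the other, which is exactly what makes the relationship spurious.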

Lots of spurious relationships can be found by joining all kinds of variables using a common key, like the year of measurement. This is exactly what Tyler Vigen did in his book “Spurious Correlations” (2015). If you’re interested in the relationship between the per capita consumption of mozzarella cheese and the number of civil engineering doctorates, or other rather hilarious relationships, you should check out this book or his website (https://www.tylervigen.com/spurious-correlations).

Berkson’s paradox

If you have ever worked in academia, you might have observed professors who were great teachers but whose scientific research was not exactly overwhelming. On the other hand, there are those who published in high-quality journals on a monthly basis but were totally incapable of explaining what it actually meant to an audience of students, so that their classes functioned simply as an effective cure for insomnia. It is almost as if there is a negative correlation between quality of teaching and quality of research. What you observed is the effect of Berkson’s paradox (Berkson, 1946). This paradox shows how a zero or positive correlation can turn into a negative one (or vice versa) because of a specific selection. Let’s assume that there is, in general, absolutely no correlation between quality of teaching and quality of research; this is represented by all points (grey and blue) in the graph.
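A small simulation can show how such a selection produces a negative correlation out of nothing. In the sketch below, teaching and research quality are drawn independently, so they are uncorrelated in the full population. The selection rule (a candidate becomes a professor only if the sum of both scores exceeds a threshold) and the threshold value are assumptions made for illustration; the article itself does not specify them here.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


rng = random.Random(7)  # fixed seed so the run is repeatable

# Whole population: teaching and research quality are independent.
population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5000)]

# Hypothetical selection rule: only candidates who are good enough overall
# (teaching + research above a threshold) become professors.
THRESHOLD = 1.0
professors = [(t, r) for t, r in population if t + r > THRESHOLD]

corr_all = pearson([t for t, _ in population], [r for _, r in population])
corr_selected = pearson([t for t, _ in professors], [r for _, r in professors])

print(f"everyone ({len(population)} people):     r = {corr_all:.2f}")
print(f"professors ({len(professors)} people):   r = {corr_selected:.2f}")
```

Among the selected group, someone with a weak teaching score can only have passed the threshold thanks to a strong research score, and the other way around. That trade-off, created purely by the selection, is what shows up as a negative correlation.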
