Choosing the recipe and gathering the ingredients

Many different recipes for cake exist, and the same can be said for models in data-science. The advantage is twofold. Firstly, we can choose our recipe based on the available ingredients. Secondly, we have a choice in complexity, ranging from classic statistical models to the most complex deep-learning methods. Given how much more available and easier to use the latter have become in recent years, we’re often tempted to go for one of these, but if your client is hungry you might be better off starting with a more basic cake.

Your cake ingredients are your data. When you try to lay everything out, it turns out the eggs you have are rotten (bad data quality) and some key ingredients are simply missing. In an ideal situation all necessary ingredients can be found at your local (data) warehouse, but most of the time you’ll have to shop around a bit. You shouldn’t limit yourself to data already available within the company but rather broaden your horizon with data from web APIs, open data, or in some cases even purchased external data.

Combining and stirring the ingredients

Throwing every ingredient in a bowl and just hoping for the best outcome generally does not result in a tasty cake. In many data-science projects combining the data is the most time-consuming part, especially if the data originates from a wide array of sources. A difficult choice we have to make is the granularity at which we want to analyse the data.

Take for example a study of health and illness in a big population. For each citizen you know how many days they were ill every year (so you don’t know which days of the year, only the yearly total), and your task is to determine as many significant predictors of illness as possible. We can write this as granularity citizen X year (or vice versa), which implies that every row in our dataset should correspond to one person in one particular year. Now one of your other datasets contains air pollution levels per region (let’s say a region is some specific geographical area, so not just a postal code or city name), but these measurements are only taken every two years (region X 2-years).

How do we join these datasets? First let’s take a look at the time dimension. We could either create a region X 2-years dataset by averaging days of illness over two years, or we could interpolate the pollution dataset, so that for the years in which no measurements took place we estimate the level of pollution from the measurements of the previous and next year.
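The interpolation option can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming linear interpolation between the two neighbouring measurements; the function name `interpolate_yearly` and the pollution values are made up for the example, and in a real project a data-frame library would normally do this for you.

```python
def interpolate_yearly(measurements):
    """Fill unmeasured years by linear interpolation between neighbouring measurements.

    `measurements` maps a year to the pollution level measured in one region.
    It must contain at least one measurement.
    """
    years = sorted(measurements)
    yearly = {}
    # Walk over each pair of consecutive measurement years.
    for start, end in zip(years, years[1:]):
        span = end - start
        for year in range(start, end):
            # weight is 0 at the earlier measurement and approaches 1 near the later one
            weight = (year - start) / span
            yearly[year] = (1 - weight) * measurements[start] + weight * measurements[end]
    # The last measured year is not covered by the loop above.
    yearly[years[-1]] = measurements[years[-1]]
    return yearly


# Hypothetical measurements for one region, taken every two years (fictitious values).
region_a = {2016: 40.0, 2018: 50.0, 2020: 44.0}
yearly_a = interpolate_yearly(region_a)
print(yearly_a)
# {2016: 40.0, 2017: 45.0, 2018: 50.0, 2019: 47.0, 2020: 44.0}
```

Note that the estimate for 2017 is simply halfway between the 2016 and 2018 measurements, which is exactly the kind of assumption that breaks down when pollution behaves irregularly.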

In this case I would argue for the latter solution, as reducing the information in the target variable (days of illness) will harm our predictive power more than adding a bit of noise to one of the predictors. Be cautious, however, when you apply such an interpolation or any other way of estimating missing data points. If, for example, the level of pollution follows a very irregular pattern over time, the estimate might be way off. The pollution data now has granularity region X year, but we still can’t join it with the illness data directly because that data does not contain a region variable. What we do have for every citizen in each year is their address. Using a process called geocoding, the address can be converted to coordinates (i.e. longitude/latitude).

For each of the regions in the pollution dataset we have a polygon, so it’s possible to assign each of the coordinates (and therefore each citizen’s home) to one of the regions by calculating in which polygon the coordinate lies. We can then enrich the illness dataset with the level of air pollution. This is just one example to show that combining multiple datasets is often a little less straightforward than just joining tables in a database.
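To give an idea of what "calculating in which polygon the coordinate lies" involves, here is a minimal sketch using the classic ray-casting test. It assumes the coordinates can be treated as points on a flat plane, and the region names, polygons, and citizen identifiers are hypothetical. In practice you would most likely rely on a geospatial library rather than writing this yourself.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    # Pair every vertex with the next one, wrapping around to close the polygon.
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means "inside"
    return inside


def assign_region(point, regions):
    """Return the name of the first region whose polygon contains the point, else None."""
    for name, polygon in regions.items():
        if point_in_polygon(point, polygon):
            return name
    return None


# Hypothetical regions (lists of corner coordinates) and geocoded home addresses.
regions = {
    "north": [(0.0, 1.0), (2.0, 1.0), (2.0, 2.0), (0.0, 2.0)],
    "south": [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)],
}
homes = {"citizen_1": (0.5, 1.5), "citizen_2": (1.5, 0.25), "citizen_3": (5.0, 5.0)}

home_region = {citizen: assign_region(p, regions) for citizen, p in homes.items()}
print(home_region)
# {'citizen_1': 'north', 'citizen_2': 'south', 'citizen_3': None}
```

A citizen who falls outside every polygon gets no region at all, which is one more source of missing data to keep an eye on.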

The actual baking

Baking your cake (or calculating/running your data-science model) is seen as the core of the process, but it is also often the easiest part. What outsiders see as the magic the data-scientist is capable of performing is in practice no more than running a few lines of code, assuming of course that all previous steps were followed thoroughly. One of the few things that can go wrong is keeping your cake in the oven for too long. By that I mean overfitting, of course. In some cases, depending on the amount of data and the complexity of the model, you can take the baking analogy quite literally as it might seriously heat up your CPU or GPU.
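A small experiment shows what overfitting looks like. The sketch below is an illustration on fictitious data, not a recipe: it uses a nearest-neighbour model (my choice for the example, because it needs no libraries) and compares the error on the training data with the error on data the model has not seen. The sample size, the noise level, and the values of `k` are arbitrary assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the experiment is repeatable


def sample(n):
    """Draw n fictitious points where y is 2x plus random noise."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        points.append((x, 2 * x + rng.gauss(0, 1)))
    return points


def knn_predict(train, x, k):
    """Predict y as the average target of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(train, data, k):
    """Average squared difference between predictions and observed targets."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


train, validation = sample(50), sample(50)

errors = {}
for k in (1, 10):
    errors[k] = (
        mean_squared_error(train, train, k),       # error on data the model has seen
        mean_squared_error(train, validation, k),  # error on new data
    )
    print(f"k={k}: training MSE={errors[k][0]:.2f}, validation MSE={errors[k][1]:.2f}")
```

With `k=1` every training point is its own nearest neighbour, so the training error is exactly zero: the model has memorised the noise. The validation error tells the honest story, and on most random draws a smoother model such as `k=10` does better there despite its higher training error.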

Presenting the result

Even though the cake is now out of the oven we’re far from finished. The cake needs frosting, decorations, and whatever else you can put on a cake to make it more attractive, because you will have to sell your data-science cake to a client whose appetite still needs a boost. Whatever came out of R, Python, or any other brand of oven has meaning to the data-scientist but not as much to the client. The following picture gives an example of an undecorated cake versus a decorated cake.

On the left is R output of a rather simple model: a generalised linear regression with two continuous predictors, one categorical predictor and some first order interaction effects. On the right you see the frosting applied to a cake in the form of a spider.

This is an intuitive way to present which variables contribute most in a predictive model. The width of the legs corresponds with an appropriate effect size or statistical significance, and hovering over one of the predictors reveals a bar chart showing average levels of the target variable as a function of different levels of that predictor (the image you see here is not interactive, however).

Figure 1: Left: typical R output. Right: a “spider”. The spider shown corresponds with the earlier example to explain and predict illness (all data is fictitious). We see a very thin line between Marital status and Illness, which indicates that one’s marital status has barely any effect on the prediction of illness. Pollution on the other hand is connected with Illness by a relatively thick line, representing a big effect size. This can be investigated in detail by hovering over the Pollution box, which reveals a bar chart representing levels of Illness for different levels of Pollution. If the predictor (pollution) is continuous you could also use a scatter plot with a regression line. In my experience, however, this adds too much clutter to the graph, so it might be preferable to bin your pollution into more or less equally sized groups and present it as a categorical variable by means of a bar chart.
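Binning a continuous predictor into more or less equally sized groups, as suggested in the caption above, could look like the sketch below. The records are fictitious and the helper name `bin_into_equal_groups` is my own; the resulting averages are what you would put on the bar chart.

```python
from statistics import mean


def bin_into_equal_groups(records, n_bins):
    """Sort records by pollution and cut them into n_bins groups of (nearly) equal size."""
    ordered = sorted(records, key=lambda r: r["pollution"])
    size, remainder = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `remainder` bins get one extra record when sizes don't divide evenly.
        end = start + size + (1 if i < remainder else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


# Fictitious citizen X year records: pollution level and days of illness.
records = [
    {"pollution": p, "days_ill": d}
    for p, d in [(12, 3), (35, 8), (20, 4), (41, 9), (15, 2), (28, 6)]
]

summary = []
for group in bin_into_equal_groups(records, 3):
    label = f"{group[0]['pollution']}-{group[-1]['pollution']}"  # lowest-highest value in the bin
    summary.append((label, mean(r["days_ill"] for r in group)))

for label, average in summary:
    print(f"pollution {label}: {average} days ill on average")
```

Each bar then represents the same number of observations, which keeps a handful of extreme pollution values from dominating the picture.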

Figure 2: Left: undecorated cake, right out of the oven. Right: the same cake while being decorated professionally by the author’s nieces, Milla (almost 5) and Amber (7).

Going full-scale

Baking one cake is nice, but if you need many similar cakes you need to move on from your archaic methods and get yourself a production line. For the ingredients you needed to shop for, an automated pipeline has to be created. Such a pipeline would typically take the form of extract-transform-load (ETL) processes that make new data available at either fixed intervals or continuously. Your recipe probably takes the form of a Jupyter or R-Markdown notebook, but this is the equivalent of the cake recipe your grandma once wrote on a napkin. It’s time for you to publish your baking book in the form of a package.

Apart from being easily deployable on a bigger scale, it should also be more mature than your notebook in that it includes versioning and unit tests. The output will also be created automatically (including decorations and frosting of course), but instead of landing on your computer it should immediately be put on a cake stand integrated in the client’s reporting system.
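As a rough idea of what moving from a notebook to a tested package means, the sketch below splits a tiny ETL process into separate functions and adds one unit test. Everything here is hypothetical: the `citizen;year;days_ill` line format, the function names, and the list that stands in for a database table. The test is written as a plain function with assertions, the style that a test runner such as pytest collects.

```python
def extract(lines):
    """Extract: parse raw 'citizen;year;days_ill' lines into dictionaries of strings."""
    rows = []
    for line in lines:
        citizen, year, days_ill = line.strip().split(";")
        rows.append({"citizen": citizen, "year": year, "days_ill": days_ill})
    return rows


def transform(rows):
    """Transform: drop rows without a target value and convert text to numbers."""
    return [
        {"citizen": r["citizen"], "year": int(r["year"]), "days_ill": int(r["days_ill"])}
        for r in rows
        if r["days_ill"] != ""
    ]


def load(rows, store):
    """Load: append the clean rows to the store (a list standing in for a table)."""
    store.extend(rows)
    return len(rows)


def run_pipeline(lines, store):
    """Run the three steps in order and return the number of rows loaded."""
    return load(transform(extract(lines)), store)


def test_pipeline_skips_rows_without_target():
    store = []
    loaded = run_pipeline(["c1;2020;4\n", "c2;2020;\n"], store)
    assert loaded == 1
    assert store == [{"citizen": "c1", "year": 2020, "days_ill": 4}]
```

Because each step is a small function, a rotten egg in next month's delivery shows up as a failing test instead of a spoiled cake on the client's table.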

If your client gets hungry at irregular intervals you might consider creating a cake mix. This contains everything else mentioned in this section, but adds a small app that enables the client to create a fresh cake on demand.

I hope this little allegory helped you understand some of the difficulties a data-scientist faces. Hungry, but not sure if you should handle the dough yourself? Keyrus can provide you with some of the best bakers!


