A few times I have been asked what it is I do exactly as a data-scientist, and managers and potential data-scientists especially are interested in the common struggles we as data-scientists have to deal with. Just listing all the issues we come across would not make for an interesting read, so I will present it to you in the form of an analogy you’re all familiar with: baking cake.
Many different recipes for cake exist, and the same can be said for models in data-science. The advantage is twofold: firstly, we can choose our recipe based on the available ingredients. Secondly, we have a choice in complexity, ranging from classic models from statistics to the most complex deep-learning methods. Given the improvements made in recent years regarding the availability and ease of use of the latter, we’re often tempted to go for one of these, but if your client is hungry you might be better off starting with a more basic cake.
Your cake ingredients are your data: when you lay everything out it turns out the eggs you have are rotten (bad data quality), and some key ingredients are simply missing. In an ideal situation all necessary ingredients can be found at your local (data)warehouse, but most of the time you’ll have to shop around a bit. You shouldn’t limit yourself to data already available within the company but rather broaden your horizon with data from web APIs, open data, or in some cases even purchased external data.
Throwing every ingredient in a bowl and just hoping for the best outcome generally does not result in tasty cake. In many data-science projects combining the data is the most time-consuming part, especially if the data originates from a wide array of sources. A difficult choice we have to make is the granularity at which we want to analyse the data.
Take for example a study of health and illness in a large population: for each citizen you know how many days they were ill in each year – so you don’t know which days of the year, only the yearly total – and your task is to find as many significant predictors of illness as possible. We can write this as granularity citizen X year (or vice versa), and this would imply that every row in our dataset corresponds to one person in one particular year. Now one of your other datasets contains air pollution levels per region (let’s say a region is some specific geographical area, so not just a postal code or city name), but these measurements are only taken every two years (region X 2-years). How do we join these datasets?

First let’s take a look at the periodic variable: we could either create a region X 2-years dataset by averaging days of illness over two years, or we could use interpolation on the pollution dataset, so for the years in which no measurements took place we estimate the level of pollution based on the measurements of the previous and next year. In this case I would argue for the latter solution, as a reduction of the information in the target variable (days of illness) will be more harmful to our predictive power than adding a bit of noise to one of the predictors. Be careful, however, when you apply such an interpolation or another way to estimate the missing data points: if, for example, the level of pollution over time follows a very irregular pattern, the estimate might be way off.

The pollution is now converted to granularity region X year, but we still can’t join it with the illness data directly, because that data does not contain a region variable. What we do have for every citizen in each year is their address. Using a process called geocoding, the address can be converted to coordinates (i.e. longitude/latitude). For each of the regions in the pollution dataset we have a polygon, so it’s possible to assign each coordinate (and therefore each citizen’s home) to one of the regions by calculating in which polygon the coordinate lies, and we can now enrich the illness dataset with the level of air pollution. This is just one example to show how combining multiple datasets is often a little less straightforward than just joining tables in a database.
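To make this concrete, here is a minimal sketch in Python of the two steps described above: interpolating the biyearly pollution measurements to yearly values, and assigning each (already geocoded) home to a region polygon with a spatial join. All file, column, and dataset names are hypothetical, and the pandas/geopandas calls assume the data is shaped as described in the comments.

```python
import pandas as pd
import geopandas as gpd

# --- Step 1: interpolate the biyearly pollution measurements to yearly values ---
# Hypothetical input: one row per region per measurement year (every two years).
pollution = pd.read_csv("pollution_biyearly.csv")   # columns: region_id, year, pollution

wide = pollution.pivot(index="year", columns="region_id", values="pollution")
wide = wide.reindex(range(wide.index.min(), wide.index.max() + 1))  # add the missing years
wide = wide.interpolate(method="linear")                            # estimate them per region
pollution_yearly = wide.reset_index().melt(
    id_vars="year", var_name="region_id", value_name="pollution"
)
# pollution_yearly is now at granularity region X year

# --- Step 2: assign each citizen's home to a region polygon ---
# Assume the addresses were already geocoded to longitude/latitude elsewhere.
citizens = pd.read_csv("citizens_geocoded.csv")     # columns: citizen_id, year, days_ill, lon, lat
citizens = gpd.GeoDataFrame(
    citizens,
    geometry=gpd.points_from_xy(citizens["lon"], citizens["lat"]),
    crs="EPSG:4326",
)
regions = gpd.read_file("regions.geojson")          # columns: region_id, geometry (polygons)

# Point-in-polygon join: in which region does each home lie?
citizens = gpd.sjoin(citizens, regions, how="left", predicate="within")

# Finally, enrich the illness data with the interpolated pollution level.
enriched = citizens.merge(pollution_yearly, on=["region_id", "year"], how="left")
```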
Baking your cake – or calculating/running your data-science model – is seen as the core of the process, but it is also often the easiest part. What outsiders see as the magic the data-scientist is capable of performing is in practice little more than running a few lines of code – assuming, of course, that all previous steps were followed thoroughly. One of the few things that can go wrong is keeping your cake in the oven for too long; by that I mean overfitting, of course. In some cases – depending on the amount of data and the complexity of the model – you can take the baking analogy quite literally, as it might seriously heat up your CPU or GPU.
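How do you notice the cake has been in the oven too long? One common check is to hold out part of the data and compare performance on the training set with performance on the held-out set: a large gap suggests overfitting. A minimal sketch with scikit-learn, using synthetic data as a stand-in for the real dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (predictors X, target y).
X, y = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# A deliberately flexible model that is easy to "leave in the oven too long".
model = GradientBoostingRegressor(n_estimators=1000, max_depth=4, random_state=0)
model.fit(X_train, y_train)

train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

# A training score far above the test score means the model has started to
# memorise the training data rather than learn a generalisable recipe.
print(f"train R^2: {train_r2:.2f}, test R^2: {test_r2:.2f}")
```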
Even though the cake is now out of the oven, we’re far from finished. The cake needs frosting, decorations, and whatever else you can put on a cake to make it more attractive, as you will have to sell your data-science cake to a client whose appetite still needs a boost. Whatever came out of R, Python, or any other brand of oven has meaning to the data-scientist, but not as much to the client. The following picture gives an example of an undecorated cake versus a decorated cake.
On the left is the R output of a rather simple model: a generalised linear regression with two continuous predictors, one categorical predictor and some first-order interaction effects. On the right you see the frosting applied to the cake, in the form of a “spider”.
This is an intuitive way to present which variables contribute most to a predictive model: the width of the legs corresponds to an appropriate effect size or statistical significance, and hovering over one of the predictors reveals a bar chart showing average levels of the target variable as a function of the levels of that predictor (the image you see here is not interactive, however).
Figure 1: Left: typical R output. Right: a “spider”. The spider shown corresponds to the earlier example of explaining and predicting illness (all data is fictitious). We see a very thin line between Marital status and Illness, which indicates that one’s marital status has barely any effect on the prediction of illness. Pollution, on the other hand, is connected to Illness with a relatively thick line, representing a large effect size. This can be investigated in detail by hovering over the Pollution box, revealing a bar chart with levels of Illness for different levels of Pollution. If the predictor (pollution) is continuous you could also use a scatter plot with a regression line; in my experience, however, this adds too much clutter to the graph, so it might be preferable to bin the pollution into more or less equally sized groups and present it as a categorical variable by means of a bar chart.
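For readers who wonder what it takes to produce the “undecorated cake” on the left, here is a sketch in Python with statsmodels of a model of that shape: a generalised linear regression with two continuous predictors, one categorical predictor and a first-order interaction effect. The variable names merely mirror the fictitious illness example; the synthetic data, the Poisson family, and the use of absolute z-statistics as effect sizes are assumptions of mine, not what the original model used.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fictitious data at citizen X year granularity, mirroring the illness example.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "pollution": rng.normal(50, 10, n),                                  # continuous predictor
    "age": rng.integers(18, 90, n),                                      # continuous predictor
    "marital_status": rng.choice(["single", "married", "divorced"], n),  # categorical predictor
})
df["days_ill"] = rng.poisson(lam=np.exp(0.01 * df["pollution"] + 0.005 * df["age"]))

# Poisson GLM with a first-order interaction effect.
model = smf.glm(
    "days_ill ~ pollution + age + C(marital_status) + pollution:age",
    data=df,
    family=sm.families.Poisson(),
).fit()

print(model.summary())   # the "undecorated cake": a table of coefficients, z-values, p-values

# Something like the absolute z-statistics could then drive the width of the
# spider's legs (one value per term in the model).
effect_sizes = model.tvalues.abs().drop("Intercept")
print(effect_sizes)
```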
Figure 2: Left: an undecorated cake, straight out of the oven. Right: the same cake while being decorated professionally by the author’s nieces, Milla (almost 5) and Amber (7).

Going full-scale
Baking one cake is nice, but if you need many similar cakes you need to move on from your archaic methods and get yourself a production line. For the ingredients you needed to shop for, an automated pipeline has to be created. Such a pipeline would typically take the form of extract-transform-load (ETL) processes that make new data available at either fixed intervals or continuously. Your recipe probably takes the form of a Jupyter or R-Markdown notebook, but this is the equivalent of the cake recipe your grandma once wrote on a napkin. It’s time for you to publish your baking book in the form of a package.
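As a sketch of what graduating from napkin recipe to production line might look like, here is a minimal extract-transform-load skeleton in Python. The module name, file names, column names, and transform logic are all placeholders for whatever your notebook currently does.

```python
# cake_pipeline.py -- hypothetical module name; the contents are placeholders.
import pandas as pd


def extract(source_path: str) -> pd.DataFrame:
    """Fetch the raw ingredients from a file, database or API."""
    return pd.read_csv(source_path)


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean and combine: the granularity and joining logic from your notebook."""
    clean = raw.dropna(subset=["citizen_id", "year"])
    return clean.groupby(["citizen_id", "year"], as_index=False)["days_ill"].sum()


def load(result: pd.DataFrame, target_path: str) -> None:
    """Put the cake on the stand: write the result where the client's reporting can reach it."""
    result.to_parquet(target_path, index=False)


def run_pipeline(source_path: str, target_path: str) -> None:
    """One end-to-end bake, to be scheduled at fixed intervals or triggered on demand."""
    load(transform(extract(source_path)), target_path)


if __name__ == "__main__":
    run_pipeline("raw_illness.csv", "illness_by_citizen_year.parquet")
```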
Apart from being easily deployable on a bigger scale, it should also be more mature than your notebook in that it includes versioning and unit tests. The output will also be created automatically (including decorations and frosting, of course), but instead of landing on your computer it should immediately be put on a cake stand integrated into the client’s reporting system.
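A unit test for such a package can be as small as the following (pytest style; the module name refers to the hypothetical skeleton above):

```python
import pandas as pd

from cake_pipeline import transform   # hypothetical module from the ETL skeleton above


def test_transform_sums_days_per_citizen_and_year():
    raw = pd.DataFrame({
        "citizen_id": [1, 1, 2, 2],
        "year": [2020, 2020, 2020, 2021],
        "days_ill": [3, 2, 0, 7],
    })
    result = transform(raw)
    # One row per citizen per year, with the days of illness summed.
    assert len(result) == 3
    assert result.loc[result["citizen_id"] == 1, "days_ill"].iloc[0] == 5
```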
If your client gets hungry at irregular intervals you might consider creating a cake-mix. This contains everything else mentioned in this paragraph, but adds a small app that enables the client to create a fresh cake on demand.
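Such a cake-mix could be as small as a one-page app wrapped around the same package; a sketch with Streamlit, again assuming the hypothetical cake_pipeline module from above (run it with `streamlit run app.py`):

```python
import streamlit as st

from cake_pipeline import run_pipeline   # hypothetical ETL skeleton from above

st.title("Cake on demand")
source = st.text_input("Source file", "raw_illness.csv")
target = st.text_input("Output file", "illness_by_citizen_year.parquet")

if st.button("Bake a fresh cake"):
    run_pipeline(source, target)
    st.success(f"Fresh cake served at {target}")
```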
I hope this little allegory helped you understand some of the difficulties a data-scientist faces. Hungry, but not sure if you should handle the dough yourself? Keyrus can provide you with some of the best bakers!
Any questions? Don’t hesitate to contact me: joris.pieters@keyrus.com