The Journey Begins
For any budding data scientist to build their data science chops and reputation, they need to develop lots of models across different applications. One of the best resources for doing that, and for seeing how well you are doing compared to other people, is Kaggle. The first question to answer is: what is Kaggle? According to Wikipedia, “Kaggle is a platform for predictive modelling and analytics competitions in which statisticians and data miners compete to produce the best models for predicting and describing the datasets uploaded by companies and users. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modelling task and it is impossible to know beforehand which technique or analyst will be most effective.”
Basically, Kaggle is a community where companies of any size and motive can upload a data set and set a task for Kagglers (i.e. what the community needs to predict), and then the competition begins. Individuals and teams develop models and processes to solve the host’s problem. The person or team with the best solution (based on the competition’s rules) wins. The prize can be anything from experience to USD 1,000,000!
Kaggle Titanic with Alteryx and Tableau
The Titanic dataset is one of the most common introductory datasets in data science, and it is a great way into the Kaggle competition circuit. While there is no prize for this competition, it is a good way to build knowledge and experience.
As with any data science project, the first step is to understand the data set you are looking at. This is often achieved through Exploratory Data Analysis (EDA), an area where Tableau shines.
On initial analysis of the data, we can see that one of the likely key factors will be sex (“women and children first”), along with socio-economic status (passenger class). Something else that may help is family size (bigger families tend to have lower survival rates), but that could again be related to socio-economic status (lower classes tend to have more children), so keeping an eye on the correlations between those variables will be important. We can leverage Alteryx to look at the correlations across the data set.
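If you want to sanity-check what the Tableau views suggest without opening Tableau, a few lines of Python will do it. The sketch below is only an illustration: the rows are made up rather than taken from the Kaggle file, the column names (Survived, Sex, Pclass, SibSp, Parch) are the ones I understand the Kaggle Titanic training file to use, and FamilySize and survival_rate_by are names I have introduced here. Counting family size as siblings/spouses plus parents/children plus the passenger themselves is a common convention, not something the data set defines.

```python
from collections import defaultdict

def survival_rate_by(rows, key):
    """Return {group value: share of passengers in that group who survived}."""
    totals = defaultdict(lambda: [0, 0])  # group -> [survivors, passengers]
    for row in rows:
        bucket = totals[row[key]]
        bucket[0] += row["Survived"]
        bucket[1] += 1
    return {group: survived / count for group, (survived, count) in totals.items()}

# Made-up rows for illustration only; NOT real Titanic records.
rows = [
    {"Sex": "female", "Pclass": 1, "SibSp": 1, "Parch": 0, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "SibSp": 0, "Parch": 0, "Survived": 1},
    {"Sex": "male",   "Pclass": 3, "SibSp": 4, "Parch": 2, "Survived": 0},
    {"Sex": "male",   "Pclass": 1, "SibSp": 0, "Parch": 0, "Survived": 1},
    {"Sex": "male",   "Pclass": 2, "SibSp": 0, "Parch": 0, "Survived": 0},
    {"Sex": "female", "Pclass": 3, "SibSp": 3, "Parch": 2, "Survived": 0},
]

# Assumed convention: family size = siblings/spouses + parents/children + self.
for row in rows:
    row["FamilySize"] = row["SibSp"] + row["Parch"] + 1

by_sex = survival_rate_by(rows, "Sex")
by_class = survival_rate_by(rows, "Pclass")
by_family = survival_rate_by(rows, "FamilySize")

print("By sex:", by_sex)
print("By class:", by_class)
print("By family size:", by_family)
```

With the real training file you would load the rows from the CSV instead of typing them in. Comparing the class and family-size breakdowns side by side is a quick way to see whether the two are telling the same story.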
Looking at the static report produced by Alteryx, we get confirmation of what we saw in our Tableau EDA: Sex (0.54) and Pclass (-0.36) have the strongest correlations with survival. Taking a moment to interpret the output: females are coded as 1, so the positive value indicates that females survived more often, while the negative value for Pclass shows that 1st class passengers (the lowest Pclass value) survived more often. Digging deeper, we see a strong relationship between Pclass and Fare (1st class tickets cost more) and a reasonable relationship between Pclass and Age (1st class passengers are generally older).
What this tells us is that there are a few key variables, along with other supporting and potentially collinear variables, and we need to consider how to treat them.
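To make the sign interpretation concrete, here is a minimal sketch of how a correlation coefficient like these can be computed by hand. I am assuming the report uses the Pearson correlation; check the tool configuration if it matters. The pearson function and the eight toy passengers are my own illustration, not the Titanic data, so the numbers will not match the 0.54 and -0.36 above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy values for illustration only; NOT real Titanic records.
survived   = [1, 1, 0, 0, 1, 0, 1, 0]
sex_female = [1, 1, 0, 0, 1, 0, 0, 1]  # female coded as 1, male as 0
pclass     = [1, 2, 3, 3, 1, 2, 1, 3]  # 1 = first class, 3 = third class

r_sex = pearson(sex_female, survived)
r_class = pearson(pclass, survived)

# Positive: the group coded 1 (female) survives more often in the toy data.
print(f"Sex vs Survived:    {r_sex:+.2f}")
# Negative: lower class numbers (1st class) survive more often in the toy data.
print(f"Pclass vs Survived: {r_class:+.2f}")
```

The same function can be pointed at Pclass and Fare, or Pclass and Age, to check for the collinearity between supporting variables.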
The last thing I want to highlight is the output of the Field Summary tool, which shows an overview of the distribution of each field. In addition to the distributions, we will need to consider the missing values in the Age, Cabin and Embarked fields. When we try to build our predictive models, these missing values will cause issues in the training process and prevent us from making a good model. How we address them will change how we go about producing a model. For example, replacing the missing ages with a simple mean or median could add bias to our model and change our predictions. Making that substitution isn’t necessarily wrong, but we will need to investigate and justify whether it holds up.
There are a number of other things that need to be explored to understand the relationships between all the different factors, but I will leave that to you. Till next time, good luck.
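As a rough illustration of the missing-value question, the sketch below counts the gaps per field and then fills the missing ages with either the median or the mean. Everything in it is an assumption for demonstration purposes: the rows are made up (including the cabin codes), and count_missing and impute are helper names I have introduced, not anything from Alteryx.

```python
from statistics import mean, median, pstdev

def count_missing(rows):
    """Count missing (None or empty string) entries for each field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts.setdefault(field, 0)
            if value is None or value == "":
                counts[field] += 1
    return counts

def impute(values, strategy=median):
    """Replace None with a single fill value computed from the known values."""
    known = [v for v in values if v is not None]
    fill = strategy(known)
    return [fill if v is None else v for v in values]

# Made-up rows for illustration only; NOT real Titanic records.
rows = [
    {"Age": 22.0, "Cabin": None,  "Embarked": "S"},
    {"Age": None, "Cabin": "C85", "Embarked": "C"},
    {"Age": 38.0, "Cabin": None,  "Embarked": None},
    {"Age": 26.0, "Cabin": None,  "Embarked": "S"},
    {"Age": None, "Cabin": None,  "Embarked": "Q"},
    {"Age": 35.0, "Cabin": "E46", "Embarked": "S"},
    {"Age": 54.0, "Cabin": None,  "Embarked": "S"},
    {"Age": 2.0,  "Cabin": None,  "Embarked": "S"},
]

missing = count_missing(rows)
ages = [row["Age"] for row in rows]
known_ages = [age for age in ages if age is not None]

ages_median = impute(ages, median)
ages_mean = impute(ages, mean)

# Filling every gap with one central value pulls the distribution inwards.
spread_before = pstdev(known_ages)
spread_after = pstdev(ages_median)

print("Missing per field:", missing)
print("Median-filled ages:", ages_median)
print("Mean-filled ages:", ages_mean)
print(f"Spread before {spread_before:.2f}, after {spread_after:.2f}")
```

The shrinking spread is one way the substitution can bias a model: every passenger with an unknown age ends up looking like a typical adult. Filling by group (for example, the median age within each passenger class) is a common alternative worth comparing.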