AT Data Science – Tuning for the best sound

In the last entry in this series, we created a Random Forest model to try to predict which passengers survived their trip on the Titanic. Without any customisation of the Random Forest, we achieved an accuracy of 0.75119, not as good as our default decision tree (0.78468), so no improvement in our overall score. The next step is to take the default Random Forest and change its settings to get a better result.

Every machine learning algorithm has a number of settings that are defined before the model is trained, usually based on the developer's best guess and experience. The default parameters are often good enough for most applications, but they are rarely the best configuration. So how do we decide what is best? This process is called hyperparameter tuning, and it is possible for all machine learning models.

What to improve

Unchanged Random Forest Customisation Options

When we look at the Random Forest model in Alteryx, we have a number of options that we can customise. The main parameters to change to improve performance are the total allowable nodes in a tree and the number of variables considered at each node in a tree.

How do we decide which combinations are better? What settings should we try when tuning our model? The most comprehensive way to test all the combinations of parameters is a process called Grid Search.

Grid Search is a process where every combination of the parameters is tested by training a model for each one. This means that if you have 10 possible options for the number of variables at each node in the tree and another 10 options for the total nodes in the tree, a grid search trains a total of 100 models, and you pick the best model from all those combinations.
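To make that arithmetic concrete, here is a minimal Python sketch (not part of the Alteryx workflow) that enumerates such a grid. The two ranges are illustrative assumptions, chosen only to give 10 options for each parameter.

from itertools import product

# Illustrative ranges only: 10 options for each parameter
num_vars_options = range(1, 11)            # variables considered at each node
total_nodes_options = range(10, 110, 10)   # total allowable nodes in a tree

# Every pairing of the two parameters: 10 x 10 = 100 combinations to train
grid = list(product(num_vars_options, total_nodes_options))
print(len(grid))  # 100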

My macro for hyper tuning a random forest in Alteryx

Tuning with macros

Alteryx is a great platform for creating and running repeatable processes. By creating a batch macro, the 100 combinations of parameters can be tested relatively quickly. A batch macro is a workflow that takes a list of options and repeats the entire workflow for each entry in the list. The key interface tool for creating a batch macro is the Control Parameter, which supplies the value that gets updated on each iteration to the workflow.

The final configuration step for a batch macro is to connect the Control Parameter to the tool that needs updating (the Forest Model tool for us) with an Action interface tool, which sets the options we need (in my macro, the 'Num Vars' value and the 'Total.nodes' value) to the next value in the list.
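For readers who think in code rather than canvases, the function below is a rough Python analogue of what the batch macro does for each entry in the list. It assumes scikit-learn, maps 'Num Vars' onto max_features and 'Total.nodes' onto max_leaf_nodes, and uses cross-validated accuracy as the score; the mapping and the scoring are my assumptions, not the internals of the Forest Model tool.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def run_one_configuration(X, y, num_vars, total_nodes):
    """Train and score one Random Forest for a single entry in the grid list."""
    model = RandomForestClassifier(
        max_features=num_vars,       # stand-in for the macro's 'Num Vars' value
        max_leaf_nodes=total_nodes,  # stand-in for the macro's 'Total.nodes' value
        random_state=42,
    )
    # 5-fold cross-validated accuracy stands in for the macro's model report
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# The batch-macro loop: repeat the whole process once per entry in the list
# results = [(nv, tn, run_one_configuration(X, y, nv, tn)) for nv, tn in grid]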

The full macro process for testing the entire grid search list

Now we can run the macro over the configuration list. I used two Generate Rows tools to create the grid search list for my tuning macro, ran that list through the tuning macro, and fed the results into a charting tool to decide on the options I wanted. The model performed best with 5 variables per node and a maximum of 60 nodes in the tree; this combination gave the maximum accuracy (83%).
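In place of the charting tool, the same decision can be sketched as a simple lookup over the results table. The rows below are invented placeholder numbers purely to show the selection step; only the middle row reflects the combination reported above.

# Placeholder results: (num_vars, total_nodes, accuracy). Only the middle row
# reflects the reported outcome (5 variables, 60 nodes, ~83%); the other rows
# are made up solely to illustrate the selection step.
results = [
    (3, 40, 0.79),
    (5, 60, 0.83),
    (8, 100, 0.81),
]
best = max(results, key=lambda row: row[2])  # highest accuracy wins
print(f"{best[0]} variables per node, {best[1]} total nodes -> {best[2]:.0%} accuracy")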

So how did I do?

The end result of this model is a Kaggle score of 0.78947, an overall improvement that moves us up the table to 3440, another small step forward.

Kaggle Submission Leaderboard

So what else can we do?

Next time, we will try some neural networks and see how they can improve our overall performance.

