Tidy Data in Alteryx

Once you start getting into data science one of the biggest challenges you face, and where much of your time is spent, is getting data into a tidy and usable state. I came across an blog post (by Jean-Nicolas Hould) recently which applied the process for Tidy Data as defined by Hadley Wickham of R (ggplot2, plyr, lubridate packages and more) and R Studio fame where the process of creating tidy data, as defined in Wickham’s paper, was applied in Python. I thought putting in my 2 cents and applying this process to Alteryx would be really useful for a number of reasons. First, the Alteryx predictive models are based on R and providing tidy data helps with the process. Secondly, the processes used for creating tidy data are built-in as some of the key data preparation processes in Alteryx. And finally, I want to get more practice creating data sets for predicative analysis (I really don’t get to do as much as I want)
So here is the thing, the section titles that I’m going to use are taken straight from Wickham’s paper (why bother re-inventing the wheel). No one ever accused me of being the most creative person on earth Smile

Defining Tidy Data

So what is tidy data? From the definition that Wickham has outlined, tidy data has the following properties:

  • Each variable forms a column and contains values
  • Each observation forms a row
  • Each type of observational unit forms a table

breaking out those definitions (so were all talking the same language):

  • Variable: A measurement or an attribute. Height, weight, sex, etc.
  • Value: The actual measurement or attribute. 152 cm, 80 kg, female etc.
  • Observation: All values measured on the same unit. Each person.

A messy dataset example:

Treatment A Treatment B
John Smith 2
Jane Doe 16 11
Mary Johnson 3 1

A tidy dataset example:

Treatment Result
John Smith a
John Smith b 2
Jane Doe a 16
Jane Doe b 11
Mary Johnson a 3
Mary Johnson b 1

Tidying Messy Datasets

In Wickham’s paper there are 5 types of messy data that we will look at:

  • Column headers are values, not variable names
  • Multiple variables are stored in one column
  • Variables are stored in both rows and columns
  • Multiple types of observational units are stored in the same table
  • A single observational unit is stored in multiple tables.

Column headers are values, not variable names

Often data comes in a format that is really designed to be viewed by people, usually in presentations of some kind, with variables stored as column names (the income groups)

Example 1. Pew Research Center report. Source

This dataset has been “pivoted” to be viewed in a table of a printed report. To convert it into a tidy format we need to reverse this.

Religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
Agnostic 27 34 60 81 76 137
Atheist 12 27 37 52 35 70
Buddhist 27 21 30 34 33 58
Catholic 418 617 732 670 638 1116
Don’t know/refused 15 12 15 11 10 35
Evangelical Prot 575 869 1064 982 881 1486
Hindu 1 9 7 9 11 34
Historically Black Prot 228 244 236 238 197 223
Jehovahs Witness 20 27 24 24 21 30
Jewish 19 19 25 25 30 95

The first ten rows of data on income and religion from the Pew Forum. Three columns, $75-100k, $100-150k and >150k, have been omitted
 
The process for cleaning this in Alteryx is very simple taking two tools. First the transpose tool, choosing the religion column as a key (this field gets stacked together and repeated) with the rest of the columns left as the data fields. This is followed by a Select tool to rename the new columns (name and value) to something more understandable (income and frequency)
image
 
The head of the tidy dataset

religion income freq
Agnostic <$10k 27
Agnostic $10-20k 34
Agnostic $20-30k 60
Agnostic $30-40k 81
Agnostic $40-50k 76
Agnostic $50-75k 137
Atheist $10-20k 27
Atheist $20-30k 37
Atheist $30-40k 52
Atheist $40-50k 35

Example 2: Billboard Top 100 Dataset
This second example is taken from the billboard top 100 songs. It contains data where:

  • The columns are composed of the week number values (x1st.week etc.)
  • If a song is in the top 100 for less than 75 weeks, the remaining columns are nulls (not helpful for analysis).

year artist.inverted track time genre date.entered date.peaked x1st.week x2nd.week
2000 Destiny’s Child Independent Women Part I 3:38 Rock 2000-09-23 2000-11-18 78 63.0
2000 Santana Maria, Maria 4:18 Rock 2000-02-12 2000-04-08 15 8.0
2000 Savage Garden I Knew I Loved You 4:07 Rock 1999-10-23 2000-01-29 71 48.0
2000 Madonna Music 3:45 Rock 2000-08-12 2000-09-16 41 23.0
2000 Aguilera, Christina Come On Over Baby (All I Want Is You) 3:38 Rock 2000-08-05 2000-10-14 57 47.0
2000 Janet Doesn’t Really Matter 4:17 Rock 2000-06-17 2000-08-26 59 52.0
2000 Destiny’s Child Say My Name 4:31 Rock 1999-12-25 2000-03-18 83 83.0
2000 Iglesias, Enrique Be With You 3:36 Latin 2000-04-01 2000-06-24 63 45.0
2000 Sisqo Incomplete 3:52 Rock 2000-06-24 2000-08-12 77 66.0
2000 Lonestar Amazed 4:25 Country 1999-06-05 2000-03-04 81 54.0

There are a couple of extra steps to mostly tidy this data set. According to Wickham’s definition the extra repetition of the Artist Name and track information (genre, time etc.) means the dataset is not completely tidy. Using Alteryx this can often be ignored as many of the problems with the replication can be handled “under the hood” of Alteryx and its R prediction macros. In either case this repetition will be taken care of in later steps.
image

The process starts with the same transposing of the data set, only this time we will select multiple columns as the key fields (year, artists.inverted etc). With all the week columns being left as a the data fields. The next step is to parse out the week number from the newly created “name” field. This is easily achieved with the Regex tool (regex is a text pattern recognition language that you can learn about elsewhere) in this case I used the regex query (d+) to group out one or more occurrences of digits (so the week numbers) into its own column called week. The final steps involve removing and renaming the columns where needed, filtering out the columns listed as “NA” or empty then finally sorting the entire dataset by the Artist, Track and the week number.

Multiple variables are stored in one column

Example: Tuberculosis Records from World Health Organization2

In the Tuberculosis records dataset there are a couple of key challenges:

  • Some columns contain multiple values: sex and age (e.g. m014 = Males aged under 14yo)
  • A mixture of zeros and NaN. This is important to note as the two values represent very different concepts and must be treated very differently in the analysis. Understanding where your data comes from is important in deciding how to treat these variables and the analysis that can be done on them.
country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu f014
AD 2000 0 0 1 0 0 0 0 NaN NaN
AE 2000 2 4 4 6 5 12 10 NaN 3
AF 2000 52 228 183 149 129 94 80 NaN 93
AG 2000 0 0 0 0 0 0 1 NaN 1
AL 2000 2 19 21 14 24 19 16 NaN
3
AM 2000 2 152 130 131 63 26 21 NaN 1
AN 2000 0 0 1 2 0 0 0 NaN 0
AO 2000 186 999 1003 912 482 312 194 NaN 247
AR 2000 97 278 594 402 419 368 330 NaN 121
AS 2000 NaN NaN NaN NaN 1 1 NaN NaN NaN

In the raw data file there are a couple of preparation issues that also need to be taken care of before we get to creating the tidy dataset. First in the headers there are extra prefixes that need to be removed and and columns
multiple Values in columns
To make the preparation easier to follow I filtered the data set down to a single year (2000) then started the analysis.
The first step in the preparation is to remove prefix from the columns and drop the unwanted columns. Next step is to transpose the variables into rows much like previously done.
Now we need to separate out the multiple variables from the single columns. Often this appears with delimiters built into the original column names (e.g. _, –, : etc.) and a simple text-to-columns tool can be used to separate the details. In this particular example we needed to apply a bit more sophisticated logic to separate out the columns with regular expressions (I learn about regular expressions from https://regexone.com/ tutorials and often refer back to http://www.regexr.com/ for assistance in building the expression that I want). For parsing out the sex from the ages I used the logic of
(D)(d{1,2})(d{2}$)
what this translates to is each pair of parentheses is a group or variable with the first being not a digit, next being one or two digits and the final column being two digits at the end of the sentence.
The last couple of steps are all about tidying up the final appearance of the table (e.g. changing the over 65 group to 65+) and removing the helper columns that were created.

Variables are stored in both rows and columns

Example: Global Historical Climatology Network Dataset3

In this dataset we see the daily weather records (stored in individual columns) for a particular weather station in Mexico. In addition to the days appearing in columns, the temperature element (min or max) appear in rows (rather than individual columns)

id year month element d1 d2 d3 d4 d5 d6 d7 d8
MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN NaN
MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN NaN
MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN NaN
MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN NaN
MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN NaN
MX17004 2010 3 tmin NaN NaN NaN NaN 14.2 NaN NaN NaN
MX17004 2010 4 tmax NaN NaN NaN NaN NaN NaN NaN NaN
MX17004 2010 4 tmin NaN NaN NaN NaN NaN NaN NaN NaN
MX17004 2010 5 tmax NaN NaN NaN NaN NaN NaN NaN NaN
MX17004 2010 5 tmin NaN NaN NaN NaN NaN NaN NaN NaN

The process for creating a tidy data set out of the rows and columns situation is knowing how to combine both the transpose tool and the cross tab tool and ensuring that the right key fields are selected each time.

values in both rows and columns
For this particular example it requires keeping the id, year, month and element as key fields while all the ‘d’ fields are returned as new rows. For the cross tab tool a similar process takes place. in this case the id, year, month and the newly created name field (containing the ‘d’ values) are key fields while the column headers selected are from the new element field while the value to fill in the records is from the value column.

Multiple types of observational units are stored in the same table

Example 2: Billboard Top 100 Dataset

The earlier billboard dataset actually contained a couple of different types of observational units in the single table, in this case the information about the tracks themselves and the rank in each week.
multiple observational units in single table
Extending out the original workflow a couple of additional steps are added first an track identifier field is added to the original workflow (using a multi-row formula and some IF THEN logic). The next step is to separate the track information into one data table, the second flow is to remove the track information from the rank table.

A single observational unit is stored in multiple tables.

Example: Illinois Male Baby Names for year 2014/20154

The final situation is when the same observational type appears in multiple tables. This is a common situation in business where a single file is published for a particular data set (monthly performance, or quarterly sales etc.).
In
our example of baby names each year a new file is published each year and needs to be combined into a single data set for further analysis. In Alteryx this is often a very simple problem. Using the input tool you can use wild cards such as * and ? to allow flexibility in which files are read into a workflow.
obervational values in multiple tables
Often over time the format of these files will change and those changes will require you to undertake many of the preparation steps described earlier and the bring those datasets together using the join and union tools. Unfortunately for this post the changes are usually unique to each situation so I wont cover those here. I am sure that in the future I will do a post on an analysis that I have undertaken and that will highlight the use of both those tools in detail. (you can also have a look at my post on data preparation 101 here)
Thank for reading its a pretty big post so hopefully you have go something out of it.

Sources:

1. http://religions.pewforum.org/pdf/comparison-Income%20Distribution%20of%20Religious% 20Traditions.pdf
2. http://www.who.int/tb/country/en/
3. https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn
4. https://www.ssa.gov/OACT/babynames/