Once you start getting into data science, one of the biggest challenges you face, and where much of your time is spent, is getting data into a tidy and usable state. I recently came across a blog post by Jean-Nicolas Hould which took the Tidy Data process, as defined in a paper by Hadley Wickham (of ggplot2, plyr, lubridate and RStudio fame), and applied it in Python. I thought putting in my 2 cents and applying this process to Alteryx would be really useful for a number of reasons. First, the Alteryx predictive models are based on R, and providing tidy data helps with that process. Secondly, the processes used for creating tidy data are built into Alteryx as some of its key data preparation tools. And finally, I want to get more practice creating datasets for predictive analysis (I really don't get to do it as much as I want).
So here is the thing: the section titles that I'm going to use are taken straight from Wickham's paper (why bother re-inventing the wheel?). No one ever accused me of being the most creative person on earth.
Defining Tidy Data
So what is tidy data? From the definition that Wickham has outlined, tidy data has the following properties:
- Each variable forms a column and contains values
- Each observation forms a row
- Each type of observational unit forms a table
Breaking out those definitions (so we're all talking the same language):
- Variable: A measurement or an attribute. Height, weight, sex, etc.
- Value: The actual measurement or attribute. 152 cm, 80 kg, female etc.
- Observation: All values measured on the same unit. Each person.
A messy dataset example:
Name | Treatment A | Treatment B |
---|---|---|
John Smith | – | 2 |
Jane Doe | 16 | 11 |
Mary Johnson | 3 | 1 |
A tidy dataset example:
Name | Treatment | Result |
---|---|---|
John Smith | a | – |
John Smith | b | 2 |
Jane Doe | a | 16 |
Jane Doe | b | 11 |
Mary Johnson | a | 3 |
Mary Johnson | b | 1 |
Tidying Messy Datasets
In Wickham’s paper there are 5 types of messy data that we will look at:
- Column headers are values, not variable names
- Multiple variables are stored in one column
- Variables are stored in both rows and columns
- Multiple types of observational units are stored in the same table
- A single observational unit is stored in multiple tables.
Column headers are values, not variable names
Often data comes in a format that is really designed to be viewed by people, usually in a presentation of some kind, with values of a variable stored as the column names (here, the income groups).
Example 1: Pew Research Center report [1]
This dataset has been “pivoted” to be viewed in a table of a printed report. To convert it into a tidy format we need to reverse this.
Religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k |
---|---|---|---|---|---|---|
Agnostic | 27 | 34 | 60 | 81 | 76 | 137 |
Atheist | 12 | 27 | 37 | 52 | 35 | 70 |
Buddhist | 27 | 21 | 30 | 34 | 33 | 58 |
Catholic | 418 | 617 | 732 | 670 | 638 | 1116 |
Don’t know/refused | 15 | 12 | 15 | 11 | 10 | 35 |
Evangelical Prot | 575 | 869 | 1064 | 982 | 881 | 1486 |
Hindu | 1 | 9 | 7 | 9 | 11 | 34 |
Historically Black Prot | 228 | 244 | 236 | 238 | 197 | 223 |
Jehovahs Witness | 20 | 27 | 24 | 24 | 21 | 30 |
Jewish | 19 | 19 | 25 | 25 | 30 | 95 |
The first ten rows of data on income and religion from the Pew Forum. Three columns ($75-100k, $100-150k and >150k) have been omitted.
The process for cleaning this in Alteryx is very simple, taking two tools. First is the Transpose tool, with the Religion column chosen as the key field (this field gets stacked together and repeated) and the rest of the columns left as the data fields. This is followed by a Select tool to rename the new columns (Name and Value) to something more understandable (income and frequency).
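For comparison with the Python post that inspired this one, here is a minimal pandas sketch of the same two steps (the file name pew.csv is an assumption for illustration):

```python
import pandas as pd

df = pd.read_csv("pew.csv")  # one row per religion, one column per income band

# Transpose tool equivalent: keep "religion" as the key, stack every other column
tidy = df.melt(id_vars=["religion"], var_name="income", value_name="freq")

# Select tool equivalent: melt() already renames the new columns via var_name/value_name
print(tidy.head(10))
```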
The head of the tidy dataset:
religion | income | freq |
---|---|---|
Agnostic | <$10k | 27 |
Agnostic | $10-20k | 34 |
Agnostic | $20-30k | 60 |
Agnostic | $30-40k | 81 |
Agnostic | $40-50k | 76 |
Agnostic | $50-75k | 137 |
Atheist | $10-20k | 27 |
Atheist | $20-30k | 37 |
Atheist | $30-40k | 52 |
Atheist | $40-50k | 35 |
Example 2: Billboard Top 100 Dataset
This second example is taken from the Billboard Top 100 songs. It contains data where:
- The columns are composed of the week number values (x1st.week etc.)
- If a song is in the top 100 for less than 75 weeks, the remaining columns are nulls (not helpful for analysis).
year | artist.inverted | track | time | genre | date.entered | date.peaked | x1st.week | x2nd.week | … |
---|---|---|---|---|---|---|---|---|---|
2000 | Destiny’s Child | Independent Women Part I | 3:38 | Rock | 2000-09-23 | 2000-11-18 | 78 | 63.0 | … |
2000 | Santana | Maria, Maria | 4:18 | Rock | 2000-02-12 | 2000-04-08 | 15 | 8.0 | … |
2000 | Savage Garden | I Knew I Loved You | 4:07 | Rock | 1999-10-23 | 2000-01-29 | 71 | 48.0 | … |
2000 | Madonna | Music | 3:45 | Rock | 2000-08-12 | 2000-09-16 | 41 | 23.0 | … |
2000 | Aguilera, Christina | Come On Over Baby (All I Want Is You) | 3:38 | Rock | 2000-08-05 | 2000-10-14 | 57 | 47.0 | … |
2000 | Janet | Doesn’t Really Matter | 4:17 | Rock | 2000-06-17 | 2000-08-26 | 59 | 52.0 | … |
2000 | Destiny’s Child | Say My Name | 4:31 | Rock | 1999-12-25 | 2000-03-18 | 83 | 83.0 | … |
2000 | Iglesias, Enrique | Be With You | 3:36 | Latin | 2000-04-01 | 2000-06-24 | 63 | 45.0 | … |
2000 | Sisqo | Incomplete | 3:52 | Rock | 2000-06-24 | 2000-08-12 | 77 | 66.0 | … |
2000 | Lonestar | Amazed | 4:25 | Country | 1999-06-05 | 2000-03-04 | 81 | 54.0 | … |
There are a couple of extra steps to mostly tidy this dataset. According to Wickham's definition, the repetition of the artist name and track information (genre, time etc.) means the dataset is not completely tidy. In Alteryx this can often be ignored, as many of the problems with the replication are handled "under the hood" by Alteryx and its R prediction macros. In either case, this repetition will be taken care of in later steps.
The process starts with the same transposing of the dataset, only this time we select multiple columns as the key fields (year, artist.inverted etc.), with all the week columns left as the data fields. The next step is to parse the week number out of the newly created Name field. This is easily achieved with the RegEx tool (regex is a text pattern matching language that you can learn about elsewhere); in this case I used the expression (\d+) to capture one or more digits (the week number) into its own column called week. The final steps involve removing and renaming columns where needed, filtering out the rows where the rank is "NA" or empty, and finally sorting the entire dataset by artist, track and week number.
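A minimal pandas sketch of the same sequence, assuming the file name billboard.csv and the column names shown in the table above:

```python
import pandas as pd

df = pd.read_csv("billboard.csv")

id_cols = ["year", "artist.inverted", "track", "time", "genre",
           "date.entered", "date.peaked"]

# Transpose tool equivalent: stack all the x1st.week ... columns into rows
tidy = df.melt(id_vars=id_cols, var_name="week", value_name="rank")

# RegEx tool equivalent: pull the digits out of "x1st.week", "x2nd.week", ...
tidy["week"] = tidy["week"].str.extract(r"(\d+)", expand=False).astype(int)

# Filter tool equivalent: drop the padding rows where the song had left the chart
tidy = tidy.dropna(subset=["rank"])

# Sort tool equivalent
tidy = tidy.sort_values(["artist.inverted", "track", "week"])
print(tidy.head())
```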
Multiple variables are stored in one column
Example: Tuberculosis Records from the World Health Organization [2]
In the Tuberculosis records dataset there are a couple of key challenges:
- Some columns encode multiple variables: sex and age (e.g. m014 = males aged 0-14)
- A mixture of zeros and NaN. This is important to note as the two values represent very different concepts and must be treated very differently in the analysis. Understanding where your data comes from is important in deciding how to treat these variables and the analysis that can be done on them.
country | year | m014 | m1524 | m2534 | m3544 | m4554 | m5564 | m65 | mu | f014 |
---|---|---|---|---|---|---|---|---|---|---|
AD | 2000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NaN | NaN |
AE | 2000 | 2 | 4 | 4 | 6 | 5 | 12 | 10 | NaN | 3 |
AF | 2000 | 52 | 228 | 183 | 149 | 129 | 94 | 80 | NaN | 93 |
AG | 2000 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NaN | 1 |
AL | 2000 | 2 | 19 | 21 | 14 | 24 | 19 | 16 | NaN | 3 |
AM | 2000 | 2 | 152 | 130 | 131 | 63 | 26 | 21 | NaN | 1 |
AN | 2000 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | NaN | 0 |
AO | 2000 | 186 | 999 | 1003 | 912 | 482 | 312 | 194 | NaN | 247 |
AR | 2000 | 97 | 278 | 594 | 402 | 419 | 368 | 330 | NaN | 121 |
AS | 2000 | NaN | NaN | NaN | NaN | 1 | 1 | NaN | NaN | NaN |
In the raw data file there are a couple of preparation issues that also need to be taken care of before we get to creating the tidy dataset. First, the column headers have extra prefixes that need to be removed, and there are unwanted columns to drop.
To make the preparation easier to follow I filtered the data set down to a single year (2000) then started the analysis.
The first step in the preparation is to remove the prefix from the column names and drop the unwanted columns. The next step is to transpose the variables into rows, much like before.
Now we need to separate the multiple variables out of the single column. Often this appears with delimiters built into the original column names (e.g. _, -, : etc.) and a simple Text To Columns tool can be used to separate the details. In this particular example we need slightly more sophisticated logic, using regular expressions (I learned about regular expressions from the https://regexone.com/ tutorials and often refer back to http://www.regexr.com/ for assistance in building the expression that I want). For parsing the sex out from the ages I used the expression
(\D)(\d{1,2})(\d{2}$)
What this translates to: each pair of parentheses is a group (a variable), with the first group matching a non-digit (the sex), the next matching one or two digits (the start of the age range) and the final group matching two digits at the end of the string (the end of the age range).
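A minimal pandas sketch of that split, assuming the file name tb.csv, that the prefixes have already been stripped from the headers, and that the melted column is called "variable":

```python
import pandas as pd

df = pd.read_csv("tb.csv")

# Transpose tool equivalent: country and year stay as keys, the m014-style columns stack
tidy = df.melt(id_vars=["country", "year"], var_name="variable", value_name="cases")

# Same grouping logic as the RegEx tool: non-digit (sex), 1-2 digits (age lower
# bound), final 2 digits (age upper bound). Columns such as "m65" or "mu" do not
# fit this pattern and would need separate handling.
parts = tidy["variable"].str.extract(r"(\D)(\d{1,2})(\d{2}$)")
tidy["sex"] = parts[0]
tidy["age"] = parts[1] + "-" + parts[2]
print(tidy.head())
```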
The last couple of steps are all about tidying up the final appearance of the table (e.g. changing the over 65 group to 65+) and removing the helper columns that were created.
Variables are stored in both rows and columns
Example: Global Historical Climatology Network Dataset [3]
In this dataset we see the daily weather records (stored in individual columns) for a particular weather station in Mexico. In addition to the days appearing in columns, the temperature element (min or max) appears in rows rather than in individual columns.
id | year | month | element | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
---|---|---|---|---|---|---|---|---|---|---|---|
MX17004 | 2010 | 1 | tmax | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
MX17004 | 2010 | 1 | tmin | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
MX17004 | 2010 | 2 | tmax | NaN | 27.3 | 24.1 | NaN | NaN | NaN | NaN | NaN |
MX17004 | 2010 | 2 | tmin | NaN | 14.4 | 14.4 | NaN | NaN | NaN | NaN | NaN |
MX17004 | 2010 | 3 | tmax | NaN | NaN | NaN | NaN | 32.1 | NaN | NaN | NaN |
MX17004 | 2010 | 3 | tmin | NaN | NaN | NaN | NaN | 14.2 | NaN | NaN | NaN |
MX17004 | 2010 | 4 | tmax | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
MX17004 | 2010 | 4 | tmin | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
MX17004 | 2010 | 5 | tmax | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
MX17004 | 2010 | 5 | tmin | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
The trick to creating a tidy dataset out of this rows-and-columns situation is knowing how to combine the Transpose tool and the Cross Tab tool, and ensuring that the right key fields are selected each time.
For this particular example, the Transpose tool keeps id, year, month and element as key fields while all the 'd' fields are returned as new rows. For the Cross Tab tool a similar process takes place: in this case id, year, month and the newly created Name field (containing the 'd' values) are the grouping fields, the column headers come from the element field, and the values used to fill the records come from the Value column.
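A minimal pandas sketch of that Transpose + Cross Tab combination, assuming the file name weather.csv and the d1..d31 column names shown above:

```python
import pandas as pd

df = pd.read_csv("weather.csv")

# Transpose tool equivalent: id/year/month/element stay as keys, the d* columns stack
long = df.melt(id_vars=["id", "year", "month", "element"],
               var_name="day", value_name="value")
long["day"] = long["day"].str.extract(r"(\d+)", expand=False).astype(int)
long = long.dropna(subset=["value"])

# Cross Tab tool equivalent: the element values (tmin/tmax) become column headers
tidy = long.pivot_table(index=["id", "year", "month", "day"],
                        columns="element", values="value").reset_index()
print(tidy.head())
```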
Multiple types of observational units are stored in the same table
Example: Billboard Top 100 Dataset (revisited)
The earlier billboard dataset actually contained a couple of different types of observational units in the single table, in this case the information about the tracks themselves and the rank in each week.
Extending the original workflow, a couple of additional steps are added. First, a track identifier field is added to the original workflow (using a Multi-Row Formula tool and some IF THEN logic). The next step is to separate the track information into its own data table, while the second flow removes the track information from the rank table, keeping the identifier as the link between the two.
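A minimal pandas sketch of the same separation; the file name billboard.csv and the song_id numbering scheme are assumptions for illustration:

```python
import pandas as pd

df = pd.read_csv("billboard.csv")
song_cols = ["year", "artist.inverted", "track", "time", "genre",
             "date.entered", "date.peaked"]
tidy = df.melt(id_vars=song_cols, var_name="week", value_name="rank")

# One table of unique songs with a surrogate key (the track identifier)
songs = tidy[song_cols].drop_duplicates().reset_index(drop=True)
songs["song_id"] = songs.index + 1

# A second table holding only the weekly ranks, keyed to the songs table
ranks = tidy.merge(songs, on=song_cols)[["song_id", "week", "rank"]]
print(songs.head())
print(ranks.head())
```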
A single observational unit is stored in multiple tables.
Example: Illinois Male Baby Names for 2014/2015 [4]
The final situation is when the same observational type appears in multiple tables. This is a common situation in business where a single file is published for a particular data set (monthly performance, or quarterly sales etc.).
In our example of baby names, a new file is published each year and needs to be combined into a single dataset for further analysis. In Alteryx this is often a very simple problem: in the Input Data tool you can use wildcards such as * and ? to allow flexibility in which files are read into a workflow.
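The equivalent wildcard-and-union step can be sketched in pandas too; the file pattern below is an assumption matching one file per year:

```python
import glob
import pandas as pd

# Wildcard input equivalent: pick up every yearly file matching the pattern
files = glob.glob("baby_names_*.csv")

# Union tool equivalent: read each yearly file and stack them into one table
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
print(combined.shape)
```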
Often the format of these files will change over time, and those changes will require you to undertake many of the preparation steps described earlier and then bring those datasets together using the Join and Union tools. Unfortunately the changes are usually unique to each situation, so I won't cover them here. I am sure that in the future I will do a post on an analysis I have undertaken that will highlight the use of both those tools in detail (you can also have a look at my post on data preparation 101 here).
Thanks for reading. It's a pretty big post, so hopefully you got something out of it.
Sources:
1. http://religions.pewforum.org/pdf/comparison-Income%20Distribution%20of%20Religious%20Traditions.pdf
2. http://www.who.int/tb/country/en/
3. https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn
4. https://www.ssa.gov/OACT/babynames/