I have recently started delivering some sessions of The Information Lab's free public Alteryx training (see the details here). As part of the training I need to get my notes and ideas in shape to better deliver the session, so I figured: what better way to pull it all together than writing a blog post and hopefully getting some feedback from the community on what can be added, changed or better explained? So without further ado: data blending in Alteryx 101.
Data Blending 101
Part 1 – Survey Analytics
ETL (extract, transform, load) is one of the major processes an analyst deals with: files often arrive on a regular schedule, and sometimes each one matches the same format or schema (see an explanation of what a database schema is here), sometimes they don't. The first example is a series of survey files that all follow the same schema.
The process of importing and preparing the data set follows a logical pathway. Using an Input Data tool you navigate to the folder with all the files you want to import and select one. From here Alteryx has a nice trick: the wildcard input. Instead of Survey_Store_9123.csv you can enter Survey_Store_*.csv, which selects all the files matching that pattern.
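For those more comfortable in code, here is a rough pandas sketch of the same wildcard-import step (the file names and columns are made up for illustration; the files are written to a temp folder just so the example is self-contained):

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small sample survey files that share one schema
# (stand-ins for the real Survey_Store_*.csv files).
folder = tempfile.mkdtemp()
pd.DataFrame({"SurveyID": [1], "Q1": [5]}).to_csv(
    os.path.join(folder, "Survey_Store_9123.csv"), index=False)
pd.DataFrame({"SurveyID": [2], "Q1": [4]}).to_csv(
    os.path.join(folder, "Survey_Store_4871.csv"), index=False)

# Wildcard input: read every matching file and stack them into one frame.
paths = sorted(glob.glob(os.path.join(folder, "Survey_Store_*.csv")))
surveys = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
```

Like the Alteryx wildcard, this only works cleanly when all the matched files share the same schema.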
So the survey results can be browsed easily, we add a Sort tool to arrange the data in ascending order by survey number.
The next issue is that one person can respond across multiple survey cycles. To get around this we use the Unique tool. The Unique tool keeps the first instance of an entry and sends all subsequent appearances to the duplicate output. For this example we identify duplicates by the customer's name and date of birth.
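The Unique tool's two outputs can be sketched in pandas like this (the column names here are assumptions for illustration):

```python
import pandas as pd

# Sample responses: "Ann" has answered two survey cycles.
surveys = pd.DataFrame({
    "CustomerName": ["Ann", "Ben", "Ann"],
    "DateOfBirth":  ["1990-01-01", "1985-06-30", "1990-01-01"],
    "SurveyID":     [101, 102, 250],
})

# First occurrence per (name, date of birth) is kept; later ones
# are flagged as duplicates, mirroring the Unique tool's two outputs.
dup_mask = surveys.duplicated(subset=["CustomerName", "DateOfBirth"], keep="first")
unique_out = surveys[~dup_mask]     # the "unique" output
duplicate_out = surveys[dup_mask]   # the "duplicate" output
```

Because "first instance wins", the sort order applied beforehand decides which response survives, exactly as in the Alteryx workflow.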
Next we want to remove any surplus columns in the data and change the field types to numeric where needed (in our case the question fields are converted to Int16).
The final step is to convert the data set to a tall and narrow data set for Tableau. To achieve this we use the transpose tool.
|Configuring the Transpose tool|
The setup for the Transpose tool is pretty simple: the top section, “Key Fields”, holds all the fields you want repeated on each row of data, while the “Data Fields” are the values you want to appear in the data set. You aren't able to rename the output columns at this point, but you can add a Select tool immediately after the Transpose and rename them there.
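The wide-to-tall reshape the Transpose tool performs corresponds to a melt in pandas; a minimal sketch, with made-up question columns:

```python
import pandas as pd

# Wide survey data: one row per response, one column per question.
wide = pd.DataFrame({
    "SurveyID": [1, 2],
    "StoreID":  [10, 11],
    "Q1": [5, 3],
    "Q2": [4, 2],
})

# Transpose ~ melt: "key fields" (id_vars) repeat on every row,
# "data fields" (value_vars) become name/value pairs. Renaming the
# output columns here plays the role of the Select tool afterwards.
tall = wide.melt(id_vars=["SurveyID", "StoreID"],
                 value_vars=["Q1", "Q2"],
                 var_name="Question", value_name="Response")
```

Each response row fans out into one row per question, which is the tall-and-narrow shape Tableau prefers.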
Now you can export directly to a Tableau extract (tde file) and begin your analysis.
Part 2 – Sales vs Survey
Combining separate data sources is a common and never-ending process. Data sets constantly come to the analyst from Excel, databases, scraped websites, Twitter; the list goes on. Combining them into a single data stream for analysis can be difficult to do and challenging to replicate, and if every piece of analysis is ad hoc, that difficulty becomes an impediment to building the rich, complex data set you actually want to work with.
Following on from our survey data, the next question for us is: how is that information reflected in the sales? The sales data isn't stored in the same place, shape or format, so joining the data sets together requires getting them into a comparable shape. Logically we want to look at individual sales for each store, and the process for aggregating the data in Alteryx is simply to use the Summarize tool.
|Summarising data is simple in Alteryx|
In the Summarize configuration, setting StoreID to Group By means every store ID appears as a record in the results, while adding Avg actions on Sales and Profit returns the aggregations we want. Worth noting: any field not given an action will not appear downstream. This means any information in the Purchase Date or Year fields is lost and cannot be used further. For this example that is exactly what we want.
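The same group-and-aggregate step looks like this in pandas (field names assumed for illustration); note how the ungrouped, unaggregated Year column disappears from the output, just as in the Summarize tool:

```python
import pandas as pd

sales = pd.DataFrame({
    "StoreID": [10, 10, 11],
    "Sales":   [100.0, 200.0, 50.0],
    "Profit":  [10.0, 30.0, 5.0],
    "Year":    [2015, 2015, 2015],  # given no action, so it is dropped
})

# Summarize ~ groupby/agg: one record per StoreID, with the averages we asked for.
store_sales = (sales.groupby("StoreID", as_index=False)
                    .agg(Avg_Sales=("Sales", "mean"),
                         Avg_Profit=("Profit", "mean")))
```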
We now want to aggregate the survey data to the same store level so the two sets can be joined. The first step is to import the data using the same workflow as above, stopping before the Transpose tool (we want to keep the data at the survey-response level for the joining; we can transpose later). We then place a Summarize tool and repeat the aggregation we want for the surveys (group by StoreID, average each of the response fields).
At this point we have two data sets aggregated to the StoreID level, and a unique field we can join on. By dragging the Join tool onto the canvas and defining the join condition StoreID == StoreID, the data sets come together.
Alteryx Joins Segue
Joins in Alteryx work a little differently to SQL. The Join tool gives you three outputs: “L”, “J”, and “R”. The “L” output contains all the rows from the left-hand stream that have no match in the right stream; the “R” output is the exact opposite. The “J” output is the inner join: all the rows that have a match in both streams. Many applications need all the data from one side of the join plus any matches to it (a left outer join in SQL speak); to achieve that, attach a Union tool after the Join and have it combine the “L” and “J” outputs automatically by column name. Laszlo Zsom over on The Information Lab blog has a great post covering how Alteryx and Tableau treat joins.
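One way to see the three outputs side by side is a pandas merge with the indicator flag, which tags each row by where it matched (sample data invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({"StoreID": [10, 11], "AvgSales": [150.0, 50.0]})
right = pd.DataFrame({"StoreID": [11, 12], "AvgScore": [4.2, 3.8]})

# A full outer merge plus the _merge indicator splits cleanly into
# Alteryx's three Join outputs.
merged = left.merge(right, on="StoreID", how="outer", indicator=True)
l_out = merged[merged["_merge"] == "left_only"]   # "L": left rows with no match
j_out = merged[merged["_merge"] == "both"]        # "J": the inner join
r_out = merged[merged["_merge"] == "right_only"]  # "R": right rows with no match
```

A SQL-style left outer join is then simply the “J” rows unioned with the “L” rows, which is exactly what the Union-after-Join pattern above produces.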
Back to the post
So: we want to join the sales with the surveys, and we only want the places where there are both surveys and sales. For this we take the “J” output. To finish, we output to a .tde and analyse in Tableau.
Part 3 – Twitter Analysis
Social media is everywhere, and it is an increasingly important metric for analysis. Being able to score the sentiment of a tweet is a useful tool for tracking a social media presence; it can also help confirm or refute the results we get from the surveys, and be associated with the sales information.
|A simple sentiment analysis workflow to analyse a Twitter log|
We start from a prepared Twitter results set (in a company you could use a scheduled connection to the Twitter API). Using the sentiment macro we can apply a score to how positive each tweet is. While we are using it as a bit of a black box (or black bubble, in this case), you can look at what is happening under the hood by right-clicking -> Open Macro.
Next we want to separate the individual words into their own rows (so we can analyse how different words affect sentiment). Using the Text To Columns tool we define the delimiter as a space (\s), which splits the message at each space. With the Split to Rows option selected, a new row is generated for each word.
Now we tidy the data set: remove the extra columns generated by the sentiment macro, use a little RegEx to remove all characters that aren't letters, and convert all the text to lower case for consistency.
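The split-to-rows and tidy-up steps together can be sketched in pandas (the tweets and column names here are invented):

```python
import re

import pandas as pd

tweets = pd.DataFrame({
    "TweetID":   [1, 2],
    "Message":   ["Great store!!", "bad SERVICE today"],
    "Sentiment": [0.8, -0.5],
})

# Text To Columns with Split to Rows ~ split on spaces, then explode
# so each word gets its own row (other fields repeat alongside it).
words = tweets.assign(Word=tweets["Message"].str.split()).explode("Word")

# Tidy up: strip anything that isn't a letter, then lower-case.
words["Word"] = (words["Word"]
                 .map(lambda w: re.sub(r"[^A-Za-z]", "", w))
                 .str.lower())
```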
The final step is to join the Twitter words against a list of common words. By taking the “L” output we remove all the common words from the stream so they don't skew any further analysis.
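This common-word removal is an anti-join: keep only the rows with no match in the stop-word list, i.e. the “L” output. A pandas sketch with a tiny made-up word list:

```python
import pandas as pd

words = pd.DataFrame({"TweetID": [1, 1, 2], "Word": ["the", "store", "great"]})
common = pd.DataFrame({"Word": ["the", "a", "and"]})

# Anti-join via merge + indicator: "left_only" rows are the words that
# do NOT appear in the common-word list (the Join tool's "L" output).
joined = words.merge(common, on="Word", how="left", indicator=True)
uncommon = joined[joined["_merge"] == "left_only"].drop(columns="_merge")
```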
Part 4 – Census Data
When a company is looking to expand, open its first physical store, or even decide on the best place to put a warehouse, working out the reach of a location is key. One of the factors we will look at is: who lives in the area where we plan to open a store? For that information we bring together the Census data, which is often delivered as many different files with some key linking fields, a common situation with many different causes (database outputs, data scraped from different web pages, competitor pricing analysis for particular product types).
|Lots of files with different schemas?|
When we receive a folder with lots of different files in it, without Alteryx some of us might despair: that is a lot of copy-paste actions or custom SQL to write. With Alteryx it's simple. In our example we received a folder with five different census files: age, cars, economic activity, ethnic group and household composition. The plan is to combine these files into a single Alteryx database (.yxdb) so we can complete the geographic analysis later.
It would be nice to use the same Census*.csv trick from the earlier example, but inside these files the schemas are completely different. What we do have is a consistent set of linking fields (date and geography code) across all the census files. We could use a series of joins with unions (to emulate a full outer join) and repeat this until all the files have come together, but in Alteryx there is a better way: a nice little tool called Join Multiple. With this tool you specify the joining fields (date and geography code for us) and bring all the files together at once. The Join Multiple also lets you choose which fields to keep and rename any fields you wish.
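In code terms, Join Multiple amounts to folding a merge on the shared keys over every file; a minimal pandas sketch with invented census columns (and only three of the five files, for brevity):

```python
from functools import reduce

import pandas as pd

keys = ["Date", "GeographyCode"]

# Stand-ins for the census extracts: different columns, same linking fields.
age = pd.DataFrame({"Date": [2011], "GeographyCode": ["E01"], "MedianAge": [38]})
cars = pd.DataFrame({"Date": [2011], "GeographyCode": ["E01"], "CarsPerHousehold": [1.2]})
ethnic = pd.DataFrame({"Date": [2011], "GeographyCode": ["E01"], "LargestGroup": ["White"]})

# Join Multiple ~ merge every frame on the shared keys in one pass.
census = reduce(lambda left, right: left.merge(right, on=keys), [age, cars, ethnic])
```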
|And they all come together|
You do have to be careful with Join Multiple, as you can quickly create Cartesian joins that generate a huge number of records; that's why the tool gives you drop-down options for handling them if it happens.
For us the joins are nice and simple, and all that is left is to prepare the data for later analysis. So we select the fields we want to keep, change their data types where needed, remove any records with null dates (they are error rows) and finally remove the extra space in the geography code so it can be used as a match key later.
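Those clean-up steps can be sketched in pandas too (all column names and values here are assumptions for illustration):

```python
import pandas as pd

census = pd.DataFrame({
    "Date": ["2011", "2011", None],        # the None row is an error record
    "GeographyCode": [" E01", "E02 ", "E03"],
    "MedianAge": ["38", "41", "0"],
    "Junk": ["x", "y", "z"],               # a surplus column to drop
})

# Keep wanted fields, drop null-date error rows, fix the numeric type,
# and trim the stray spaces so the geography code works as a match key.
clean = (census[["Date", "GeographyCode", "MedianAge"]]
         .dropna(subset=["Date"])
         .assign(MedianAge=lambda d: d["MedianAge"].astype(int),
                 GeographyCode=lambda d: d["GeographyCode"].str.strip()))
```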
At this point in the training day we move on to the spatial analytics. I will cover that section of the training another day.
I would love to hear any feedback from anyone to help me improve and develop my sessions.