When it comes to data preparation and data analytics, Alteryx is one of the go-to platforms. It provides a perfect platform for integrating data from all your data resources, but often, the data is not in the best format for analysis. Transforming your data is one of the essential aspects of data preparation is reshaping the data for visual analytics or data science projects. This blog post offers a step-by-step guide on reshaping data with Alteryx, from pivoting to transposing and more.
What is Data Reshaping?
Data reshaping refers to the process of restructuring the format of a dataset. It involves altering the rows, columns, or structure to transform the data into a form that best serves your analytical needs. For visual analysis, you need a dataset with one row per variable. For Data Science, you want to isolate a categorical variable so that only one value is represented in a single column.
I previously wrote about Tidy Data in Alteryx, and data shaping leverages the concepts discussed there. To understand why Tidy Data is important, you can read that post, and here I will focus on the mechanics of reshaping.
But as a quick highlight:
Why is Reshaping Important?
- Easier Analysis: Reshaping data can make it easier to perform complex analyses.
- Data Integrity: It ensures data consistency and integrity.
- Optimization: Reshaped data can speed up queries and improve the performance of analytics tools.
Reshaping Techniques in Alteryx
There are many ways to reshape data in Alteryx depending on your goals. You can combine multiple transformation steps into a Macro, combine a series of steps into a workflow container or use a single tool.
In reality, the single tool option is a key component of all the different options and will usually leverage one of three tools:
It is useful to note that for every tool above Alteryx has created a “One tool Example”. This example shows how a tool is configured and some common uses for it. These One tool examples are the template I will use to demonstrate the configurations.
Transposing is the technique of converting rows into columns. This makes the data set narrower and longer. This is often used in Business visualisation processes with BI tools like Tableau or PowerBI. It is also used in Python or R scripts.
Steps to Transpose in Alteryx:
- Open Alteryx and drag your data into the workflow.
- Add a Transpose tool from the “Transform” tool bin.
- Connect the tool to your data and configure the settings.
- Select the fields that will be repeated down the table.
- Select the records that will become the new rows in the output table.
Pivoting involves summarizing and rotating data for easier analysis. This is the opposite process from the transpose, where the table becomes shorter and wider. When configuring the Cross Tab tool, the Group by section is the records you want to repeat, while the “Change Column Headers” is the new columns created and “Values for New Columns” are the content in the rows.
Steps to Pivot in Alteryx:
- Open Alteryx and drag your data into the workflow.
- Add a ‘Cross Tab’ tool from the “Transform” tool bin.
- Connect the tool to your data.
- Configure the settings like name, aggregation methods, and headers.
Sampling allows you to take a subset of your data for quicker analysis. While Alteryx can analyse datasets with hundreds of millions of records, iterating quickly over the dataset allows for a more fluid analysis process. Reducing the dataset size allows for faster iteration by reducing waiting times for each run cycle.
There are many options for reducing the dataset size, from limiting the Input tool to sampling tools. Each of these options allows you to reduce the dataset size at different points of the workflow.
Global Record Limit
The first option for limiting record inputs is to use the global record limit. In the Runtime tab of the workflow configuration, you can set a record limit for every input tool. By setting the global record limit, all input tools will return a subset of the entire data source from the start of the input.
An additional benefit of using the global option is when you have completed your iterations and are ready to run the analysis on the entire data set, you only make a change in one location. You can quickly reset the limit reading all records from all inputs by making one change.
Input Tool Record Limit
In some situations, using the global limit doesn’t make sense. Most datasets might be relatively small, but one or two significantly slow your iteration rate. For this example, you can use the record limit included in an input tool.
In the options of every input tool, the first option is to define a record limit. This record limit works the same as the global limit, and the tool will import the records and, once the record limit is hit, will stop importing. It allows the limitation to be isolated to a particular tool and not impact any other inputs.
Sometimes limiting the records that are being read doesn’t make sense at all. In Alteryx, there are three main tools for in-workflow sampling:
- Sample Tool
- Select Records
- Random % Sample
These three tools allow you to subsample your data set. The Select Records and Random % Sample tools are relatively simple and complete the sampling very simply.
The Select Records tool allows the selection of records based on the record number. For example, adding the 1-5 in the configuration page would select records 1, 2, 3, 4 & 5. Entering -100 would select all records up to 100, while 100+ would select all records after the 100th.
In comparison, the Random % Sample tool will randomly select the data based on the proportions you choose. For example, if you configure the tool to take 5% of the samples, your 100,000 record input would become 5,000 records. But rather than taking 5,000 sequential records, they would be distributed across the full data stream.
The Sample tool works a little differently and requires more configuration. First, you select what type of sampling you want from the choices:
- First N rows
- Last N rows
- Skip 1st N rows
- 1 of every N rows
- 1 in N chance to include each row
This set of options changes the decision of if each row is included. The “First”, “Last”, and “Skip” options are relatively self-explanatory. The last two options add a couple of probabilistic options for sampling. You can learn more about each option from the Alteryx help pages.
When considering the sampling options, each provides specific use cases for the sampling tools. The select records tool is great when specific records are needed; it is very computationally efficient and can process the sampling fast. But it is limited in the records that can be returned. On the other hand, the Sample and Random sample tools can be more representative of the underlying data, but they require the entire data set and applying calculations to determine if each record is needed, resulting in a slower process. You will have to decide if the trade-off is right in your application.
Reshaping data is a vital step in data preparation and analytics. Alteryx offers a range of tools and techniques for reshaping your data effectively. From transposing to pivoting, the platform is designed to make the data-reshaping process as straightforward as possible. Mastering these techniques can significantly enhance your data analysis capabilities, leading to more meaningful insights.
Happy data reshaping with Alteryx!