Exercise 3: Data Transformation in Python¶
Now that we have cleaned our dataset, it is ready to be analysed. This exercise teaches how to sort and filter data, summarize and aggregate data, add and rename columns, reshape data, perform log transformations, and export data to CSV or Excel.
from IPython.display import display, HTML
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Reload the dataset and repeat the cleaning steps from the previous exercise
health_survey_clean = pd.read_csv("sample_health_data.csv").dropna().drop_duplicates()
health_survey_clean['sex'] = health_survey_clean['sex'].astype('category')
Filter Data¶
To filter for entries that meet certain criteria, we can use the query() function or square bracket [] notation. We already know how to filter for specific columns using square bracket notation, as we did in the previous exercise, where we filtered for the sex column.
Let's take this up a notch, using square brackets to filter for rows where the sex column is 'Female':
condition = health_survey_clean['sex'] == 'Female'
condition
0       True
1      False
2       True
3      False
4       True
        ...
625     True
626     True
627     True
628     True
629     True
Name: sex, Length: 620, dtype: bool
This prints a series containing True/False values for each row, indicating whether the condition is met.
We can then use this boolean series to filter the DataFrame, including only the rows where the condition is True:
female_entries = health_survey_clean[condition]
female_entries.head()
|    | id | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|----|----|------|-----|---|---|---|---|
| 0  | 1  | 2021 | Female | 3.4 | 0.00 | 3.4 | 72.2 |
| 2  | 3  | 2019 | Female | 3.3 | 1.91 | 3.3 | 78.3 |
| 4  | 5  | 2010 | Female | 3.2 | 22.04 | 3.2 | 71.2 |
| 9  | 10 | 2008 | Female | 3.3 | 0.00 | 3.3 | 78.6 |
| 14 | 15 | 2010 | Female | 3.3 | 0.00 | 3.3 | 77.1 |
Alternatively, we can achieve the same result with a more readable syntax using the query() function, passing it the condition sex == "Female":
female_entries = health_survey_clean.query('sex == "Female"')
female_entries.head()
|    | id | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|----|----|------|-----|---|---|---|---|
| 0  | 1  | 2021 | Female | 3.4 | 0.00 | 3.4 | 72.2 |
| 2  | 3  | 2019 | Female | 3.3 | 1.91 | 3.3 | 78.3 |
| 4  | 5  | 2010 | Female | 3.2 | 22.04 | 3.2 | 71.2 |
| 9  | 10 | 2008 | Female | 3.3 | 0.00 | 3.3 | 78.6 |
| 14 | 15 | 2010 | Female | 3.3 | 0.00 | 3.3 | 77.1 |
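query() also makes it easy to combine several conditions in one expression. As a quick illustration (the variable name recent_female_entries is just for this example, not part of the exercise output), the following filters for female entries recorded in 2015 or later:
recent_female_entries = health_survey_clean.query('sex == "Female" and year >= 2015')
recent_female_entries.head()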
Sort Data¶
To sort the data, we can use the sort_values() function. For instance, let's sort the dataset by alcohol_consumption_mean_weekly_units in descending order:
sorted_by_alcohol = health_survey_clean.sort_values(by='alcohol_consumption_mean_weekly_units', ascending=False)
sorted_by_alcohol.head()
|     | id  | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|-----|-----|------|-----|---|---|---|---|
| 475 | 476 | 2016 | Male | 2.7 | 126.90 | 2.7 | 75.0 |
| 42  | 43  | 2016 | Male | 2.8 | 104.44 | 2.8 | 76.8 |
| 438 | 439 | 2018 | Male | 2.8 | 101.20 | 2.8 | 71.2 |
| 440 | 441 | 2015 | Male | 2.8 | 96.28 | 2.8 | 68.4 |
| 376 | 377 | 2016 | Female | 2.9 | 85.55 | 2.9 | 76.0 |
Combining filtering and sorting allows us to perform more complex data manipulation seamlessly. In Python, we can achieve this by chaining the filter and sort functions together.
For example, let's filter for female entries and then sort them by alcohol consumption in descending order:
female_sorted_by_alcohol = health_survey_clean[condition].sort_values(by='alcohol_consumption_mean_weekly_units', ascending=False)
female_sorted_by_alcohol.head()
|     | id  | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|-----|-----|------|-----|---|---|---|---|
| 376 | 377 | 2016 | Female | 2.9 | 85.55 | 2.9 | 76.0 |
| 377 | 378 | 2012 | Female | 3.0 | 62.54 | 3.0 | 73.0 |
| 387 | 388 | 2015 | Female | 3.0 | 60.86 | 3.0 | 77.3 |
| 369 | 370 | 2014 | Female | 3.0 | 49.79 | 3.0 | 69.0 |
| 493 | 494 | 2014 | Female | 3.1 | 38.73 | 3.1 | 76.0 |
Summarize and Aggregate Data¶
Now that we can filter rows and select specific columns of our dataset, we can also summarize and aggregate this data to calculate summary statistics, such as the mean.
For example, we can group our dataset by year using the groupby() function:
health_survey_group = health_survey_clean.groupby('year')
Then, we can calculate the mean annual values of each numeric column using the agg() function, passing a dictionary (curly brackets {}) where the keys are the column names and the values are the aggregation functions we want to apply (in this case, the mean). For example, 'fruit_vegetable_consumption_mean_daily_portions': 'mean' means that for each group (each year), we calculate the mean of the fruit_vegetable_consumption_mean_daily_portions column.
health_survey_means = health_survey_group.agg({
'fruit_vegetable_consumption_mean_daily_portions': 'mean',
'alcohol_consumption_mean_weekly_units': 'mean',
'mental_wellbeing': 'mean',
'self_assessed_health': 'mean'
})
The agg() function applies the chosen aggregation (the mean in this case) to each group and returns the results in a new DataFrame. By default, aggregations such as the mean skip missing values, but it's always good to ensure your data is clean first.
Finally, we use reset_index() to convert the year index back into a regular column, making the DataFrame easier to work with.
health_survey_summary = health_survey_means.reset_index()
health_survey_summary
|    | year | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|----|------|---|---|---|---|
| 0  | 2008 | 3.179535 | 14.256512 | 3.179535 | 74.148837 |
| 1  | 2009 | 3.203256 | 12.937209 | 3.203256 | 74.646512 |
| 2  | 2010 | 3.178222 | 15.202667 | 3.178222 | 73.491111 |
| 3  | 2011 | 3.194318 | 13.092273 | 3.194318 | 74.581818 |
| 4  | 2012 | 3.169545 | 17.819091 | 3.169545 | 73.600000 |
| 5  | 2013 | 3.202955 | 13.719091 | 3.202955 | 74.268182 |
| 6  | 2014 | 3.189778 | 17.950444 | 3.189778 | 73.897778 |
| 7  | 2015 | 3.196222 | 17.189778 | 3.196222 | 73.993333 |
| 8  | 2016 | 3.179302 | 20.876047 | 3.179302 | 74.162791 |
| 9  | 2017 | 3.242667 | 12.333111 | 3.242667 | 73.873333 |
| 10 | 2018 | 3.218000 | 17.040000 | 3.218000 | 73.060000 |
| 11 | 2019 | 3.240889 | 12.678889 | 3.240889 | 72.808889 |
| 12 | 2021 | 3.279556 | 12.342222 | 3.279556 | 74.491111 |
| 13 | 2022 | 3.254091 | 13.163636 | 3.254091 | 72.622727 |
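As an aside, when every column gets the same aggregation, an equivalent shorthand (introducing a helper list, numeric_cols, purely for this example) is to select the numeric columns and call mean() directly on the grouped object; agg() is shown above because it generalises to mixed aggregations:
numeric_cols = [
    'fruit_vegetable_consumption_mean_daily_portions',
    'alcohol_consumption_mean_weekly_units',
    'mental_wellbeing',
    'self_assessed_health'
]
health_survey_clean.groupby('year')[numeric_cols].mean().reset_index()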
Let's rename the columns in the summary DataFrame for clarity. We'll use the rename() function and pass a dictionary where the keys are the old column names and the values are the new column names:
health_survey_summary.rename(columns={
'fruit_vegetable_consumption_mean_daily_portions': 'mean_fruit_vegetables',
'alcohol_consumption_mean_weekly_units': 'mean_alcohol',
'mental_wellbeing': 'mean_mental_wellbeing',
'self_assessed_health': 'mean_self_assessed_health'
}, inplace=True)
health_survey_summary
|    | year | mean_fruit_vegetables | mean_alcohol | mean_mental_wellbeing | mean_self_assessed_health |
|----|------|---|---|---|---|
| 0  | 2008 | 3.179535 | 14.256512 | 3.179535 | 74.148837 |
| 1  | 2009 | 3.203256 | 12.937209 | 3.203256 | 74.646512 |
| 2  | 2010 | 3.178222 | 15.202667 | 3.178222 | 73.491111 |
| 3  | 2011 | 3.194318 | 13.092273 | 3.194318 | 74.581818 |
| 4  | 2012 | 3.169545 | 17.819091 | 3.169545 | 73.600000 |
| 5  | 2013 | 3.202955 | 13.719091 | 3.202955 | 74.268182 |
| 6  | 2014 | 3.189778 | 17.950444 | 3.189778 | 73.897778 |
| 7  | 2015 | 3.196222 | 17.189778 | 3.196222 | 73.993333 |
| 8  | 2016 | 3.179302 | 20.876047 | 3.179302 | 74.162791 |
| 9  | 2017 | 3.242667 | 12.333111 | 3.242667 | 73.873333 |
| 10 | 2018 | 3.218000 | 17.040000 | 3.218000 | 73.060000 |
| 11 | 2019 | 3.240889 | 12.678889 | 3.240889 | 72.808889 |
| 12 | 2021 | 3.279556 | 12.342222 | 3.279556 | 74.491111 |
| 13 | 2022 | 3.254091 | 13.163636 | 3.254091 | 72.622727 |
Now, we have created a new DataFrame that contains the mean values for fruit and vegetable consumption, alcohol consumption, mental wellbeing, and self-assessed health scores for each year. This summarized data allows us to make comparisons and observe trends over time.
Export DataFrame¶
Exporting your data to different file formats allows you to save your results and share them with others.
To export the DataFrame health_survey_summary to a CSV file, use the to_csv() function. This function takes the file path as an argument and saves the DataFrame to a CSV file.
health_survey_summary.to_csv('health_survey_summary.csv')
To export the DataFrame to an Excel file, we use the to_excel() function, again giving the file path as an argument:
health_survey_summary.to_excel('health_survey_summary.xlsx')
You should see both 'health_survey_summary.csv' and 'health_survey_summary.xlsx' appear in your working directory.
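Both files will also contain the DataFrame's integer index as an extra first column. If you would rather leave it out, both to_csv() and to_excel() accept an index argument, for example:
health_survey_summary.to_csv('health_survey_summary.csv', index=False)
health_survey_summary.to_excel('health_survey_summary.xlsx', index=False)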
Transform and Create New Variables¶
Next, we will learn how to add new columns to a DataFrame or modify existing ones. This is a useful tool for enhancing data interpretation, facilitating comparisons, improving model performance, and deriving meaningful insights.
Let's create a new variable that reflects an overall measure of health, combining diet, alcohol, mental wellbeing, and self-assessed health according to the formula:
overall_health = 3 x fruit_vegetable_consumption_mean_daily_portions - 0.5 x alcohol_consumption_mean_weekly_units + mental_wellbeing + 0.1 x self_assessed_health
We can use pandas to do this by creating a new column, overall_health, in health_survey_clean:
health_survey_clean['overall_health'] = (
3 * health_survey_clean['fruit_vegetable_consumption_mean_daily_portions']
- 0.5 * health_survey_clean['alcohol_consumption_mean_weekly_units']
+ health_survey_clean['mental_wellbeing']
+ 0.1 * health_survey_clean['self_assessed_health']
)
health_survey_clean.head()
|   | id | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health | overall_health |
|---|----|------|-----|---|---|---|---|---|
| 0 | 1 | 2021 | Female | 3.40 | 0.00 | 3.40 | 72.2 | 20.820 |
| 1 | 2 | 2012 | Male | 3.19 | 17.50 | 3.19 | 76.8 | 11.690 |
| 2 | 3 | 2019 | Female | 3.30 | 1.91 | 3.30 | 78.3 | 20.075 |
| 3 | 4 | 2009 | Male | 3.30 | 0.00 | 3.30 | 71.2 | 20.320 |
| 4 | 5 | 2010 | Female | 3.20 | 22.04 | 3.20 | 71.2 | 8.900 |
This calculates a new variable based on the formula, and the result is added as a new column, overall_health, at the end of the DataFrame.
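An equivalent alternative, sketched here only for comparison (health_with_score is just an example name), is the assign() method, which returns a new DataFrame with the extra column instead of modifying health_survey_clean in place:
health_with_score = health_survey_clean.assign(
    overall_health=lambda df: 3 * df['fruit_vegetable_consumption_mean_daily_portions']
    - 0.5 * df['alcohol_consumption_mean_weekly_units']
    + df['mental_wellbeing']
    + 0.1 * df['self_assessed_health']
)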
Reshaping Data¶
Sometimes we may want to reshape our data to make it easier to visualize or model. Pandas provides functions such as melt() and pivot() to transform data between wide and long formats.
- Wide Format: the standard "Excel" layout, where each variable is in a separate column. The health_survey_clean DataFrame is in wide format.
- Long Format: data where each row contains one variable (one measurement) per subject.
Converting from wide to long format¶
Let's convert our health_survey_clean DataFrame from wide to long format using the melt() function. The id_vars parameter specifies which columns to keep as identifiers, value_vars specifies the columns to unpivot into rows, var_name specifies the name of the new column that will contain the variable names, and value_name specifies the name of the new column that will contain the values:
health_survey_long = pd.melt(
health_survey_clean,
id_vars='id',
value_vars=[
'fruit_vegetable_consumption_mean_daily_portions',
'alcohol_consumption_mean_weekly_units',
'mental_wellbeing',
'self_assessed_health'
],
var_name='measurement',
value_name='value'
)
health_survey_long.head()
|   | id | measurement | value |
|---|----|-------------|-------|
| 0 | 1 | fruit_vegetable_consumption_mean_daily_portions | 3.40 |
| 1 | 2 | fruit_vegetable_consumption_mean_daily_portions | 3.19 |
| 2 | 3 | fruit_vegetable_consumption_mean_daily_portions | 3.30 |
| 3 | 4 | fruit_vegetable_consumption_mean_daily_portions | 3.30 |
| 4 | 5 | fruit_vegetable_consumption_mean_daily_portions | 3.20 |
Note: melt() stacks all the selected columns into a single "value" column, so they should share a compatible type; mixing types such as categorical and numeric would force the column to a generic object dtype.
As you can see, the first few rows correspond to the same measurement but for different entries. This means that there are more rows in the long format DataFrame but fewer columns.
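You can confirm this by comparing the two shapes (rows, columns); the exact numbers depend on how many rows survived cleaning:
print(health_survey_clean.shape)
print(health_survey_long.shape)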
Converting from long to wide format¶
We can convert our DataFrame back from long to wide format using the pivot() function. Here, index specifies the column that contains the unique identifiers, columns specifies the column that contains the variable names, and values specifies the column that contains the values.
# Convert long format back to wide format
health_survey_wide = health_survey_long.pivot(
index='id',
columns='measurement',
values='value'
)
health_survey_wide.head()
| measurement | alcohol_consumption_mean_weekly_units | fruit_vegetable_consumption_mean_daily_portions | mental_wellbeing | self_assessed_health |
|-------------|---|---|---|---|
| id          |   |   |   |   |
| 1 | 0.00 | 3.40 | 3.40 | 72.2 |
| 2 | 17.50 | 3.19 | 3.19 | 76.8 |
| 3 | 1.91 | 3.30 | 3.30 | 78.3 |
| 4 | 0.00 | 3.30 | 3.30 | 71.2 |
| 5 | 22.04 | 3.20 | 3.20 | 71.2 |
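Note that after pivot(), id has become the DataFrame's index and the column axis carries the name 'measurement'. If you would rather end up with a plain wide DataFrame that looks like the original, an optional clean-up (a small sketch, not part of the exercise output) is:
health_survey_wide = health_survey_wide.reset_index()
health_survey_wide.columns.name = None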
Log Transformations¶
Log transformations can be helpful in correcting skewed data by compressing the range of the data. Skewness measures the asymmetry of the distribution of values in a dataset. A skewness value greater than 1 or less than -1 indicates a highly skewed distribution, and such columns might benefit from log transformation. This transformation can make the data more normally distributed and suitable for further analysis.
First, we inspect the skewness of each numeric column in the DataFrame using the skew() function in pandas:
numeric_columns = ['alcohol_consumption_mean_weekly_units', 'fruit_vegetable_consumption_mean_daily_portions', 'mental_wellbeing', 'self_assessed_health']
skewness = health_survey_clean[numeric_columns].skew()
print(skewness)
alcohol_consumption_mean_weekly_units              3.999931
fruit_vegetable_consumption_mean_daily_portions   -0.290814
mental_wellbeing                                  -0.290814
self_assessed_health                              -0.233775
dtype: float64
Recall that a skewness value above 1 (or below -1) indicates a highly skewed distribution. We can see that alcohol_consumption_mean_weekly_units has a skewness of about 4, indicating that it is strongly right-skewed.
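Since matplotlib is already imported, a quick histogram makes the long right tail easy to see (a minimal sketch; the bin count of 30 is an arbitrary choice):
health_survey_clean['alcohol_consumption_mean_weekly_units'].plot.hist(bins=30)
plt.xlabel('Mean weekly alcohol units')
plt.show()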
Let's correct the skewness by applying a log transformation to the alcohol_consumption_mean_weekly_units column. We can use the transform() method with np.log10 to do this:
health_survey_clean['log_alcohol_consumption'] = health_survey_clean['alcohol_consumption_mean_weekly_units'].transform(np.log10)
health_survey_clean.head()
|   | id | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health | overall_health | log_alcohol_consumption |
|---|----|------|-----|---|---|---|---|---|---|
| 0 | 1 | 2021 | Female | 3.40 | 0.00 | 3.40 | 72.2 | 20.820 | -inf |
| 1 | 2 | 2012 | Male | 3.19 | 17.50 | 3.19 | 76.8 | 11.690 | 1.243038 |
| 2 | 3 | 2019 | Female | 3.30 | 1.91 | 3.30 | 78.3 | 20.075 | 0.281033 |
| 3 | 4 | 2009 | Male | 3.30 | 0.00 | 3.30 | 71.2 | 20.320 | -inf |
| 4 | 5 | 2010 | Female | 3.20 | 22.04 | 3.20 | 71.2 | 8.900 | 1.343212 |
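Note that rows with zero alcohol consumption end up with a log value of -inf, because the logarithm of 0 is undefined. The exercise leaves these as they are, but a common workaround, sketched below with a hypothetical column name, is to shift the values by 1 before taking the log so that zeros map to 0:
# log_alcohol_consumption_shifted is an example name, not part of the original exercise
health_survey_clean['log_alcohol_consumption_shifted'] = np.log10(
    health_survey_clean['alcohol_consumption_mean_weekly_units'] + 1
)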