Exercise 3: Data Transformation in Python¶

Now that we have cleaned our dataset, it is ready to be analysed. This exercise teaches how to sort and filter data, summarize and aggregate data, add and rename columns, reshape data, perform log transformations, and export data to CSV or Excel.

In [1]:
from IPython.display import display, HTML
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

health_survey_clean = pd.read_csv("sample_health_data.csv").dropna().drop_duplicates()
health_survey_clean['sex'] = health_survey_clean['sex'].astype('category')

Filter Data¶

To filter for entries that meet certain criteria, we can use the query() function or square bracket [] notation. We already know how to filter for specific columns using square bracket [] notation, as we did in the previous exercise, where we filtered for the sex column.

Let's take this up a notch, using square brackets to filter for rows where the sex column is 'Female':

In [2]:
condition = health_survey_clean['sex'] == 'Female'
condition
Out[2]:
0       True
1      False
2       True
3      False
4       True
       ...  
625     True
626     True
627     True
628     True
629     True
Name: sex, Length: 620, dtype: bool

This prints a series containing True/False values for each row, indicating whether the condition is met.

We can then use this boolean series to filter the DataFrame, including only the rows where the condition is True:

In [3]:
female_entries = health_survey_clean[condition]

female_entries.head()
Out[3]:
id year sex fruit_vegetable_consumption_mean_daily_portions alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health
0 1 2021 Female 3.4 0.00 3.4 72.2
2 3 2019 Female 3.3 1.91 3.3 78.3
4 5 2010 Female 3.2 22.04 3.2 71.2
9 10 2008 Female 3.3 0.00 3.3 78.6
14 15 2010 Female 3.3 0.00 3.3 77.1

Alternatively, we can achieve the same result using the query() function, giving it the condition sex == "Female", for a more readable syntax:

In [4]:
female_entries = health_survey_clean.query('sex == "Female"')

female_entries.head()
Out[4]:
id year sex fruit_vegetable_consumption_mean_daily_portions alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health
0 1 2021 Female 3.4 0.00 3.4 72.2
2 3 2019 Female 3.3 1.91 3.3 78.3
4 5 2010 Female 3.2 22.04 3.2 71.2
9 10 2008 Female 3.3 0.00 3.3 78.6
14 15 2010 Female 3.3 0.00 3.3 77.1

Sort Data¶

To sort the data, we can use the sort_values() function. For instance, let's sort the dataset by alcohol_consumption_mean_weekly_units in descending order:

In [5]:
sorted_by_alcohol = health_survey_clean.sort_values(by='alcohol_consumption_mean_weekly_units', ascending=False)

sorted_by_alcohol.head()
Out[5]:
id year sex fruit_vegetable_consumption_mean_daily_portions alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health
475 476 2016 Male 2.7 126.90 2.7 75.0
42 43 2016 Male 2.8 104.44 2.8 76.8
438 439 2018 Male 2.8 101.20 2.8 71.2
440 441 2015 Male 2.8 96.28 2.8 68.4
376 377 2016 Female 2.9 85.55 2.9 76.0

Combining filtering and sorting allows us to perform more complex data manipulation seamlessly. In Python, we can achieve this by chaining the filter and sort functions together.

For example, let's filter for female entries and then sort them by alcohol consumption in descending order:

In [6]:
female_sorted_by_alcohol = health_survey_clean[condition].sort_values(by='alcohol_consumption_mean_weekly_units', ascending=False)

female_sorted_by_alcohol.head()
Out[6]:
id year sex fruit_vegetable_consumption_mean_daily_portions alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health
376 377 2016 Female 2.9 85.55 2.9 76.0
377 378 2012 Female 3.0 62.54 3.0 73.0
387 388 2015 Female 3.0 60.86 3.0 77.3
369 370 2014 Female 3.0 49.79 3.0 69.0
493 494 2014 Female 3.1 38.73 3.1 76.0

Summarize and Aggregate Data¶

Now that we can filter rows and select specific columns of our dataset, we can also summarize and aggregate this data to calculate summary statistics, such as the mean.

For example, we can group our dataset by year using the groupby() function:

In [7]:
health_survey_group = health_survey_clean.groupby('year')

Then, we can calculate the mean annual values of each numeric column using the agg() function, passing a dictionary using curly brackets {} where the keys are the column names and the values are the aggregation functions we want to apply (in this case, the mean).

For example, 'fruit_vegetable_consumption_mean_daily_portions': 'mean' means that for each group (each year), we calculate the mean of the fruit_vegetable_consumption_mean_daily_portions column.

In [8]:
health_survey_means = health_survey_group.agg({
    'fruit_vegetable_consumption_mean_daily_portions': 'mean',
    'alcohol_consumption_mean_weekly_units': 'mean',
    'mental_wellbeing': 'mean',
    'self_assessed_health': 'mean'
})

The agg() function allows us to apply the aggregation functions (mean() in this case) to each group and return the results in a new DataFrame. By default, agg() handles missing values automatically, but it's always good to ensure your data is clean.

Finally, we use reset_index() to convert the index back into a regular column, making the DataFrame easier to work with.

In [9]:
health_survey_summary = health_survey_means.reset_index()

health_survey_summary
Out[9]:
year fruit_vegetable_consumption_mean_daily_portions alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health
0 2008 3.179535 14.256512 3.179535 74.148837
1 2009 3.203256 12.937209 3.203256 74.646512
2 2010 3.178222 15.202667 3.178222 73.491111
3 2011 3.194318 13.092273 3.194318 74.581818
4 2012 3.169545 17.819091 3.169545 73.600000
5 2013 3.202955 13.719091 3.202955 74.268182
6 2014 3.189778 17.950444 3.189778 73.897778
7 2015 3.196222 17.189778 3.196222 73.993333
8 2016 3.179302 20.876047 3.179302 74.162791
9 2017 3.242667 12.333111 3.242667 73.873333
10 2018 3.218000 17.040000 3.218000 73.060000
11 2019 3.240889 12.678889 3.240889 72.808889
12 2021 3.279556 12.342222 3.279556 74.491111
13 2022 3.254091 13.163636 3.254091 72.622727

Let's rename the columns in the summary DataFrame for clarity. We'll use the rename function and pass a dictionary where the keys are the old column names and the values are the new column names:

In [10]:
health_survey_summary.rename(columns={
    'fruit_vegetable_consumption_mean_daily_portions': 'mean_fruit_vegetables',
    'alcohol_consumption_mean_weekly_units': 'mean_alcohol',
    'mental_wellbeing': 'mean_mental_wellbeing',
    'self_assessed_health': 'mean_self_assessed_health'
}, inplace=True)

health_survey_summary
Out[10]:
year mean_fruit_vegetables mean_alcohol mean_mental_wellbeing mean_self_assessed_health
0 2008 3.179535 14.256512 3.179535 74.148837
1 2009 3.203256 12.937209 3.203256 74.646512
2 2010 3.178222 15.202667 3.178222 73.491111
3 2011 3.194318 13.092273 3.194318 74.581818
4 2012 3.169545 17.819091 3.169545 73.600000
5 2013 3.202955 13.719091 3.202955 74.268182
6 2014 3.189778 17.950444 3.189778 73.897778
7 2015 3.196222 17.189778 3.196222 73.993333
8 2016 3.179302 20.876047 3.179302 74.162791
9 2017 3.242667 12.333111 3.242667 73.873333
10 2018 3.218000 17.040000 3.218000 73.060000
11 2019 3.240889 12.678889 3.240889 72.808889
12 2021 3.279556 12.342222 3.279556 74.491111
13 2022 3.254091 13.163636 3.254091 72.622727

Now, we have created a new DataFrame that contains the mean values for fruit and vegetable consumption, alcohol consumption, mental wellbeing, and self-assessed health scores for each year. This summarized data allows us to make comparisons and observe trends over time.

Export DataFrame¶

Exporting your data to different file formats allows you to save your results and share them with others.

To export the DataFrame health_survey_summary to a CSV file, use the to_csv() function. This function takes the file path as an argument and saves the DataFrame to a CSV file.

In [11]:
health_survey_summary.to_csv('health_survey_summary.csv')

To export the DataFrame to an Excel file, we use the to_excel() function, giving the file path as an argument:

In [12]:
health_survey_summary.to_excel('health_survey_summary.xlsx')

You should see both 'health_survey_summary.csv' and 'health_survey_summary.xlsx' appear in your working directory.

Transform and Create New Variables¶

Next, we will learn how to add new columns to a DataFrame or modify existing ones. This is a useful tool for enhancing data interpretation, facilitating comparisons, improving model performance, and deriving meaningful insights.

Let's create a new variable that reflects an overall measure of health, combining diet, alcohol, mental wellbeing, and self-assessed health according to the formula:

overall_health = 3 x mean_fruit_vegetables - 0.5 x mean_alcohol + mean_mental_wellbeing + 0.1 x mean_self_assessed_health

We can use pandas to do this by creating a new column overall_health:

In [13]:
health_survey_clean['overall_health'] = (
    3 * health_survey_clean['fruit_vegetable_consumption_mean_daily_portions']
    - 0.5 * health_survey_clean['alcohol_consumption_mean_weekly_units']
    + health_survey_clean['mental_wellbeing']
    + 0.1 * health_survey_clean['self_assessed_health']
)

health_survey_clean.head()
Out[13]:
id year sex fruit_vegetable_consumption_mean_daily_portions alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health overall_health
0 1 2021 Female 3.40 0.00 3.40 72.2 20.820
1 2 2012 Male 3.19 17.50 3.19 76.8 11.690
2 3 2019 Female 3.30 1.91 3.30 78.3 20.075
3 4 2009 Male 3.30 0.00 3.30 71.2 20.320
4 5 2010 Female 3.20 22.04 3.20 71.2 8.900

This calculates a new variable based on the formula, and the result of the calculation is added as a new column overall_health at the end of the DataFrame.

Reshaping Data¶

Sometimes we may want to reshape our data to make it easier to visualize or model. Pandas provides functions such as melt() and pivot() to transform data between wide and long formats.

  • Wide Format: Standard "Excel" format, where each variable is in a separate column. The health_survey DataFrame is in wide format.
  • Long Format: Data where each row contains one variable per subject.

Converting from wide to long format¶

Let's convert our health_survey_clean DataFrame from wide to long format using the melt() function. The id_vars parameter specifies which columns to keep as identifiers, value_var specifies the columns to pivot, var_name specifies the name of the new column that will contain the variable names, and value_name specifies the name of the new column that will contain the values:

In [14]:
health_survey_long = pd.melt(
    health_survey_clean,
    id_vars='id',
    value_vars=[
        'fruit_vegetable_consumption_mean_daily_portions',
        'alcohol_consumption_mean_weekly_units',
        'mental_wellbeing',
        'self_assessed_health'
    ],
    var_name='measurement',
    value_name='value'
)

health_survey_long.head()
Out[14]:
id measurement value
0 1 fruit_vegetable_consumption_mean_daily_portions 3.40
1 2 fruit_vegetable_consumption_mean_daily_portions 3.19
2 3 fruit_vegetable_consumption_mean_daily_portions 3.30
3 4 fruit_vegetable_consumption_mean_daily_portions 3.30
4 5 fruit_vegetable_consumption_mean_daily_portions 3.20

Note: All entries in the "value" column must be of the same type, so you cannot combine types like categorical and integer.

As you can see, the first few rows correspond to the same measurement but for different entries. This means that there are more rows in the long format DataFrame but fewer columns.

Converting from long to wide format¶

We can convert our DataFrame back from long to wide format using the pivot() function. In this function, index specifies the column that contains the unique identifiers, columns specifies the column that contains the variable names, and values specifies the column that contains the values.

In [15]:
# Convert long format back to wide format
health_survey_wide = health_survey_long.pivot(
    index='id',
    columns='measurement',
    values='value'
)

health_survey_wide.head()
Out[15]:
measurement alcohol_consumption_mean_weekly_units fruit_vegetable_consumption_mean_daily_portions mental_wellbeing self_assessed_health
id
1 0.00 3.40 3.40 72.2
2 17.50 3.19 3.19 76.8
3 1.91 3.30 3.30 78.3
4 0.00 3.30 3.30 71.2
5 22.04 3.20 3.20 71.2

Log Transformations¶

Log transformations can be helpful in correcting skewed data by compressing the range of the data. Skewness measures the asymmetry of the distribution of values in a dataset. A skewness value greater than 1 or less than -1 indicates a highly skewed distribution, and such columns might benefit from log transformation. This transformation can make the data more normally distributed and suitable for further analysis.

First, we inspect the skewness of each numeric column in the DataFrame using the skew() function in pandas:

In [16]:
numeric_columns = ['alcohol_consumption_mean_weekly_units', 'fruit_vegetable_consumption_mean_daily_portions', 'mental_wellbeing', 'self_assessed_health']

skewness = health_survey_clean[numeric_columns].skew()

print(skewness)
alcohol_consumption_mean_weekly_units              3.999931
fruit_vegetable_consumption_mean_daily_portions   -0.290814
mental_wellbeing                                  -0.290814
self_assessed_health                              -0.233775
dtype: float64

A skewness value greater than 1 or less than -1 indicates a highly skewed distribution. We can see that alcohol_consumption_mean_weekly_units has a skewness value of 4, indicating it is strongly right-skewed.

Let's correct the skewness by applying a log transformation to the alcohol_consumption_mean_weekly_units column. We can use the transform(np.log10) function to do this:

In [17]:
health_survey_clean['log_alcohol_consumption'] = health_survey_clean['alcohol_consumption_mean_weekly_units'].transform(np.log10)

health_survey_clean.head()
Out[17]:
id year sex fruit_vegetable_consumption_mean_daily_portions alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health overall_health log_alcohol_consumption
0 1 2021 Female 3.40 0.00 3.40 72.2 20.820 -inf
1 2 2012 Male 3.19 17.50 3.19 76.8 11.690 1.243038
2 3 2019 Female 3.30 1.91 3.30 78.3 20.075 0.281033
3 4 2009 Male 3.30 0.00 3.30 71.2 20.320 -inf
4 5 2010 Female 3.20 22.04 3.20 71.2 8.900 1.343212