Exercise 3: Data Transformation in Python¶
Now that we have cleaned our dataset, it is ready to be analysed. This exercise teaches how to sort and filter data, summarize and aggregate data, add and rename columns, reshape data, perform log transformations, and export data to CSV or Excel.
from IPython.display import display, HTML
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Reload the dataset and repeat the cleaning steps from the previous exercise
health_survey_clean = pd.read_csv("sample_health_data.csv").dropna().drop_duplicates()
health_survey_clean['sex'] = health_survey_clean['sex'].astype('category')
Filter Data¶
To filter for entries that meet certain criteria, we can use the query() function or square bracket [] notation. We already know how to filter for specific columns using square bracket notation, as we did in the previous exercise, where we filtered for the sex column.
Let's take this up a notch, using square brackets to filter for rows where the sex column is 'Female':
condition = health_survey_clean['sex'] == 'Female'
condition
0       True
1      False
2       True
3      False
4       True
        ...
625     True
626     True
627     True
628     True
629     True
Name: sex, Length: 620, dtype: bool
This prints a series containing True/False values for each row, indicating whether the condition is met.
We can then use this boolean series to filter the DataFrame, including only the rows where the condition is True:
female_entries = health_survey_clean[condition]
female_entries.head()
|    | id | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|----|----|------|-----|---|---|---|---|
| 0  | 1  | 2021 | Female | 3.4 | 0.00 | 3.4 | 72.2 |
| 2  | 3  | 2019 | Female | 3.3 | 1.91 | 3.3 | 78.3 |
| 4  | 5  | 2010 | Female | 3.2 | 22.04 | 3.2 | 71.2 |
| 9  | 10 | 2008 | Female | 3.3 | 0.00 | 3.3 | 78.6 |
| 14 | 15 | 2010 | Female | 3.3 | 0.00 | 3.3 | 77.1 |
Alternatively, we can achieve the same result with a more readable syntax using the query() function, passing it the condition sex == "Female":
female_entries = health_survey_clean.query('sex == "Female"')
female_entries.head()
|    | id | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|----|----|------|-----|---|---|---|---|
| 0  | 1  | 2021 | Female | 3.4 | 0.00 | 3.4 | 72.2 |
| 2  | 3  | 2019 | Female | 3.3 | 1.91 | 3.3 | 78.3 |
| 4  | 5  | 2010 | Female | 3.2 | 22.04 | 3.2 | 71.2 |
| 9  | 10 | 2008 | Female | 3.3 | 0.00 | 3.3 | 78.6 |
| 14 | 15 | 2010 | Female | 3.3 | 0.00 | 3.3 | 77.1 |
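query() also makes it easy to combine several conditions in one expression. As a quick illustration (the variable name recent_female_entries is just for this example, not part of the exercise output), the following filters for female entries recorded in 2015 or later:
recent_female_entries = health_survey_clean.query('sex == "Female" and year >= 2015')
recent_female_entries.head()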
Sort Data¶
To sort the data, we can use the sort_values() function. For instance, let's sort the dataset by alcohol_consumption_mean_weekly_units in descending order:
sorted_by_alcohol = health_survey_clean.sort_values(by='alcohol_consumption_mean_weekly_units', ascending=False)
sorted_by_alcohol.head()
|     | id  | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|-----|-----|------|-----|---|---|---|---|
| 475 | 476 | 2016 | Male | 2.7 | 126.90 | 2.7 | 75.0 |
| 42  | 43  | 2016 | Male | 2.8 | 104.44 | 2.8 | 76.8 |
| 438 | 439 | 2018 | Male | 2.8 | 101.20 | 2.8 | 71.2 |
| 440 | 441 | 2015 | Male | 2.8 | 96.28 | 2.8 | 68.4 |
| 376 | 377 | 2016 | Female | 2.9 | 85.55 | 2.9 | 76.0 |
Combining filtering and sorting allows us to perform more complex data manipulation seamlessly. In Python, we can achieve this by chaining the filter and sort functions together.
For example, let's filter for female entries and then sort them by alcohol consumption in descending order:
female_sorted_by_alcohol = health_survey_clean[condition].sort_values(by='alcohol_consumption_mean_weekly_units', ascending=False)
female_sorted_by_alcohol.head()
|     | id  | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|-----|-----|------|-----|---|---|---|---|
| 376 | 377 | 2016 | Female | 2.9 | 85.55 | 2.9 | 76.0 |
| 377 | 378 | 2012 | Female | 3.0 | 62.54 | 3.0 | 73.0 |
| 387 | 388 | 2015 | Female | 3.0 | 60.86 | 3.0 | 77.3 |
| 369 | 370 | 2014 | Female | 3.0 | 49.79 | 3.0 | 69.0 |
| 493 | 494 | 2014 | Female | 3.1 | 38.73 | 3.1 | 76.0 |
Summarize and Aggregate Data¶
Now that we can filter rows and select specific columns of our dataset, we can also summarize and aggregate this data to calculate summary statistics, such as the mean.
For example, we can group our dataset by year using the groupby() function:
health_survey_group = health_survey_clean.groupby('year')
Then, we can calculate the mean annual values of each numeric column using the agg() function, passing a dictionary (curly brackets {}) where the keys are the column names and the values are the aggregation functions we want to apply (in this case, the mean). For example, 'fruit_vegetable_consumption_mean_daily_portions': 'mean' means that for each group (each year), we calculate the mean of the fruit_vegetable_consumption_mean_daily_portions column.
health_survey_means = health_survey_group.agg({
'fruit_vegetable_consumption_mean_daily_portions': 'mean',
'alcohol_consumption_mean_weekly_units': 'mean',
'mental_wellbeing': 'mean',
'self_assessed_health': 'mean'
})
The agg() function applies the chosen aggregation (the mean in this case) to each group and returns the results in a new DataFrame. By default, aggregations such as the mean skip missing values, but it's always good to ensure your data is clean first.
Finally, we use reset_index() to convert the year index back into a regular column, making the DataFrame easier to work with.
health_survey_summary = health_survey_means.reset_index()
health_survey_summary
|    | year | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
|----|------|---|---|---|---|
| 0  | 2008 | 3.179535 | 14.256512 | 3.179535 | 74.148837 |
| 1  | 2009 | 3.203256 | 12.937209 | 3.203256 | 74.646512 |
| 2  | 2010 | 3.178222 | 15.202667 | 3.178222 | 73.491111 |
| 3  | 2011 | 3.194318 | 13.092273 | 3.194318 | 74.581818 |
| 4  | 2012 | 3.169545 | 17.819091 | 3.169545 | 73.600000 |
| 5  | 2013 | 3.202955 | 13.719091 | 3.202955 | 74.268182 |
| 6  | 2014 | 3.189778 | 17.950444 | 3.189778 | 73.897778 |
| 7  | 2015 | 3.196222 | 17.189778 | 3.196222 | 73.993333 |
| 8  | 2016 | 3.179302 | 20.876047 | 3.179302 | 74.162791 |
| 9  | 2017 | 3.242667 | 12.333111 | 3.242667 | 73.873333 |
| 10 | 2018 | 3.218000 | 17.040000 | 3.218000 | 73.060000 |
| 11 | 2019 | 3.240889 | 12.678889 | 3.240889 | 72.808889 |
| 12 | 2021 | 3.279556 | 12.342222 | 3.279556 | 74.491111 |
| 13 | 2022 | 3.254091 | 13.163636 | 3.254091 | 72.622727 |
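As an aside, when every column gets the same aggregation, an equivalent shorthand (introducing a helper list, numeric_cols, purely for this example) is to select the numeric columns and call mean() directly on the grouped object; agg() is shown above because it generalises to mixed aggregations:
numeric_cols = [
    'fruit_vegetable_consumption_mean_daily_portions',
    'alcohol_consumption_mean_weekly_units',
    'mental_wellbeing',
    'self_assessed_health'
]
health_survey_clean.groupby('year')[numeric_cols].mean().reset_index()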
Let's rename the columns in the summary DataFrame for clarity. We'll use the rename() function and pass a dictionary where the keys are the old column names and the values are the new column names:
health_survey_summary.rename(columns={
'fruit_vegetable_consumption_mean_daily_portions': 'mean_fruit_vegetables',
'alcohol_consumption_mean_weekly_units': 'mean_alcohol',
'mental_wellbeing': 'mean_mental_wellbeing',
'self_assessed_health': 'mean_self_assessed_health'
}, inplace=True)
health_survey_summary
|    | year | mean_fruit_vegetables | mean_alcohol | mean_mental_wellbeing | mean_self_assessed_health |
|----|------|---|---|---|---|
| 0  | 2008 | 3.179535 | 14.256512 | 3.179535 | 74.148837 |
| 1  | 2009 | 3.203256 | 12.937209 | 3.203256 | 74.646512 |
| 2  | 2010 | 3.178222 | 15.202667 | 3.178222 | 73.491111 |
| 3  | 2011 | 3.194318 | 13.092273 | 3.194318 | 74.581818 |
| 4  | 2012 | 3.169545 | 17.819091 | 3.169545 | 73.600000 |
| 5  | 2013 | 3.202955 | 13.719091 | 3.202955 | 74.268182 |
| 6  | 2014 | 3.189778 | 17.950444 | 3.189778 | 73.897778 |
| 7  | 2015 | 3.196222 | 17.189778 | 3.196222 | 73.993333 |
| 8  | 2016 | 3.179302 | 20.876047 | 3.179302 | 74.162791 |
| 9  | 2017 | 3.242667 | 12.333111 | 3.242667 | 73.873333 |
| 10 | 2018 | 3.218000 | 17.040000 | 3.218000 | 73.060000 |
| 11 | 2019 | 3.240889 | 12.678889 | 3.240889 | 72.808889 |
| 12 | 2021 | 3.279556 | 12.342222 | 3.279556 | 74.491111 |
| 13 | 2022 | 3.254091 | 13.163636 | 3.254091 | 72.622727 |
Now, we have created a new DataFrame that contains the mean values for fruit and vegetable consumption, alcohol consumption, mental wellbeing, and self-assessed health scores for each year. This summarized data allows us to make comparisons and observe trends over time.
Export DataFrame¶
Exporting your data to different file formats allows you to save your results and share them with others.
To export the DataFrame health_survey_summary to a CSV file, use the to_csv() function. This function takes the file path as an argument and saves the DataFrame to a CSV file.
health_survey_summary.to_csv('health_survey_summary.csv')
To export the DataFrame to an Excel file, we use the to_excel() function, again giving the file path as an argument:
health_survey_summary.to_excel('health_survey_summary.xlsx')
You should see both 'health_survey_summary.csv' and 'health_survey_summary.xlsx' appear in your working directory.
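Both files will also contain the DataFrame's integer index as an extra first column. If you would rather leave it out, both to_csv() and to_excel() accept an index argument, for example:
health_survey_summary.to_csv('health_survey_summary.csv', index=False)
health_survey_summary.to_excel('health_survey_summary.xlsx', index=False)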
Transform and Create New Variables¶
Next, we will learn how to add new columns to a DataFrame or modify existing ones. This is a useful tool for enhancing data interpretation, facilitating comparisons, improving model performance, and deriving meaningful insights.
Let's create a new variable that reflects an overall measure of health, combining diet, alcohol, mental wellbeing, and self-assessed health according to the formula:
overall_health = 3 x fruit_vegetable_consumption_mean_daily_portions - 0.5 x alcohol_consumption_mean_weekly_units + mental_wellbeing + 0.1 x self_assessed_health
We can use pandas to do this by creating a new column, overall_health, in health_survey_clean:
health_survey_clean['overall_health'] = (
3 * health_survey_clean['fruit_vegetable_consumption_mean_daily_portions']
- 0.5 * health_survey_clean['alcohol_consumption_mean_weekly_units']
+ health_survey_clean['mental_wellbeing']
+ 0.1 * health_survey_clean['self_assessed_health']
)
health_survey_clean.head()
|   | id | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health | overall_health |
|---|----|------|-----|---|---|---|---|---|
| 0 | 1 | 2021 | Female | 3.40 | 0.00 | 3.40 | 72.2 | 20.820 |
| 1 | 2 | 2012 | Male | 3.19 | 17.50 | 3.19 | 76.8 | 11.690 |
| 2 | 3 | 2019 | Female | 3.30 | 1.91 | 3.30 | 78.3 | 20.075 |
| 3 | 4 | 2009 | Male | 3.30 | 0.00 | 3.30 | 71.2 | 20.320 |
| 4 | 5 | 2010 | Female | 3.20 | 22.04 | 3.20 | 71.2 | 8.900 |
This calculates a new variable based on the formula, and the result is added as a new column, overall_health, at the end of the DataFrame.
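An equivalent alternative, sketched here only for comparison (health_with_score is just an example name), is the assign() method, which returns a new DataFrame with the extra column instead of modifying health_survey_clean in place:
health_with_score = health_survey_clean.assign(
    overall_health=lambda df: 3 * df['fruit_vegetable_consumption_mean_daily_portions']
    - 0.5 * df['alcohol_consumption_mean_weekly_units']
    + df['mental_wellbeing']
    + 0.1 * df['self_assessed_health']
)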
Reshaping Data¶
Sometimes we may want to reshape our data to make it easier to visualize or model. Pandas provides functions such as melt() and pivot() to transform data between wide and long formats.
- Wide Format: the standard "Excel" layout, where each variable is in a separate column. The health_survey_clean DataFrame is in wide format.
- Long Format: data where each row contains one variable (one measurement) per subject.
Converting from wide to long format¶
Let's convert our health_survey_clean DataFrame from wide to long format using the melt() function. The id_vars parameter specifies which columns to keep as identifiers, value_vars specifies the columns to unpivot into rows, var_name specifies the name of the new column that will contain the variable names, and value_name specifies the name of the new column that will contain the values:
health_survey_long = pd.melt(
health_survey_clean,
id_vars='id',
value_vars=[
'fruit_vegetable_consumption_mean_daily_portions',
'alcohol_consumption_mean_weekly_units',
'mental_wellbeing',
'self_assessed_health'
],
var_name='measurement',
value_name='value'
)
health_survey_long.head()
|   | id | measurement | value |
|---|----|-------------|-------|
| 0 | 1 | fruit_vegetable_consumption_mean_daily_portions | 3.40 |
| 1 | 2 | fruit_vegetable_consumption_mean_daily_portions | 3.19 |
| 2 | 3 | fruit_vegetable_consumption_mean_daily_portions | 3.30 |
| 3 | 4 | fruit_vegetable_consumption_mean_daily_portions | 3.30 |
| 4 | 5 | fruit_vegetable_consumption_mean_daily_portions | 3.20 |
Note: melt() stacks all the selected columns into a single "value" column, so they should share a compatible type; mixing types such as categorical and numeric would force the column to a generic object dtype.
As you can see, the first few rows correspond to the same measurement but for different entries. This means that there are more rows in the long format DataFrame but fewer columns.
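You can confirm this by comparing the two shapes (rows, columns); the exact numbers depend on how many rows survived cleaning:
print(health_survey_clean.shape)
print(health_survey_long.shape)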
Converting from long to wide format¶
We can convert our DataFrame back from long to wide format using the pivot() function. Here, index specifies the column that contains the unique identifiers, columns specifies the column that contains the variable names, and values specifies the column that contains the values.
# Convert long format back to wide format
health_survey_wide = health_survey_long.pivot(
index='id',
columns='measurement',
values='value'
)
health_survey_wide.head()
| measurement | alcohol_consumption_mean_weekly_units | fruit_vegetable_consumption_mean_daily_portions | mental_wellbeing | self_assessed_health |
|-------------|---|---|---|---|
| id          |   |   |   |   |
| 1 | 0.00 | 3.40 | 3.40 | 72.2 |
| 2 | 17.50 | 3.19 | 3.19 | 76.8 |
| 3 | 1.91 | 3.30 | 3.30 | 78.3 |
| 4 | 0.00 | 3.30 | 3.30 | 71.2 |
| 5 | 22.04 | 3.20 | 3.20 | 71.2 |
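Note that after pivot(), id has become the DataFrame's index and the column axis carries the name 'measurement'. If you would rather end up with a plain wide DataFrame that looks like the original, an optional clean-up (a small sketch, not part of the exercise output) is:
health_survey_wide = health_survey_wide.reset_index()
health_survey_wide.columns.name = None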
Log Transformations¶
Log transformations can be helpful in correcting skewed data by compressing the range of the data. Skewness measures the asymmetry of the distribution of values in a dataset. A skewness value greater than 1 or less than -1 indicates a highly skewed distribution, and such columns might benefit from log transformation. This transformation can make the data more normally distributed and suitable for further analysis.
First, we inspect the skewness of each numeric column in the DataFrame using the skew() function in pandas:
numeric_columns = ['alcohol_consumption_mean_weekly_units', 'fruit_vegetable_consumption_mean_daily_portions', 'mental_wellbeing', 'self_assessed_health']
skewness = health_survey_clean[numeric_columns].skew()
print(skewness)
alcohol_consumption_mean_weekly_units              3.999931
fruit_vegetable_consumption_mean_daily_portions   -0.290814
mental_wellbeing                                  -0.290814
self_assessed_health                              -0.233775
dtype: float64
Recall that a skewness value above 1 (or below -1) indicates a highly skewed distribution. We can see that alcohol_consumption_mean_weekly_units has a skewness of about 4, indicating that it is strongly right-skewed.
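Since matplotlib is already imported, a quick histogram makes the long right tail easy to see (a minimal sketch; the bin count of 30 is an arbitrary choice):
health_survey_clean['alcohol_consumption_mean_weekly_units'].plot.hist(bins=30)
plt.xlabel('Mean weekly alcohol units')
plt.show()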
Let's correct the skewness by applying a log transformation to the alcohol_consumption_mean_weekly_units column. We can use the transform() method with np.log10 to do this:
health_survey_clean['log_alcohol_consumption'] = health_survey_clean['alcohol_consumption_mean_weekly_units'].transform(np.log10)
health_survey_clean.head()
|   | id | year | sex | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health | overall_health | log_alcohol_consumption |
|---|----|------|-----|---|---|---|---|---|---|
| 0 | 1 | 2021 | Female | 3.40 | 0.00 | 3.40 | 72.2 | 20.820 | -inf |
| 1 | 2 | 2012 | Male | 3.19 | 17.50 | 3.19 | 76.8 | 11.690 | 1.243038 |
| 2 | 3 | 2019 | Female | 3.30 | 1.91 | 3.30 | 78.3 | 20.075 | 0.281033 |
| 3 | 4 | 2009 | Male | 3.30 | 0.00 | 3.30 | 71.2 | 20.320 | -inf |
| 4 | 5 | 2010 | Female | 3.20 | 22.04 | 3.20 | 71.2 | 8.900 | 1.343212 |
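Note that rows with zero alcohol consumption end up with a log value of -inf, because the logarithm of 0 is undefined. The exercise leaves these as they are, but a common workaround, sketched below with a hypothetical column name, is to shift the values by 1 before taking the log so that zeros map to 0:
# log_alcohol_consumption_shifted is an example name, not part of the original exercise
health_survey_clean['log_alcohol_consumption_shifted'] = np.log10(
    health_survey_clean['alcohol_consumption_mean_weekly_units'] + 1
)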