Exercise 2: Data Cleaning and Preprocessing in Python¶

Now that we have a good understanding of the structure of our dataset, let's start to clean and preprocess the data. This section of the tutorial focuses on identifying and handling missing values, identifying and removing duplicates, and converting data types to ensure data consistency and cleanliness.

Handling Missing Values¶

Missing values in datasets can cause problems in data analysis: they can skew results and lead to incorrect conclusions. Therefore, before starting our analysis, we must check for and handle missing values to ensure that our analysis is accurate and reliable.
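To see why missing values are risky, note that NumPy propagates NaN through aggregations, while pandas silently skips it; either behavior can distort a statistic if missing values go unnoticed. A small illustration with hypothetical values (not from our dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical health scores with one missing entry
values = pd.Series([72.5, np.nan, 74.0, 73.1])

# NumPy propagates NaN: the mean of any array containing NaN is NaN
print(np.mean(values.to_numpy()))  # nan

# pandas skips NaN by default (skipna=True), so the mean is computed
# from only the 3 observed values (about 73.2)
print(values.mean())
```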

First, we must identify the missing values. Use the isnull() function combined with any() to check if there are any missing values in the dataset:

In [2]:
missing_values = health_survey.isnull().any().any()
print(missing_values)
True

Since this returns True, we know that health_survey has some missing values.

To see which rows contain the missing values, we can use the isnull() function along with any() to select the rows that have at least one missing value:


In [3]:
health_survey[health_survey.isnull().any(axis=1)]
Out[3]:
id year sex fruit_vegetable_consumption_mean_daily_portions alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health
13 14 2011 Male 3.20 10.45 NaN 73.5
117 118 2016 Male NaN 16.90 NaN 73.0
178 179 2012 Male 3.20 14.10 3.20 NaN
194 195 2009 Male 3.19 8.00 3.19 NaN
228 229 2008 Male 3.17 12.80 NaN 76.9
243 244 2022 Male 3.28 17.00 NaN 74.4
298 299 2016 Male NaN 58.45 NaN 73.4
414 415 2013 Female 3.25 NaN 3.25 71.5
462 463 2009 Male 3.19 NaN 3.19 74.1
525 526 2008 Male 3.18 13.80 3.18 NaN

We can see the rows with missing values in the dataset.

There are various ways of handling missing values depending on the type of data and the goal of analysis. Some common methods include:

  1. Remove rows or columns with many missing values using dropna().
  2. Replace the missing values with mean, median, or mode using fillna().
  3. Use machine learning models to predict and fill in the missing values.
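As a quick illustration of option 2, here is a minimal sketch on a hypothetical toy Series (not our health_survey data), replacing a missing value with the column mean using fillna():

```python
import pandas as pd

# Toy column with one missing value (illustrative only)
units = pd.Series([10.0, 16.0, None, 12.0])

# Replace the missing value with the mean of the observed values
filled = units.fillna(units.mean())

# The NaN is now the mean of 10, 16, and 12
print(filled.tolist())
```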

Given the size of our dataset and the fact that rows with missing values constitute only a small fraction (about 1.6%), removing these rows will not significantly impact our analysis. Therefore, to ensure the dataset is clean and ready for analysis, we will remove the rows containing missing values:

In [4]:
health_survey_clean = health_survey.dropna()

Double-check to make sure there aren't any missing values left:

In [5]:
health_survey_clean.isnull().any().any()
Out[5]:
False

Check the number of rows of our cleaned dataset using shape:

In [6]:
health_survey_clean.shape[0]
Out[6]:
621

We can see that we have removed the rows with missing values from the original health_survey dataset.

Handling Duplicates¶

Sometimes there may be duplicated rows in a dataset. We can use the duplicated() function to identify duplicate rows:

In [7]:
duplicates = health_survey_clean.duplicated().any()
print(duplicates)
True

To see which rows are duplicated, we can count how many times each unique row appears in the dataset using the value_counts() function. Any row with a count greater than 1 is a duplicate:

In [8]:
row_counts = health_survey_clean.value_counts()

row_counts.head()
Out[8]:
id   year  sex     fruit_vegetable_consumption_mean_daily_portions  alcohol_consumption_mean_weekly_units  mental_wellbeing  self_assessed_health
11   2008  Male    3.16                                             20.3                                   3.16              77.0                    2
416  2018  Female  3.22                                             15.0                                   3.22              77.1                    1
418  2015  Male    3.17                                             15.8                                   3.17              70.0                    1
419  2014  Male    3.20                                             15.0                                   3.20              76.4                    1
420  2013  Male    3.19                                             14.6                                   3.19              75.0                    1
Name: count, dtype: int64

This prints each unique row in descending order of frequency, showing the number of times each row appears. The rows with the highest counts appear at the top, making it easy to identify and quantify duplicates.

In this case, we have one row that appears twice in the dataset.
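If we want to inspect every copy of a duplicated row side by side, duplicated() also accepts keep=False, which flags all occurrences rather than all but the first. A minimal sketch on a hypothetical toy DataFrame (not our survey data):

```python
import pandas as pd

# Toy DataFrame in which rows 0 and 2 are identical (illustrative only)
df = pd.DataFrame({'id': [11, 12, 11],
                   'year': [2008, 2009, 2008]})

# keep=False marks every copy of a duplicate, so both rows are shown
print(df[df.duplicated(keep=False)])
```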

Let's remove the duplicated row in our dataset using the drop_duplicates() function:

In [9]:
health_survey_clean = health_survey_clean.drop_duplicates()

The result should have one fewer row than before, since the duplicate was removed. We can verify this by checking the number of rows of the cleaned dataset using shape:

In [10]:
shape_clean = health_survey_clean.shape
print(shape_clean)
(620, 7)

To show only the number of rows and not the number of columns, we can use square brackets [] to access the first element of the tuple returned by shape.

The shape attribute returns a tuple (number_of_rows, number_of_columns), and by indexing this tuple with shape[0], we retrieve only the number of rows:

In [11]:
num_rows_clean = health_survey_clean.shape[0]
print(num_rows_clean)
620

Converting Data Types¶

Converting data types is an important step in data preprocessing. It can optimize memory usage and improve the performance of data manipulation operations. In pandas, we can use the astype() function to convert data types.

For example, we can convert the column sex from an object (string) to a category, which is more appropriate for categorical data.

First, we must select the column sex from the DataFrame. We can use square brackets [] to access the specified column:

In [12]:
sex = health_survey_clean['sex']
sex.head()
Out[12]:
0    Female
1      Male
2    Female
3      Male
4    Female
Name: sex, dtype: object

This shows the first 5 elements of the sex column.

We can check the type of the sex column using dtype:

In [13]:
health_survey_clean['sex'].dtype
Out[13]:
dtype('O')

dtype('O') indicates that sex is an object type, since object types are the default type for string data and other non-numeric data in pandas.

Now, let's convert the sex column from object to category:

In [14]:
health_survey_clean['sex'] = health_survey_clean['sex'].astype('category')

Verify that the column has now changed types using dtype:

In [15]:
health_survey_clean['sex'].dtype
Out[15]:
CategoricalDtype(categories=['Female', 'Male'], ordered=False)
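As a side benefit, the category dtype typically uses far less memory than object for a column with few distinct values, since each label is stored only once and each row holds a small integer code. A quick sketch on a hypothetical column (exact sizes vary by platform):

```python
import pandas as pd

# Toy column of repeated string labels (illustrative only)
sex = pd.Series(['Male', 'Female'] * 10_000)

# deep=True includes the memory used by the string objects themselves
as_object = sex.memory_usage(deep=True)
as_category = sex.astype('category').memory_usage(deep=True)

# The category version should be substantially smaller
print(as_object, as_category)
```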

Now that we have removed NAs and duplicates, and each column is in the right data type, let's move on to learning how to perform data transformations.