Exercise 2: Data Cleaning and Preprocessing in Python¶

Now that we have a good understanding of the structure of our dataset, let's start to clean and preprocess the data. This section of the tutorial focuses on identifying and handling missing values, identifying and removing duplicates, and converting data types to ensure data consistency and cleanliness.

Handling Missing Values¶

Missing values in datasets can cause problems in data analysis: they can skew results and lead to incorrect conclusions. Therefore, before starting our analysis, we must check for and handle missing values to ensure that our analysis is accurate and reliable.
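To see why missing values are risky, note that NumPy propagates NaN through aggregations, while pandas silently skips it; either behavior can distort a statistic if missing values go unnoticed. A small illustration with hypothetical values (not from our dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical health scores with one missing entry
values = pd.Series([72.5, np.nan, 74.0, 73.1])

# NumPy propagates NaN: the mean of any array containing NaN is NaN
print(np.mean(values.to_numpy()))  # nan

# pandas skips NaN by default (skipna=True), so the mean is computed
# from only the 3 observed values (about 73.2)
print(values.mean())
```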

First, we must identify the missing values. Use the isnull() function combined with any() to check if there are any missing values in the dataset:

In [2]:
missing_values = health_survey.isnull().any().any()
print(missing_values)
True

Since this returns True, we know that health_survey has some missing values.

To see which rows contain the missing values, we can use the isnull() function along with any() to select the rows that have at least one missing value:


In [3]:
health_survey[health_survey.isnull().any(axis=1)]
Out[3]:
id year sex fruit_vegetable_consumption_mean_daily_portions alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health
13 14 2011 Male 3.20 10.45 NaN 73.5
117 118 2016 Male NaN 16.90 NaN 73.0
178 179 2012 Male 3.20 14.10 3.20 NaN
194 195 2009 Male 3.19 8.00 3.19 NaN
228 229 2008 Male 3.17 12.80 NaN 76.9
243 244 2022 Male 3.28 17.00 NaN 74.4
298 299 2016 Male NaN 58.45 NaN 73.4
414 415 2013 Female 3.25 NaN 3.25 71.5
462 463 2009 Male 3.19 NaN 3.19 74.1
525 526 2008 Male 3.18 13.80 3.18 NaN

We can see the rows with missing values in the dataset.

There are various ways of handling missing values depending on the type of data and the goal of analysis. Some common methods include:

  1. Remove rows or columns with many missing values using dropna().
  2. Replace the missing values with mean, median, or mode using fillna().
  3. Use machine learning models to predict and fill in the missing values.
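As a quick illustration of option 2, here is a minimal sketch on a hypothetical toy Series (not our health_survey data), replacing a missing value with the column mean using fillna():

```python
import pandas as pd

# Toy column with one missing value (illustrative only)
units = pd.Series([10.0, 16.0, None, 12.0])

# Replace the missing value with the mean of the observed values
filled = units.fillna(units.mean())

# The NaN is now the mean of 10, 16, and 12
print(filled.tolist())
```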

Given the size of our dataset and the fact that rows with missing values constitute only a small fraction (about 1.6%), removing these rows will not significantly impact our analysis. Therefore, to ensure the dataset is clean and ready for analysis, we will remove the rows containing missing values:

In [4]:
health_survey_clean = health_survey.dropna()

Double-check to make sure there aren't any missing values left:

In [5]:
health_survey_clean.isnull().any().any()
Out[5]:
False

Check the number of rows of our cleaned dataset using shape:

In [6]:
health_survey_clean.shape[0]
Out[6]:
621

We can see that we have removed the rows with missing values from the original health_survey dataset.

Handling Duplicates¶

Sometimes there may be duplicated rows in a dataset. We can use the duplicated() function to identify duplicate rows:

In [7]:
duplicates = health_survey_clean.duplicated().any()
print(duplicates)
True

To see which rows are duplicated, we can count how many times each unique row appears in the dataset using the value_counts() function. Any row with a count greater than 1 is a duplicate:

In [8]:
row_counts = health_survey_clean.value_counts()

row_counts.head()
Out[8]:
id   year  sex     fruit_vegetable_consumption_mean_daily_portions  alcohol_consumption_mean_weekly_units  mental_wellbeing  self_assessed_health
11   2008  Male    3.16                                             20.3                                   3.16              77.0                    2
416  2018  Female  3.22                                             15.0                                   3.22              77.1                    1
418  2015  Male    3.17                                             15.8                                   3.17              70.0                    1
419  2014  Male    3.20                                             15.0                                   3.20              76.4                    1
420  2013  Male    3.19                                             14.6                                   3.19              75.0                    1
Name: count, dtype: int64

This prints each unique row in descending order of frequency, showing the number of times each row appears. The rows with the highest counts appear at the top, making it easy to identify and quantify duplicates.

In this case, we have one row that appears twice in the dataset.
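If we want to inspect every copy of a duplicated row side by side, duplicated() also accepts keep=False, which flags all occurrences rather than all but the first. A minimal sketch on a hypothetical toy DataFrame (not our survey data):

```python
import pandas as pd

# Toy DataFrame in which rows 0 and 2 are identical (illustrative only)
df = pd.DataFrame({'id': [11, 12, 11],
                   'year': [2008, 2009, 2008]})

# keep=False marks every copy of a duplicate, so both rows are shown
print(df[df.duplicated(keep=False)])
```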

Let's remove the duplicated row in our dataset using the drop_duplicates() function:

In [9]:
health_survey_clean = health_survey_clean.drop_duplicates()

The result should have one fewer row than before, since the duplicate was removed. We can verify this by checking the number of rows of the cleaned dataset using shape:

In [10]:
shape_clean = health_survey_clean.shape
print(shape_clean)
(620, 7)

To show only the number of rows and not the number of columns, we can use square brackets [] to access the first element of the tuple returned by shape.

The shape attribute returns a tuple (number_of_rows, number_of_columns), and by indexing this tuple with shape[0], we retrieve only the number of rows:

In [11]:
num_rows_clean = health_survey_clean.shape[0]
print(num_rows_clean)
620

Converting Data Types¶

Converting data types is an important step in data preprocessing. It can optimize memory usage and improve the performance of data manipulation operations. In pandas, we can use the astype() function to convert data types.

For example, we can convert the column sex from an object (string) to a category, which is more appropriate for categorical data.

First, we must select the column sex from the DataFrame. We can use square brackets [] to access the specified column:

In [12]:
sex = health_survey_clean['sex']
sex.head()
Out[12]:
0    Female
1      Male
2    Female
3      Male
4    Female
Name: sex, dtype: object

This shows the first 5 elements of the sex column.

We can check the type of the sex column using dtype:

In [13]:
health_survey_clean['sex'].dtype
Out[13]:
dtype('O')

dtype('O') indicates that sex is an object type, since object types are the default type for string data and other non-numeric data in pandas.

Now, let's convert the sex column from object to category:

In [14]:
health_survey_clean['sex'] = health_survey_clean['sex'].astype('category')

Verify that the column has now changed types using dtype:

In [15]:
health_survey_clean['sex'].dtype
Out[15]:
CategoricalDtype(categories=['Female', 'Male'], ordered=False)
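As a side benefit, the category dtype typically uses far less memory than object for a column with few distinct values, since each label is stored only once and each row holds a small integer code. A quick sketch on a hypothetical column (exact sizes vary by platform):

```python
import pandas as pd

# Toy column of repeated string labels (illustrative only)
sex = pd.Series(['Male', 'Female'] * 10_000)

# deep=True includes the memory used by the string objects themselves
as_object = sex.memory_usage(deep=True)
as_category = sex.astype('category').memory_usage(deep=True)

# The category version should be substantially smaller
print(as_object, as_category)
```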

Now that we have removed NAs and duplicates, and each column is in the right data type, let's move on to learning how to perform data transformations.