Exercise 2: Data Cleaning and Preprocessing in Python¶
Now that we have a good understanding of the structure of our dataset, let's start to clean and preprocess the data. This section of the tutorial focuses on identifying and handling missing values, identifying and removing duplicates, and converting data types to ensure data consistency and cleanliness.
Handling Missing Values¶
Missing values in datasets can cause problems in data analysis: they can skew results and lead to incorrect conclusions. Therefore, before starting our analysis, we must check for and handle missing values to ensure that our analysis is accurate and reliable.
First, we must identify the missing values. Use the isnull() function combined with any() to check whether there are any missing values in the dataset:
missing_values = health_survey.isnull().any().any()
print(missing_values)
True
Since this returns True, we know that health_survey has some missing values.
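If we also want to know how many values are missing in each column, a quick optional check (a minimal sketch, using only the dataset we already have) is to chain isnull() with sum(), which counts the missing values per column:

missing_per_column = health_survey.isnull().sum()  # count of missing values in each column
print(missing_per_column)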
To see which rows contain the missing values, we can use the isnull() function along with any(axis=1) to select the rows that have at least one missing value:
health_survey[health_survey.isnull().any(axis=1)]
|     | id  | year | sex    | fruit_vegetable_consumption_mean_daily_portions | alcohol_consumption_mean_weekly_units | mental_wellbeing | self_assessed_health |
| --- | --- | ---- | ------ | ----------------------------------------------- | ------------------------------------- | ---------------- | -------------------- |
| 13  | 14  | 2011 | Male   | 3.20                                            | 10.45                                 | NaN              | 73.5                 |
| 117 | 118 | 2016 | Male   | NaN                                             | 16.90                                 | NaN              | 73.0                 |
| 178 | 179 | 2012 | Male   | 3.20                                            | 14.10                                 | 3.20             | NaN                  |
| 194 | 195 | 2009 | Male   | 3.19                                            | 8.00                                  | 3.19             | NaN                  |
| 228 | 229 | 2008 | Male   | 3.17                                            | 12.80                                 | NaN              | 76.9                 |
| 243 | 244 | 2022 | Male   | 3.28                                            | 17.00                                 | NaN              | 74.4                 |
| 298 | 299 | 2016 | Male   | NaN                                             | 58.45                                 | NaN              | 73.4                 |
| 414 | 415 | 2013 | Female | 3.25                                            | NaN                                   | 3.25             | 71.5                 |
| 462 | 463 | 2009 | Male   | 3.19                                            | NaN                                   | 3.19             | 74.1                 |
| 525 | 526 | 2008 | Male   | 3.18                                            | 13.80                                 | 3.18             | NaN                  |
We can see the rows with missing values in the dataset.
There are various ways of handling missing values depending on the type of data and the goal of the analysis. Some common methods include:

- Removing rows or columns with many missing values using dropna().
- Replacing the missing values with the mean, median, or mode using fillna() (see the sketch after this list).
- Using machine learning models to predict and fill in the missing values.
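As an illustration of the second option, here is a minimal sketch of median imputation on one of our numeric columns. Note that this is hypothetical: we do not apply it to our dataset, since we drop the affected rows instead, as explained below.

median_units = health_survey['alcohol_consumption_mean_weekly_units'].median()  # column median, ignoring NaN
health_survey_imputed = health_survey.fillna({'alcohol_consumption_mean_weekly_units': median_units})  # hypothetical imputed copy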
Given the size of our dataset and the fact that rows with missing values constitute only a small fraction (about 1.6%), removing these rows will not significantly impact our analysis. Therefore, to ensure the dataset is clean and ready for analysis, we will remove the rows containing missing values:
health_survey_clean = health_survey.dropna()
Double-check to make sure there aren't any missing values left:
health_survey_clean.isnull().any().any()
False
Check the number of rows of our cleaned dataset using shape:
health_survey_clean.shape[0]
621
We can see that we have removed the rows with missing values from the original health_survey dataset.
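As a quick sanity check (a small sketch that assumes the original health_survey DataFrame is still in memory), we can confirm how many rows were dropped:

rows_removed = health_survey.shape[0] - health_survey_clean.shape[0]  # rows before minus rows after
print(rows_removed)  # should match the number of rows listed in the table above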
Handling Duplicates¶
Sometimes there may be duplicated rows in a dataset. We can use the duplicated() function to identify duplicate rows:
duplicates = health_survey_clean.duplicated().any()
print(duplicates)
True
To identify the duplicated rows in the dataset, we can count how many times each unique row appears using the value_counts() function, which highlights any duplicates:
row_counts = health_survey_clean.value_counts()
row_counts.head()
id   year  sex     fruit_vegetable_consumption_mean_daily_portions  alcohol_consumption_mean_weekly_units  mental_wellbeing  self_assessed_health
11   2008  Male    3.16                                             20.3                                   3.16              77.0                    2
416  2018  Female  3.22                                             15.0                                   3.22              77.1                    1
418  2015  Male    3.17                                             15.8                                   3.17              70.0                    1
419  2014  Male    3.20                                             15.0                                   3.20              76.4                    1
420  2013  Male    3.19                                             14.6                                   3.19              75.0                    1
Name: count, dtype: int64
This prints the unique rows in descending order of frequency, showing how many times each appears. The rows with the highest counts appear at the top, making it easy to identify and quantify duplicates.
In this case, we have one row that appears twice in the dataset.
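If we want to inspect the duplicate directly, here is a small optional sketch: passing keep=False to duplicated() marks every occurrence of a duplicated row rather than only the repeats, so we can display both copies side by side:

health_survey_clean[health_survey_clean.duplicated(keep=False)]  # show all copies of any duplicated row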
Let's remove the duplicated row from our dataset using the drop_duplicates() function:
health_survey_clean = health_survey_clean.drop_duplicates()
The result should have one row fewer than health_survey_clean did before, since there was one duplicate. We can verify this by checking the number of rows of the cleaned dataset using shape:
shape_clean = health_survey_clean.shape
print(shape_clean)
(620, 7)
To show only the row count and not the column count, we can use square brackets [] to access the first element of the tuple. The shape attribute returns a tuple (number_of_rows, number_of_columns), and by indexing this tuple with shape[0], we retrieve only the number of rows:
num_rows_clean = health_survey_clean.shape[0]
print(num_rows_clean)
620
Converting Data Types¶
Converting data types is an important step in data preprocessing. It can optimize memory usage and improve the performance of data manipulation operations. In Python, we can use the astype()
function to convert data types.
For example, we can convert the sex column from an object (string) type to a category, which is more appropriate for categorical data.

First, we must select the sex column from the DataFrame. We can use square brackets [] to access the specified column:
sex = health_survey_clean['sex']
sex.head()
0    Female
1      Male
2    Female
3      Male
4    Female
Name: sex, dtype: object
This shows the first 5 elements of the sex column.
We can check the type of the sex column using dtype:
health_survey_clean['sex'].dtype
dtype('O')
dtype('O') indicates that sex is stored as an object type, which is the default dtype for strings and other non-numeric data in pandas.
Now, let's convert the sex column from object to category:
health_survey_clean['sex'] = health_survey_clean['sex'].astype('category')
Verify that the column has changed type using dtype:
health_survey_clean['sex'].dtype
CategoricalDtype(categories=['Female', 'Male'], ordered=False)
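To see the memory benefit mentioned earlier, here is an optional sketch comparing the column's memory footprint before and after the conversion; the variable names are ours, and the exact numbers will depend on the data:

as_object = health_survey_clean['sex'].astype('object').memory_usage(deep=True)  # footprint stored as strings
as_category = health_survey_clean['sex'].memory_usage(deep=True)  # footprint stored as category codes
print(as_object, as_category)  # the category representation should be considerably smaller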
Now that we have removed missing values and duplicates, and each column has an appropriate data type, let's move on to learning how to perform data transformations.