Exercise 2: Data Cleaning and Preprocessing

Project Setup

Open RStudio and create a new R Script or R Markdown, depending on your preference. Name your new R Script or R Markdown “scottish_health_survey”.

The first step is to install (if you have not yet done so) and load all the relevant packages. As mentioned in the previous exercise, we will be using the tidyverse package for data manipulation and visualization.
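As a minimal sketch of the install step, assuming you have an internet connection and a CRAN mirror configured, the following installs tidyverse only when it is missing, so it is safe to re-run:

```r
# Install tidyverse only if it is not already available on this machine
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
```

Running install.packages("tidyverse") on its own also works; the check simply avoids reinstalling the package every time the script runs.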

Once the tidyverse package is installed, load the package into the R Script or R Markdown:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Data Exploration

For this tutorial, we will use a dataset containing data on various health indicators, behaviors, and outcomes for the Scottish population (Scottish Government, 2022).

Download the CSV file “scottish_health_data.csv” from the MANTRA website and move it to the “data” folder in your working directory.

Next, import the dataset, naming it health_survey.

health_survey <- read.csv("data/scottish_health_data.csv")

The health_survey dataset should appear in the Environment tab of RStudio.

There are various functions that allow you to do some initial exploration of data and get a feel for the dataset’s features.

For example, the head() function displays the first six rows of the dataset:

head(health_survey)

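If you run a line such as head(health_survey) directly in an R Script, the result is printed to the console automatically. You only need to wrap the call in print() when it sits inside a loop or a function, or when the whole script is run with source().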
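The sketch below illustrates one situation where an explicit print() matters: inside a loop, R evaluates an expression without displaying it. The toy data frame is an assumption made for illustration and stands in for health_survey.

```r
# Toy data frame standing in for health_survey (assumption for illustration)
toy <- data.frame(id = 1:3, sex = c("Female", "Male", "Female"))

for (n in 1:2) {
  head(toy, n)         # evaluated, but nothing is shown
  print(head(toy, n))  # shown, because print() is called explicitly
}
```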

Our dataset provides details on the year the data was collected, the sex of the individuals, their average daily consumption of fruits and vegetables, their average weekly alcohol consumption, their mental wellbeing scores, and their self-assessed general health status.

The tail() function displays the last six rows of the dataset:

tail(health_survey)

Use the dim() function to find out the dimensions of the dataset:

dim(health_survey)

## [1] 631   7

This tells us that the dataset has 631 rows and 7 columns.
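dim() returns the number of rows first and the number of columns second. If you only need one of the two, nrow() and ncol() return them separately, and names() lists the column names. A small sketch on a toy data frame (made up for illustration):

```r
# Toy data frame (assumption for illustration only)
toy <- data.frame(
  id   = 1:4,
  year = c(2021L, 2012L, 2019L, 2009L),
  sex  = c("Female", "Male", "Female", "Male")
)

dim(toy)    # rows first, then columns
nrow(toy)   # number of rows only
ncol(toy)   # number of columns only
names(toy)  # column names
```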

Use the str() function to get a concise summary of the dataset, including the data types of each column:

str(health_survey)

## 'data.frame':    631 obs. of  7 variables:
##  $ id                                             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year                                           : int  2021 2012 2019 2009 2010 2022 2014 2022 2016 2008 ...
##  $ sex                                            : chr  "Female" "Male" "Female" "Male" ...
##  $ fruit_vegetable_consumption_mean_daily_portions: num  3.4 3.19 3.3 3.3 3.2 3.28 3.1 3.21 3 3.3 ...
##  $ alcohol_consumption_mean_weekly_units          : num  0 17.5 1.91 0 22.04 ...
##  $ mental_wellbeing                               : num  3.4 3.19 3.3 3.3 3.2 3.28 3.1 3.21 3 3.3 ...
##  $ self_assessed_health                           : num  72.2 76.8 78.3 71.2 71.2 75.1 71.3 69 75 78.6 ...

The “id” and “year” columns are integer data types, “sex” is a character data type, and the remaining four columns (“fruit_vegetable_consumption_mean_daily_portions”, “alcohol_consumption_mean_weekly_units”, “mental_wellbeing”, and “self_assessed_health”) are numeric (floating-point) data types.

Finally, use the summary() function to get summary statistics for each column in the dataset:

summary(health_survey)

##        id             year          sex           
##  Min.   :  1.0   Min.   :2008   Length:631        
##  1st Qu.:157.5   1st Qu.:2011   Class :character  
##  Median :315.0   Median :2014   Mode  :character  
##  Mean   :315.0   Mean   :2015                     
##  3rd Qu.:472.5   3rd Qu.:2018                     
##  Max.   :630.0   Max.   :2022                     
##                                                   
##  fruit_vegetable_consumption_mean_daily_portions
##  Min.   :2.700                                  
##  1st Qu.:3.180                                  
##  Median :3.200                                  
##  Mean   :3.209                                  
##  3rd Qu.:3.240                                  
##  Max.   :3.600                                  
##  NA's   :2                                      
##  alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health
##  Min.   :  0.0                         Min.   :2.700    Min.   :67.00       
##  1st Qu.:  9.3                         1st Qu.:3.180    1st Qu.:72.00       
##  Median : 13.0                         Median :3.200    Median :74.00       
##  Mean   : 15.1                         Mean   :3.209    Mean   :73.83       
##  3rd Qu.: 16.7                         3rd Qu.:3.240    3rd Qu.:75.70       
##  Max.   :126.9                         Max.   :3.600    Max.   :79.50       
##  NA's   :2                             NA's   :5        NA's   :3

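For numeric and integer variables, this summary provides the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values of each column, along with the number of missing values (NA's) where there are any. For character variables such as “sex”, it reports only the length, class, and mode.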

Data Type Conversion

In R, you can convert objects from one type to another using various as.* functions. Data type conversion is particularly useful when you need to manipulate or analyze data in a specific format. Here are some examples of common data type conversions:

Converting logical to numeric:

as.numeric(FALSE)

## [1] 0

Converting numeric to logical:

as.logical(1)

## [1] TRUE

Converting character to numeric:

as.numeric("123.45")

## [1] 123.45

Converting numeric to character:

as.character(123.45)

## [1] "123.45"

Converting dates:

as.Date("2023-06-21") # Returns a Date object representing June 21, 2023

## [1] "2023-06-21"

Converting numeric to integer:

as.integer(123.45)

## [1] 123
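Two behaviours of these conversions are easy to overlook: as.integer() drops the decimal part rather than rounding, and as.numeric() returns NA (with a warning) when the text is not a valid number. A short sketch, with example values chosen for illustration:

```r
# as.integer() truncates toward zero; round() rounds to the nearest whole number
truncated_pos <- as.integer(2.7)   # 2
truncated_neg <- as.integer(-2.7)  # -2
rounded       <- round(2.7)        # 3

# Text that is not a valid number becomes NA (R also issues a warning,
# suppressed here to keep the output tidy)
not_a_number <- suppressWarnings(as.numeric("12 units"))  # NA
```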

Factors in R are used to handle categorical data. Factors are variables that take on a limited number of different values. In R, factors are stored as integers with a corresponding set of character values to use when the factor is displayed. This makes factors useful for handling categorical variables, such as “sex” or “blood type”.

Converting character to factor:

blood_type <- as.factor(c("O", "A", "B", "AB"))

blood_type # Categorical variable with 4 levels

## [1] O  A  B  AB
## Levels: A AB B O

Converting factor to character:

as.character(blood_type)  # Returns a character vector

## [1] "O"  "A"  "B"  "AB"
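One common pitfall is worth a short sketch: when a factor holds numbers stored as text, calling as.numeric() directly returns the internal integer codes rather than the original values. Converting to character first avoids this. The values below are made up for illustration.

```r
# A factor whose labels happen to be numbers stored as text
portions <- factor(c("10", "5", "20"))

codes  <- as.numeric(portions)                # internal integer codes, not the original numbers
values <- as.numeric(as.character(portions))  # 10 5 20
```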

In our dataset, let’s convert the “sex” variable to a factor, using the $ sign to select the variable and the factor() function:

health_survey$sex <- factor(health_survey$sex)

This ensures that “sex” is treated as a categorical variable with specific levels (“Male”/“Female”), making it more suitable for analysis.

Use the levels() function to check what the possible levels are for the variable:

levels(health_survey$sex)

## [1] "Female" "Male"

Note that levels() only reports the labels. The integer codes are assigned by factor(), which by default orders the levels alphabetically and therefore stores “Female” as 1 and “Male” as 2 (visible in the str() output below). Many modelling and plotting functions in R rely on this encoding to treat the variable as categorical.
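The integer codes behind a factor can be inspected with as.integer(), and the order of the levels can be set explicitly through the levels argument of factor(). A minimal sketch on a toy vector (not the survey data):

```r
# Toy vector standing in for health_survey$sex (assumption for illustration)
sex <- factor(c("Female", "Male", "Female"))

levels(sex)      # "Female" "Male"  (alphabetical by default)
as.integer(sex)  # 1 2 1

# Setting the level order explicitly changes the underlying codes
sex_male_first <- factor(sex, levels = c("Male", "Female"))
as.integer(sex_male_first)  # 2 1 2

# Count how many observations fall into each level
table(sex)
```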

Since we are only given information on the year the data was collected, rather than the full date, it is unsuitable to use the Date type to represent the “year” column. We will leave it in its integer form.

Use the str() function to recheck the data types of each column:

str(health_survey)

## 'data.frame':    631 obs. of  7 variables:
##  $ id                                             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year                                           : int  2021 2012 2019 2009 2010 2022 2014 2022 2016 2008 ...
##  $ sex                                            : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 2 2 2 1 ...
##  $ fruit_vegetable_consumption_mean_daily_portions: num  3.4 3.19 3.3 3.3 3.2 3.28 3.1 3.21 3 3.3 ...
##  $ alcohol_consumption_mean_weekly_units          : num  0 17.5 1.91 0 22.04 ...
##  $ mental_wellbeing                               : num  3.4 3.19 3.3 3.3 3.2 3.28 3.1 3.21 3 3.3 ...
##  $ self_assessed_health                           : num  72.2 76.8 78.3 71.2 71.2 75.1 71.3 69 75 78.6 ...

Now, each column in our dataset is correctly formatted to its most suitable type and ready for analysis.

Missing Values

Missing values in datasets can cause problems in data analysis: they can skew results and lead to incorrect conclusions. Therefore, before starting our analysis, we must check for and handle missing values to ensure that our analysis is accurate and reliable.

First, we must identify the missing values. Use the any() function combined with is.na() to check if there are any missing values in the dataset:

any(is.na(health_survey))

## [1] TRUE

Since this returns TRUE, we know that health_survey has some missing values.

To see which rows contain the missing values, use the filter() function along with rowSums() to keep only the rows that have at least one missing value:

health_survey_na <- health_survey %>%
  filter(rowSums(is.na(.)) > 0)

health_survey_na

We can see that there are 10 entries with missing values in the dataset.
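It can also help to count missing values per column and to count affected rows separately, since one row may contain more than one NA. A sketch using base R on a toy data frame (the values are made up for illustration):

```r
# Toy data frame with a few missing values (assumption for illustration)
toy <- data.frame(
  id                   = 1:5,
  mental_wellbeing     = c(3.4, NA, 3.3, 3.2, NA),
  self_assessed_health = c(72.2, 76.8, NA, 71.2, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(toy))
rows_with_na  # 3, although there are 4 missing cells in total
```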

There are various ways of handling missing values depending on the type of data and the goal of analysis. Some common methods include:

Removing rows that contain missing values using na.omit().
Replacing the missing values with the mean, median, or mode, for example using replace_na() from the tidyr package.
Using machine learning models to predict and fill in the missing values.
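As an illustration of the second approach, the sketch below fills missing values in one column with that column's median, using replace_na() from tidyr inside mutate() from dplyr. The toy data frame and its column name are assumptions for illustration; this tutorial itself removes the affected rows instead.

```r
library(dplyr)
library(tidyr)

# Toy data frame with one missing value (assumption for illustration)
toy <- data.frame(id = 1:4, alcohol_units = c(10, NA, 14, 12))

# Replace NA with the median of the observed values (here 12)
toy_imputed <- toy %>%
  mutate(alcohol_units = replace_na(alcohol_units,
                                    median(alcohol_units, na.rm = TRUE)))

toy_imputed
```

The median is often preferred over the mean when a column contains extreme values, because it is less affected by them.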

Given the size of our dataset and the fact that rows with missing values constitute only a small fraction (about 1.6%), removing these rows will not significantly impact our analysis. Therefore, to ensure the dataset is clean and ready for analysis, we will remove the rows containing missing values:
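The percentage quoted above can be checked with a quick calculation, using the row counts reported earlier (631 rows in total, 10 of them with missing values):

```r
rows_total   <- 631  # from dim(health_survey)
rows_with_na <- 10   # rows containing at least one missing value

share_with_na <- rows_with_na / rows_total * 100
round(share_with_na, 1)  # 1.6
```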

health_survey_clean <- na.omit(health_survey)

Check the dataset again to ensure that there are no missing values left:

any(is.na(health_survey_clean))

## [1] FALSE

We can check the number of rows of our cleaned dataset using nrow():

nrow(health_survey_clean)

## [1] 621

We can see that we have removed 10 rows from the original health_survey dataset.

Duplicates

Sometimes, there may be duplicated rows in a dataset. We can use the duplicated() function to identify duplicate rows:

any(duplicated(health_survey_clean))

## [1] TRUE

Since this returns TRUE, we know that at least one row in our dataset is an exact copy of an earlier row.

The unique() function returns the dataset with duplicate rows removed, which lets us count how many unique rows there are:
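To count the duplicates and inspect them, duplicated() can be summed and used for subsetting. By default it flags only the second and later copies, so the sketch below combines it with fromLast = TRUE to show every copy. The toy data frame is an assumption for illustration.

```r
# Toy data frame in which rows 2 and 3 are identical (assumption for illustration)
toy <- data.frame(
  id  = c(1, 2, 2, 3),
  sex = c("Female", "Male", "Male", "Female")
)

# Number of rows that repeat an earlier row
n_duplicates <- sum(duplicated(toy))
n_duplicates  # 1

# Show all copies of the duplicated rows, including the first occurrence
all_copies <- toy[duplicated(toy) | duplicated(toy, fromLast = TRUE), ]
all_copies
```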

health_survey_unique <- unique(health_survey_clean)

nrow(health_survey_unique)

## [1] 620

This has one fewer row than the health_survey_clean dataset.

Let’s remove this duplicated row from our dataset using the distinct() function from the dplyr package:

health_survey_clean <- health_survey_clean %>% distinct()
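Called without arguments, distinct() removes only rows that are identical in every column. If two rows share an id but differ elsewhere, they are both kept unless you name the column to deduplicate on. A sketch on toy data (made up for illustration) shows the difference:

```r
library(dplyr)

# Toy data: rows 2 and 3 are identical; rows 4 and 5 share an id but differ in year
toy <- data.frame(
  id   = c(1, 2, 2, 3, 3),
  year = c(2021, 2012, 2012, 2019, 2020)
)

# Removes only the fully identical row
toy_distinct_rows <- toy %>% distinct()
nrow(toy_distinct_rows)  # 4

# Keeps the first row for each id, retaining the other columns
toy_distinct_ids <- toy %>% distinct(id, .keep_all = TRUE)
nrow(toy_distinct_ids)   # 3
```

Which of the two is appropriate depends on whether repeated ids are expected in the data, so it is worth inspecting the duplicates before choosing.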

Our dataset is now clean and ready for analysis. Move on to Exercise 3 to learn about data transformations.