Open RStudio and create a new R Script or R Markdown document, depending on your preference, and name it “scottish_health_survey”.
The first step is to install (if you have not yet done so) and load all the relevant packages. As mentioned in the previous exercise, we will be using the tidyverse package for data manipulation and visualization.
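A minimal sketch of the install-if-missing step, assuming you have an internet connection and permission to install packages. The CRAN mirror URL is one common choice, not a requirement of this tutorial:

```r
# Install tidyverse only when it is not already available,
# so re-running the script does not reinstall it every time.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}
```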
Once the tidyverse package is installed, load it into your R Script or R Markdown document:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
For this tutorial, we will use a dataset containing data on various health indicators, behaviors, and outcomes for the Scottish population (Scottish Government, 2022).
Download the CSV file “scottish_health_data.csv” from the MANTRA website and move it to the “data” folder in your working directory.
Next, import the dataset, naming it health_survey.
health_survey <- read.csv("data/scottish_health_data.csv")
The health_survey dataset should now appear in the Environment tab of RStudio.
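If the import fails, the usual cause is that the file is not where R expects it. The following sketch checks the path before reading; csv_path is a name introduced here for illustration:

```r
# Build the relative path used in this tutorial
# (a "data" folder inside the working directory).
csv_path <- file.path("data", "scottish_health_data.csv")

if (file.exists(csv_path)) {
  health_survey <- read.csv(csv_path)
} else {
  # Show where R is looking, so the file can be moved to the right place.
  message("Could not find ", csv_path, " inside ", getwd())
}
```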
There are various functions that allow you to do some initial exploration of data and get a feel for the dataset’s features.
For example, the head() function displays the first six rows of the dataset:
head(health_survey)
If you are using an R Script and run it with source() rather than line by line, results are not printed automatically. In that case, wrap the call in the print() function to see the results in the console.
Our dataset provides details on the year the data was collected, the sex of the individuals, their average daily consumption of fruits and vegetables, their average weekly alcohol consumption, their mental wellbeing scores, and their self-assessed general health status.
The tail() function displays the last six rows of the dataset:
tail(health_survey)
Use the dim() function to find out the dimensions of the dataset:
dim(health_survey)
## [1] 631 7
This tells us that the dataset has 631 rows and 7 columns.
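As a small illustration of these inspection functions, the sketch below uses a made-up data frame (toy is not part of the tutorial data). Both head() and tail() accept an n argument, and dim() always reports rows first, then columns:

```r
# Made-up data for illustration only.
toy <- data.frame(id = 1:8, score = c(3.1, 3.4, 2.9, 3.3, 3.0, 3.6, 3.2, 3.5))

head(toy, n = 3)  # first three rows instead of the default six
tail(toy, n = 2)  # last two rows
dim(toy)          # number of rows, then number of columns
nrow(toy)         # rows only
ncol(toy)         # columns only
```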
Use the str() function to get a concise summary of the dataset, including the data type of each column:
str(health_survey)
## 'data.frame': 631 obs. of 7 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 2021 2012 2019 2009 2010 2022 2014 2022 2016 2008 ...
## $ sex : chr "Female" "Male" "Female" "Male" ...
## $ fruit_vegetable_consumption_mean_daily_portions: num 3.4 3.19 3.3 3.3 3.2 3.28 3.1 3.21 3 3.3 ...
## $ alcohol_consumption_mean_weekly_units : num 0 17.5 1.91 0 22.04 ...
## $ mental_wellbeing : num 3.4 3.19 3.3 3.3 3.2 3.28 3.1 3.21 3 3.3 ...
## $ self_assessed_health : num 72.2 76.8 78.3 71.2 71.2 75.1 71.3 69 75 78.6 ...
The columns “fruit_vegetable_consumption_mean_daily_portions”, “alcohol_consumption_mean_weekly_units”, “mental_wellbeing”, and “self_assessed_health” are all numeric (floating-point) data types, while “id” and “year” are integers, and “sex” is a character data type.
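To list only the column types without the preview values, one option is to apply class() to each column. A sketch with a made-up data frame whose columns mimic the types above:

```r
# Made-up data for illustration only.
toy <- data.frame(
  id = 1:3,
  sex = c("Female", "Male", "Female"),
  units = c(0, 17.5, 1.91),
  stringsAsFactors = FALSE  # keep text as character on older R versions too
)

# One class name per column.
sapply(toy, class)
```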
Finally, use the summary() function to get summary statistics for each column in the dataset:
summary(health_survey)
## id year sex
## Min. : 1.0 Min. :2008 Length:631
## 1st Qu.:157.5 1st Qu.:2011 Class :character
## Median :315.0 Median :2014 Mode :character
## Mean :315.0 Mean :2015
## 3rd Qu.:472.5 3rd Qu.:2018
## Max. :630.0 Max. :2022
##
## fruit_vegetable_consumption_mean_daily_portions
## Min. :2.700
## 1st Qu.:3.180
## Median :3.200
## Mean :3.209
## 3rd Qu.:3.240
## Max. :3.600
## NA's :2
## alcohol_consumption_mean_weekly_units mental_wellbeing self_assessed_health
## Min. : 0.0 Min. :2.700 Min. :67.00
## 1st Qu.: 9.3 1st Qu.:3.180 1st Qu.:72.00
## Median : 13.0 Median :3.200 Median :74.00
## Mean : 15.1 Mean :3.209 Mean :73.83
## 3rd Qu.: 16.7 3rd Qu.:3.240 3rd Qu.:75.70
## Max. :126.9 Max. :3.600 Max. :79.50
## NA's :2 NA's :5 NA's :3
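For numeric and integer variables, this summary provides the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values of each column, plus the number of missing values (NA's) where there are any. For character columns such as “sex”, it reports only the length, class, and mode.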
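The NA's row matters because most summary functions propagate missing values unless told otherwise. A short sketch with a made-up vector:

```r
# Made-up values for illustration only.
units <- c(0, 17.5, NA, 22.04)

summary(units)             # includes an NA's entry because one value is missing
mean(units)                # NA: the missing value propagates
mean(units, na.rm = TRUE)  # mean of the three observed values
```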
In R, you can convert objects from one type to another using various as.* functions. Data type conversion is particularly useful when you need to manipulate or analyze data in a specific format. Here are some examples of common data type conversions:
as.numeric(FALSE)
## [1] 0
as.logical(1)
## [1] TRUE
as.numeric("123.45")
## [1] 123.45
as.character(123.45)
## [1] "123.45"
as.Date("2023-06-21") # Returns a Date object representing June 21, 2023
## [1] "2023-06-21"
as.integer(123.45)
## [1] 123
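Two behaviours of these conversions are easy to miss: as.integer() truncates toward zero rather than rounding, and text that cannot be parsed becomes NA. A short sketch:

```r
as.integer(2.9)    # 2, not 3: the decimal part is dropped
as.integer(-2.9)   # -2: truncation is toward zero
round(2.9)         # 3: use round() when rounding is what you want

as.numeric("abc")  # NA, with a warning about coercion
as.logical("yes")  # NA: only strings such as "TRUE", "true" or "T" are recognised
```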
Factors in R are used to handle categorical data. Factors are variables that take on a limited number of different values. In R, factors are stored as integers with a corresponding set of character values to use when the factor is displayed. This makes factors useful for handling categorical variables, such as “sex” or “blood type”.
blood_type <- as.factor(c("O", "A", "B", "AB"))
blood_type # Categorical variable with 4 levels
## [1] O A B AB
## Levels: A AB B O
as.character(blood_type) # Returns a character vector
## [1] "O" "A" "B" "AB"
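To see the integer codes behind a factor, convert it with as.integer(). The sketch below reuses blood_type from above and adds one made-up factor, readings, to show a common pitfall when numbers are stored as factor labels:

```r
blood_type <- as.factor(c("O", "A", "B", "AB"))
as.integer(blood_type)  # position of each value in the levels A, AB, B, O

# Made-up factor whose labels look like numbers.
readings <- factor(c("10", "20", "10"))
as.numeric(readings)                # the codes (1 2 1), not the values
as.numeric(as.character(readings))  # the values (10 20 10)
```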
In our dataset, let’s convert the “sex” variable to a factor, using the $ operator to select the variable and the factor() function:
health_survey$sex <- factor(health_survey$sex)
This ensures that “sex” is treated as a categorical variable with specific levels (“Male”/“Female”), making it more suitable for analysis.
Use the levels() function to check what the possible levels are for the variable:
levels(health_survey$sex)
## [1] "Female" "Male"
The levels() function only reports the labels; it does not change the data. The encoding happens when factor() creates the factor: by default the levels are sorted alphabetically and each value is stored as an integer code pointing to its level, so “Female” is encoded as 1 and “Male” as 2. Many modelling and plotting functions in R rely on this encoding to treat the variable as categorical.
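If a different level order is wanted, for example to choose the reference category in a model, factor() accepts an explicit levels argument. A sketch with a made-up vector:

```r
# Made-up values for illustration only.
sex <- c("Female", "Male", "Female", "Male", "Male")

default_order <- factor(sex)
levels(default_order)      # "Female" "Male"
as.integer(default_order)  # 1 2 1 2 2

custom_order <- factor(sex, levels = c("Male", "Female"))
levels(custom_order)       # "Male" "Female"
as.integer(custom_order)   # 2 1 2 1 1

table(custom_order)        # counts per level, in level order
```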
Since we are only given information on the year the data was collected, rather than the full date, it is unsuitable to use the Date type to represent the “year” column. We will leave it in its integer form.
Use the str() function to recheck the data types of each column:
str(health_survey)
## 'data.frame': 631 obs. of 7 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 2021 2012 2019 2009 2010 2022 2014 2022 2016 2008 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 2 2 2 1 ...
## $ fruit_vegetable_consumption_mean_daily_portions: num 3.4 3.19 3.3 3.3 3.2 3.28 3.1 3.21 3 3.3 ...
## $ alcohol_consumption_mean_weekly_units : num 0 17.5 1.91 0 22.04 ...
## $ mental_wellbeing : num 3.4 3.19 3.3 3.3 3.2 3.28 3.1 3.21 3 3.3 ...
## $ self_assessed_health : num 72.2 76.8 78.3 71.2 71.2 75.1 71.3 69 75 78.6 ...
Now, each column in our dataset is correctly formatted to its most suitable type and ready for analysis.
Missing values in datasets can cause problems in data analysis: they can skew results and lead to incorrect conclusions. Therefore, before starting our analysis, we must check for and handle missing values to ensure that our analysis is accurate and reliable.
First, we must identify the missing values. Use the any() function combined with is.na() to check if there are any missing values in the dataset:
any(is.na(health_survey))
## [1] TRUE
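any(is.na()) only says whether something is missing. To see how many values are missing in each column, colSums() can be applied to the same logical matrix. A sketch with a made-up data frame:

```r
# Made-up data for illustration only.
toy <- data.frame(
  id = 1:5,
  units = c(0, NA, 1.91, NA, 22.04),
  wellbeing = c(3.4, 3.19, NA, 3.3, 3.2)
)

colSums(is.na(toy))  # missing values per column
sum(is.na(toy))      # missing values in total
```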
Since this returns TRUE, we know that health_survey has some missing values.
To see which rows contain the missing values, use the filter() function along with rowSums() to keep only the rows that have at least one missing value:
health_survey_na <- health_survey %>%
  filter(rowSums(is.na(.)) > 0)
health_survey_na
We can see that there are 10 entries with missing values in the dataset.
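The same rows can be selected in base R with complete.cases(), which returns TRUE for rows without any missing value. A sketch with a made-up data frame:

```r
# Made-up data for illustration only.
toy <- data.frame(
  id = 1:4,
  units = c(0, NA, 1.91, 22.04),
  wellbeing = c(3.4, 3.19, NA, 3.3)
)

# Negate complete.cases() to keep the incomplete rows.
rows_with_na <- toy[!complete.cases(toy), ]
rows_with_na        # the rows with id 2 and 3
nrow(rows_with_na)  # 2
```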
There are various ways of handling missing values depending on the type of data and the goal of analysis. Some common methods include removing the rows that contain missing values with na.omit() and replacing missing values with a chosen value using replace_na().
Given the size of our dataset and the fact that rows with missing values constitute only a small fraction (about 1.6%), removing these rows is unlikely to significantly affect our analysis. Therefore, to ensure the dataset is clean and ready for analysis, we will remove the rows containing missing values:
health_survey_clean <- na.omit(health_survey)
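For comparison, here is a sketch of the replacement approach using replace_na() from the tidyr package (part of the tidyverse). The data frame and the choice of the median as the fill value are assumptions for illustration, not the method used in this tutorial:

```r
library(tidyr)

# Made-up data for illustration only.
toy <- data.frame(
  id = 1:4,
  units = c(10, NA, 14, 12)
)

# Fill missing values in `units` with the median of the observed values.
fill_value <- median(toy$units, na.rm = TRUE)
toy_filled <- replace_na(toy, list(units = fill_value))
toy_filled$units  # 10 12 14 12
```

The rest of this exercise continues with the na.omit() result, health_survey_clean.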
Check the dataset again to ensure that there are no missing values left:
any(is.na(health_survey_clean))
## [1] FALSE
We can check the number of rows in our cleaned dataset using nrow():
nrow(health_survey_clean)
## [1] 621
We can see that we have removed 10 rows from the original health_survey dataset.
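The share of rows removed can be checked from the row counts reported above:

```r
rows_before <- 631  # from dim(health_survey)
rows_after <- 621   # from nrow(health_survey_clean)

rows_removed <- rows_before - rows_after
pct_removed <- round(100 * rows_removed / rows_before, 1)

rows_removed  # 10
pct_removed   # 1.6
```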
Sometimes, there may be duplicated rows in a dataset. We can use the duplicated() function to identify duplicate rows:
any(duplicated(health_survey_clean))
## [1] TRUE
We see that there is at least one duplicated row in our dataset.
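duplicated() marks only the second and later copies of a row, so summing its result counts the surplus rows. A sketch with a made-up data frame:

```r
# Made-up data for illustration only; the second and third rows are identical.
toy <- data.frame(
  id = c(1, 2, 2, 3),
  units = c(5, 7, 7, 9)
)

duplicated(toy)         # FALSE FALSE TRUE FALSE
sum(duplicated(toy))    # 1 surplus row
toy[duplicated(toy), ]  # the repeated row itself
```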
The unique() function returns the dataset with duplicate rows removed, so we can use it to count the unique rows:
health_survey_unique <- unique(health_survey_clean)
nrow(health_survey_unique)
## [1] 620
This has one fewer row than the health_survey_clean dataset, so one row was a duplicate.
Let’s remove this duplicated row from our dataset using the distinct() function from the dplyr package:
health_survey_clean <- health_survey_clean %>% distinct()
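A quick way to confirm the result is to re-run the duplicate check after distinct(). A sketch with a made-up data frame, assuming dplyr is installed:

```r
library(dplyr)

# Made-up data for illustration only; the second and third rows are identical.
toy <- data.frame(
  id = c(1, 2, 2, 3),
  units = c(5, 7, 7, 9)
)

toy_distinct <- toy %>% distinct()
any(duplicated(toy_distinct))  # FALSE
nrow(toy_distinct)             # 3, the same as nrow(unique(toy))
```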
Our dataset is now clean and ready for analysis. Move on to Exercise 3 to learn about data transformations.