At this stage, we have a good understanding of our dataset’s characteristics, it has been cleaned, and it is now ready for analysis.
Open “scottish-health-survey.R” in RStudio. The first thing we will learn is how to filter and select specific rows and columns according to certain criteria.
We can do this using the filter()
and
select()
functions from dplyr
. The
filter()
function is used to filter rows based on
conditions, while the select()
function is used to select
specific columns.
For example, to filter for entries from the year 2019, we filter on
the condition that the entry in the year
column is equal to
“2019”. If this condition is met, that row is included, otherwise it is
excluded:
year_2019 <- health_survey_clean %>%
filter(year == "2019")
head(year_2019)
Next, we can select only the “fruit_vegetable_consumption_mean_daily_portions” and “alcohol_consumption_mean_weekly_units” columns of our dataset, dropping all other columns:
diet <- health_survey_clean %>%
select(fruit_vegetable_consumption_mean_daily_portions, alcohol_consumption_mean_weekly_units)
head(diet)
An alternative method to filter()
and
select()
is to use the base R function
subset()
, which takes the data frame as the first argument,
the subset condition as the second argument, and the columns to include
in the subset as the third argument.
Here, we are filtering for the sex “Female” and selecting the columns “mental_wellbeing” and “self_assessed_health”:
mental_health_f <- subset(health_survey_clean, sex == "Female", c(mental_wellbeing, self_assessed_health))
head(mental_health_f)
Another method for subsetting data frames in R is by using square
brackets []
. The syntax for subsetting a data frame is
data_frame[rows, columns]
, where rows specifies which rows
to include and columns specifies which columns to include
Try subsetting the data frame mental_health_f
so that it
only includes the first 10 rows:
mental_health_f_subset <- mental_health_f[1:10, ]
mental_health_f_subset
Here, we ask that the rows 1 to 10 are included, and we have left the second argument blank to indicate all columns are included.
This time, try subsetting the health_survey_clean
data
frame to include the first 10 rows and columns 3 to 5:
health_survey_subset <- health_survey_clean[1:10, 3:5]
health_survey_subset
To include all rows and all columns except the first, we use the
notation data_frame[ , -1]
:
health_survey_minus_first_column <- health_survey_clean[ , -1]
health_survey_minus_first_column
We can also subset non-incremental rows and columns using square
brackets. We do this by creating a vector of integers using the
c()
function, indicating which rows/columns to include.
For example, to subset columns 2 and 5 of the
health_survey_clean
data frame, we do the following:
health_survey_1_5 <- health_survey_clean[ , c(2,5)]
health_survey_1_5
Now that we can filter rows and select specific columns of our dataset, we can also summarize and aggregate this data to calculate summary statistics, such as the mean.
For example, we can group our dataset by year
using the
group_by()
function and calculate the mean annual values of
each numeric column using the summarize()
and
mean()
functions:
health_survey_summary <- health_survey_clean %>%
group_by(year) %>%
summarize(
mean_fruit_vegetables = mean(fruit_vegetable_consumption_mean_daily_portions, na.rm = TRUE),
mean_alcohol = mean(alcohol_consumption_mean_weekly_units, na.rm = TRUE),
mean_mental_wellbeing = mean(mental_wellbeing, na.rm = TRUE),
mean_self_assessed_health = mean(self_assessed_health, na.rm = TRUE)
)
health_survey_summary
The summarize()
function tells R to apply the summary
functions (mean()
in this case) to each group and to return
the results in a new data frame. By setting na.rm = TRUE
,
we ensure that missing values are ignored in the calculations (although
this is redundant here since we have already removed all missing
values).
Now, we have the mean values for fruit and vegetable consumption, alcohol consumption, mental wellbeing, and self-assessed health scores for each year, enabling us to make comparisons over time.
Next, we will learn how to add new columns to a data frame or modify
existing ones using the mutate()
function. This is a useful
tool for enhancing data interpretations, facilitating comparisons,
improving model performance, and deriving meaningful insights.
Let’s create a new variable that reflects an overall measure of health, combining diet, alcohol, mental wellbeing, and self-assessed health according to the formula:
overall_health
= 3 x mean_fruit_vegetables
- 0.5 x mean_alcohol
+ mean_mental_wellbeing
+
0.1 x mean_self_assessed_health
We can use the mutate()
function to do this, feeding it
the new variable overall_health
as an argument:
health_survey_new <- health_survey_clean %>%
mutate(overall_health = 3*fruit_vegetable_consumption_mean_daily_portions - 0.5*alcohol_consumption_mean_weekly_units + mental_wellbeing + 0.1*self_assessed_health)
head(health_survey_new)
Sometimes we may want to reshape our data to make it easier to
visualize or model. The tidyr
package from
tidyverse
provides functions such as
pivot_longer()
and pivot_wider()
to transform
data between wide and long formats.
health_survey
data
frame is in wide format.Let’s convert our health_survey_clean
data frame from
wide to long format using the pivot_longer()
function,
where cols
specifies which columns to pivot from wide to
long, names_to
specifies the name of the new column that
will contain the variable names, and values_to
specifies
the name of the new column that will contain the values:
health_survey_long <- health_survey_clean %>%
pivot_longer(cols = c(fruit_vegetable_consumption_mean_daily_portions, alcohol_consumption_mean_weekly_units, mental_wellbeing, self_assessed_health),
names_to = "measurement",
values_to = "value")
head(health_survey_long)
Note: all entries in the “value” column must be of the same type, so you cannot combine types factor and integer.
As you can see, the first 4 rows correspond with the same entry, while the next 4 rows correspond to a different entry, and so on.
This means that there are many more rows in a long format data frame, but fewer columns:
dim(health_survey_long)
## [1] 2480 5
We can convert our data frame from long to wide format using the
pivot_wider()
function. In this function,
names_from
specifies the column that contains the variable
names, and values_from
specifies the column that contains
the values:
health_survey_wide <- health_survey_long %>%
pivot_wider(names_from = measurement,
values_from = value)
Sometimes, you need to combine data from different sources into one data frame. There are several ways to do this in R.
Let’s start by importing sleep_info.csv
, which we
created in Exercise 1. We can do this using the read_csv()
function from readr
in tidyverse
:
sleep_info <- read_csv('data/sleep_info.csv')
## New names:
## Rows: 10 Columns: 3
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (3): ...1, year, hours_sleep_per_night
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
Let’s inspect the sleep_info
data frame, as importing a
CSV file might cause column names or data types to be different from
what you expect. CSV files can sometimes introduce extra whitespace in
column names or change data types.
Use the str()
and names()
functions to
check the data types and column names of the CSV file:
str(sleep_info)
## spc_tbl_ [10 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:10] 1 2 3 4 5 6 7 8 9 10
## $ year : num [1:10] 2021 2012 2020 2009 2010 ...
## $ hours_sleep_per_night: num [1:10] 6.5 8.1 7.7 7.9 7.5 6.9 7.8 7.4 5.6 7.1
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. year = col_double(),
## .. hours_sleep_per_night = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
names(sleep_info)
## [1] "...1" "year" "hours_sleep_per_night"
It looks like the CSV file has added a column “…1” to the data frame. Let’s remove this so it won’t cause us problems later on.
Subset the sleep_info
data frame to include all columns
expect the first one:
sleep_info <- sleep_info[, -1]
Let’s check the column names of sleep_info
to make sure
the first column “…1” has been removed:
names(sleep_info)
## [1] "year" "hours_sleep_per_night"
We want to combine our sleep_info
data frame with our
health_survey_clean
data frame. This can be done using the
merge()
function, which combines data frames by common
columns or row names (in this case, by year):
health_survey_merged <- merge(health_survey_clean, sleep_info, by = "year")
head(health_survey_merged)
We can see that we now have a new column “hours_sleep_per_night”
added to the original health_survey
dataset.
In addition to merge()
, the dplyr
package
provides more flexible join functions. This is useful if the datasets
you’re joining have non-matching keys.
The first kind of join we will look at is a left join. This returns
all rows from health_survey_clean
and all columns from both
data frames. Rows in health_survey_clean
with no match in
sleep_info
will have NA
values in the new
columns. If there are multiple matches, all combinations of the matches
are returned.
health_survey_left_join <- health_survey_clean %>%
left_join(sleep_info, by = "year")
head(health_survey_left_join)
by = "year"
specifies that we would like to join the
data frames by matching the “year” column.
A right join returns all rows from sleep_info_long
, and
all columns from both data frames. Rows in sleep_info_long
with no match in health_survey_clean
will have
NA
values in the new columns. If there are multiple
matches, all combinations of the matches are returned.
health_survey_right_join <- health_survey_clean %>%
right_join(sleep_info, by = "year")
tail(health_survey_right_join)
An inner join combines the data frames, keeping only rows that have
matching values in both data frames without adding any NA
values.
health_survey_inner_join <- health_survey_clean %>%
inner_join(sleep_info, by = "year")
head(health_survey_inner_join)
As you can see, there are no entries the years 2011, 2013, 2015, 2017, 2018, 2019, 2020, or 2023, as these years do not have matching values in both data frames.
unique(health_survey_inner_join$year)
## [1] 2021 2012 2009 2010 2022 2014 2016 2008
Performing an inner join operation does not introduce any
NA
values into the new data frame and instead removes any
rows that would have otherwise introduced missing values:
any(is.na(health_survey_inner_join))
## [1] FALSE
A full join keeps all rows from both data frames, filling in
NA
for any missing matches.
health_survey_full_join <- health_survey_clean %>%
full_join(sleep_info, by = "year")
tail(health_survey_full_join)
Now we have a good idea of how to summarize, aggregate, and transform our data. Move on to Exercise 4 to learn about plotting.