Exercise 3: Data Transformation

At this stage, we have a good understanding of our dataset’s characteristics, it has been cleaned, and it is now ready for analysis.

Filter and Select Data

Open “scottish-health-survey.R” in RStudio. The first thing we will learn is how to filter and select specific rows and columns according to certain criteria.

We can do this using the filter() and select() functions from dplyr. The filter() function is used to filter rows based on conditions, while the select() function is used to select specific columns.

For example, to filter for entries from the year 2019, we filter on the condition that the entry in the year column is equal to “2019”. If this condition is met, that row is included, otherwise it is excluded:

year_2019 <- health_survey_clean %>%
  filter(year == "2019")

head(year_2019)

Next, we can select only the “fruit_vegetable_consumption_mean_daily_portions” and “alcohol_consumption_mean_weekly_units” columns of our dataset, dropping all other columns:

diet <- health_survey_clean %>%
  select(fruit_vegetable_consumption_mean_daily_portions, alcohol_consumption_mean_weekly_units)

head(diet)

An alternative method to filter() and select() is to use the base R function subset(), which takes the data frame as the first argument, the subset condition as the second argument, and the columns to include in the subset as the third argument.

Here, we are filtering for the sex “Female” and selecting the columns “mental_wellbeing” and “self_assessed_health”:

mental_health_f <- subset(health_survey_clean, sex == "Female", c(mental_wellbeing, self_assessed_health))

head(mental_health_f)

Another method for subsetting data frames in R is by using square brackets []. The syntax for subsetting a data frame is data_frame[rows, columns], where rows specifies which rows to include and columns specifies which columns to include

Try subsetting the data frame mental_health_f so that it only includes the first 10 rows:

mental_health_f_subset <- mental_health_f[1:10, ]

mental_health_f_subset

Here, we ask that the rows 1 to 10 are included, and we have left the second argument blank to indicate all columns are included.

This time, try subsetting the health_survey_clean data frame to include the first 10 rows and columns 3 to 5:

health_survey_subset <- health_survey_clean[1:10, 3:5]

health_survey_subset

To include all rows and all columns except the first, we use the notation data_frame[ , -1]:

health_survey_minus_first_column <- health_survey_clean[ , -1]

health_survey_minus_first_column

We can also subset non-incremental rows and columns using square brackets. We do this by creating a vector of integers using the c() function, indicating which rows/columns to include.

For example, to subset columns 2 and 5 of the health_survey_clean data frame, we do the following:

health_survey_1_5 <- health_survey_clean[ , c(2,5)]

health_survey_1_5

Summarize and Aggregate Data

Now that we can filter rows and select specific columns of our dataset, we can also summarize and aggregate this data to calculate summary statistics, such as the mean.

For example, we can group our dataset by year using the group_by() function and calculate the mean annual values of each numeric column using the summarize() and mean() functions:

health_survey_summary <- health_survey_clean %>%
  group_by(year) %>%
  summarize(
    mean_fruit_vegetables = mean(fruit_vegetable_consumption_mean_daily_portions, na.rm = TRUE),
    mean_alcohol = mean(alcohol_consumption_mean_weekly_units, na.rm = TRUE),
    mean_mental_wellbeing = mean(mental_wellbeing, na.rm = TRUE),
    mean_self_assessed_health = mean(self_assessed_health, na.rm = TRUE)
  )

health_survey_summary

The summarize() function tells R to apply the summary functions (mean() in this case) to each group and to return the results in a new data frame. By setting na.rm = TRUE, we ensure that missing values are ignored in the calculations (although this is redundant here since we have already removed all missing values).

Now, we have the mean values for fruit and vegetable consumption, alcohol consumption, mental wellbeing, and self-assessed health scores for each year, enabling us to make comparisons over time.

Transform and Create New Variables

Next, we will learn how to add new columns to a data frame or modify existing ones using the mutate() function. This is a useful tool for enhancing data interpretations, facilitating comparisons, improving model performance, and deriving meaningful insights.

Let’s create a new variable that reflects an overall measure of health, combining diet, alcohol, mental wellbeing, and self-assessed health according to the formula:

overall_health = 3 x mean_fruit_vegetables - 0.5 x mean_alcohol + mean_mental_wellbeing + 0.1 x mean_self_assessed_health

We can use the mutate() function to do this, feeding it the new variable overall_health as an argument:

health_survey_new <- health_survey_clean %>%
  mutate(overall_health = 3*fruit_vegetable_consumption_mean_daily_portions - 0.5*alcohol_consumption_mean_weekly_units + mental_wellbeing + 0.1*self_assessed_health)

head(health_survey_new)

Sometimes we may want to reshape our data to make it easier to visualize or model. The tidyr package from tidyverse provides functions such as pivot_longer() and pivot_wider() to transform data between wide and long formats.

Wide Format: Standard “Excel” format, where each variable is in a separate column. The health_survey data frame is in wide format.
Long Format: Data where each row contains one variable per subject.

Let’s convert our health_survey_clean data frame from wide to long format using the pivot_longer() function, where cols specifies which columns to pivot from wide to long, names_to specifies the name of the new column that will contain the variable names, and values_to specifies the name of the new column that will contain the values:

health_survey_long <- health_survey_clean %>%
  pivot_longer(cols = c(fruit_vegetable_consumption_mean_daily_portions, alcohol_consumption_mean_weekly_units, mental_wellbeing, self_assessed_health),
               names_to = "measurement",
               values_to = "value")

head(health_survey_long)

Note: all entries in the “value” column must be of the same type, so you cannot combine types factor and integer.

As you can see, the first 4 rows correspond with the same entry, while the next 4 rows correspond to a different entry, and so on.

This means that there are many more rows in a long format data frame, but fewer columns:

dim(health_survey_long)

## [1] 2480    5

We can convert our data frame from long to wide format using the pivot_wider() function. In this function, names_from specifies the column that contains the variable names, and values_from specifies the column that contains the values:

health_survey_wide <- health_survey_long %>%
  pivot_wider(names_from = measurement,
              values_from = value)

Merge and Join Datasets

Sometimes, you need to combine data from different sources into one data frame. There are several ways to do this in R.

Let’s start by importing sleep_info.csv, which we created in Exercise 1. We can do this using the read_csv() function from readr in tidyverse:

sleep_info <- read_csv('data/sleep_info.csv')

## New names:
## Rows: 10 Columns: 3
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (3): ...1, year, hours_sleep_per_night
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

Let’s inspect the sleep_info data frame, as importing a CSV file might cause column names or data types to be different from what you expect. CSV files can sometimes introduce extra whitespace in column names or change data types.

Use the str() and names() functions to check the data types and column names of the CSV file:

str(sleep_info)

## spc_tbl_ [10 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1                 : num [1:10] 1 2 3 4 5 6 7 8 9 10
##  $ year                 : num [1:10] 2021 2012 2020 2009 2010 ...
##  $ hours_sleep_per_night: num [1:10] 6.5 8.1 7.7 7.9 7.5 6.9 7.8 7.4 5.6 7.1
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   year = col_double(),
##   ..   hours_sleep_per_night = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

names(sleep_info)

## [1] "...1"                  "year"                  "hours_sleep_per_night"

It looks like the CSV file has added a column “…1” to the data frame. Let’s remove this so it won’t cause us problems later on.

Subset the sleep_info data frame to include all columns expect the first one:

sleep_info <- sleep_info[, -1]

Let’s check the column names of sleep_info to make sure the first column “…1” has been removed:

names(sleep_info)

## [1] "year"                  "hours_sleep_per_night"

We want to combine our sleep_info data frame with our health_survey_clean data frame. This can be done using the merge() function, which combines data frames by common columns or row names (in this case, by year):

health_survey_merged <- merge(health_survey_clean, sleep_info, by = "year")

head(health_survey_merged)

We can see that we now have a new column “hours_sleep_per_night” added to the original health_survey dataset.

In addition to merge(), the dplyr package provides more flexible join functions. This is useful if the datasets you’re joining have non-matching keys.

Left Join

The first kind of join we will look at is a left join. This returns all rows from health_survey_clean and all columns from both data frames. Rows in health_survey_clean with no match in sleep_info will have NA values in the new columns. If there are multiple matches, all combinations of the matches are returned.

health_survey_left_join <- health_survey_clean %>%
  left_join(sleep_info, by = "year")

head(health_survey_left_join)

by = "year" specifies that we would like to join the data frames by matching the “year” column.

Right Join

A right join returns all rows from sleep_info_long, and all columns from both data frames. Rows in sleep_info_long with no match in health_survey_clean will have NA values in the new columns. If there are multiple matches, all combinations of the matches are returned.

health_survey_right_join <- health_survey_clean %>%
  right_join(sleep_info, by = "year")

tail(health_survey_right_join)

Inner Join

An inner join combines the data frames, keeping only rows that have matching values in both data frames without adding any NA values.

health_survey_inner_join <- health_survey_clean %>%
  inner_join(sleep_info, by = "year")

head(health_survey_inner_join)

As you can see, there are no entries the years 2011, 2013, 2015, 2017, 2018, 2019, 2020, or 2023, as these years do not have matching values in both data frames.

unique(health_survey_inner_join$year)

## [1] 2021 2012 2009 2010 2022 2014 2016 2008

Performing an inner join operation does not introduce any NA values into the new data frame and instead removes any rows that would have otherwise introduced missing values:

any(is.na(health_survey_inner_join))

## [1] FALSE

Full Join

A full join keeps all rows from both data frames, filling in NA for any missing matches.

health_survey_full_join <- health_survey_clean %>%
  full_join(sleep_info, by = "year")

tail(health_survey_full_join)

Now we have a good idea of how to summarize, aggregate, and transform our data. Move on to Exercise 4 to learn about plotting.