Exercise 4: Data Visualization and Plotting

The final exercise of this tutorial will be on creating plots to visualize data.

R has a number of built-in tools for creating many types of graphs, such as scatter plots, histograms, bar charts and box plots. More advanced plotting functions are available through the ggplot2 package, which is part of the tidyverse.

Scatter Plots

A scatter plot is very useful for visualizing the relationship between two continuous numeric variables, e.g. the relationship between alcohol consumption and mental wellbeing.

We can use the base R function plot(y ~ x, data=data_frame) to create a scatter plot, where:

- x is the independent variable
- y is the dependent variable
- data_frame is the data frame from which to extract the variables x and y

Let’s have a go at plotting mental wellbeing against mean weekly alcohol consumption:

plot(mental_wellbeing ~ alcohol_consumption_mean_weekly_units, data=health_survey_clean)

Each point on the graph represents one observation (one row of the data frame). The graph reveals a general trend: as alcohol consumption increases, mental wellbeing scores tend to decrease.
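One way to make a trend like this easier to judge is to overlay a fitted straight line and compute a correlation coefficient. The following is only a minimal sketch on simulated data, since the survey data are not reproduced here: the data frame toy and its columns units and wellbeing are made-up stand-ins, not the tutorial's variables, and the numbers are generated purely for illustration.

```r
# Simulated stand-in data (hypothetical; NOT the health survey)
set.seed(1)
toy <- data.frame(units = runif(100, min = 0, max = 40))
toy$wellbeing <- 60 - 0.5 * toy$units + rnorm(100, sd = 5)

# Scatter plot using the same formula interface as above
plot(wellbeing ~ units, data = toy)

# Fit a simple linear regression and draw the fitted line over the points
fit <- lm(wellbeing ~ units, data = toy)
abline(fit, col = "red", lwd = 2)

# Pearson correlation: a negative value indicates a decreasing trend
r <- cor(toy$units, toy$wellbeing)
print(r)
```

With the real data, the same calls should work after swapping in health_survey_clean and its column names, provided missing values have been handled (cor() returns NA if either variable contains NA, unless use = "complete.obs" is set).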

We can give our plot a title and rename the axes:

plot(mental_wellbeing ~ alcohol_consumption_mean_weekly_units, data=health_survey_clean,
     main="Relationship between alcohol consumption and mental wellbeing",
     xlab="Alcohol consumed (mean weekly units)", ylab="Mental wellbeing score")

We can also change the color of the data points by adding the argument col=:

plot(mental_wellbeing ~ alcohol_consumption_mean_weekly_units, data=health_survey_clean,
     main="Relationship between alcohol consumption and mental wellbeing",
     xlab="Mean weekly units of alcohol consumed", ylab="Mental wellbeing score",
     col="blue")

Histograms

A histogram is a useful way to visualize numerical data if we are interested in the overall distribution, such as the distribution of daily portions of fruits and vegetables consumed.

To plot a histogram use the hist() function:

hist(health_survey_clean$fruit_vegetable_consumption_mean_daily_portions,
     main = "Distribution of Daily Fruit and Vegetable Consumption",
     xlab = "Mean Daily Portions of Fruits and Vegetables",
     col = "purple")

Using data_frame$variable is another way to extract a variable from a data frame. In RStudio, a list of the available variables should appear once you type the $ sign.
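As a small illustration of $ extraction, the sketch below uses a made-up data frame (toy, with hypothetical columns portions and sex) rather than the survey data. It shows that $ returns the column as a plain vector, which is what hist() expects.

```r
# Small made-up data frame (hypothetical column names and values)
toy <- data.frame(portions = c(2, 3.5, 5, 1, 4),
                  sex = c("F", "M", "F", "M", "F"))

names(toy)          # lists the variables available in the data frame
toy$portions        # extracts one column as a numeric vector
toy[["portions"]]   # equivalent, with the column name given as a string

# hist() takes a numeric vector, so either form can be passed to it
hist(toy$portions, main = "Toy example", xlab = "Portions")
```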

Box Plots

Box plots are useful for visualizing the distribution of numeric variables, comparing differences between factor levels, and identifying outliers.

The box in the box plot represents the interquartile range (IQR), which contains the middle 50% of the data. The line inside the box shows the median, while the whiskers extend to the smallest and largest values within 1.5 times the IQR from the lower and upper quartiles. Points outside this range are considered outliers.
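To see where these numbers come from, the summary behind a box plot can be inspected with boxplot.stats(). The sketch below uses a short made-up vector, not the survey data. Note that boxplot() works with "hinges", which are very close to the quartiles but can differ slightly from what quantile() returns.

```r
# Made-up values; 30 lies far from the rest
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# The same summary that boxplot() draws (whisker rule: 1.5 times the IQR by default)
s <- boxplot.stats(x)

s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
          # here: 1.0 3.0 5.5 8.0 9.0
s$out     # values plotted as individual outlier points; here: 30

boxplot(x, horizontal = TRUE)
```

Here the box spans 3 to 8, so the IQR is 5 and the whiskers may reach at most 7.5 units beyond the box (from -4.5 to 15.5). The value 30 falls outside that range and is flagged as an outlier, while the upper whisker stops at 9, the largest value inside it.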

Let’s create a box plot to compare the distribution of self-assessed health scores by sex:

boxplot(self_assessed_health ~ sex, data = health_survey_clean,
        main = "Self-Assessed Health Scores by Sex",
        xlab = "Sex",
        ylab = "Self-Assessed Health Score",
        col = c("maroon", "orange"))

The box plots indicate that females and males have a fairly similar distribution of self-assessed health scores.

ggplot2

An alternative way to generate plots is to use the ggplot2 package from the tidyverse. The idea of ggplot2 is to build a plot layer by layer. This can lead to much more detailed plots than those offered in base R.

Let’s return to our box plot comparing self-assessed health scores by sex.

Start by specifying the data frame and the variables to plot on each axis:

ggplot(health_survey_clean, aes(x = sex, y = self_assessed_health))

We use aes() to describe how we want our variables to be mapped to visual properties (aesthetics) of geoms. Above, we have specified that we would like sex on the x-axis and self_assessed_health on the y-axis. On its own, this call draws only an empty set of axes, because no geoms have been added yet.

Then, add the geoms you want shown. There are many possible geoms, including:

geom_point(): Creates a scatter plot.
geom_smooth(): Adds a smoothed line of best fit.
geom_histogram(): Generates a histogram to show the distribution of a single continuous variable.
geom_boxplot(): Draws box plots to summarize the distribution of a continuous variable and identify outliers.
geom_line(): Creates a line plot.
geom_jitter(): Adds jittered points to a plot, preventing overplotting by spreading out overlapping data points slightly.

Learn more about the different geoms you can add to your plots by visiting the ggplot2 documentation at https://ggplot2.tidyverse.org/reference/.
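As an illustration of combining two of the geoms listed above, the following sketch draws a scatter plot with geom_point() and overlays a straight-line fit with geom_smooth(). It runs on simulated data; the data frame toy and its columns units and wellbeing are hypothetical stand-ins rather than the survey variables.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical; NOT the health survey)
set.seed(1)
toy <- data.frame(units = runif(100, min = 0, max = 40))
toy$wellbeing <- 60 - 0.5 * toy$units + rnorm(100, sd = 5)

# Each geom is added as a separate layer with +
p <- ggplot(toy, aes(x = units, y = wellbeing)) +
  geom_point() +                                # layer 1: the raw points
  geom_smooth(method = "lm", formula = y ~ x)   # layer 2: linear fit with confidence band

print(p)
```

Setting method = "lm" requests a straight line; leaving it out lets geom_smooth() choose a smoother automatically based on the size of the data.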

Let’s plot a box plot using geom_boxplot():

ggplot(health_survey_clean, aes(x = sex, y = self_assessed_health)) +
  geom_boxplot()

Enhance your plot by adding color coding according to sex. To do this, add the argument aes(fill = sex) to geom_boxplot:

ggplot(health_survey_clean, aes(x = sex, y = self_assessed_health)) +
  geom_boxplot(aes(fill = sex))
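It can help to compare mapping an aesthetic inside aes() with setting it to a fixed value outside aes(). The sketch below uses a made-up data frame (toy, with hypothetical columns group and score): in the first plot the fill varies with the data and a legend is added, while in the second every box gets the same colour and no legend appears.

```r
library(ggplot2)

# Made-up data: two groups of 20 simulated scores each
set.seed(42)
toy <- data.frame(group = rep(c("A", "B"), each = 20),
                  score = c(rnorm(20, mean = 5), rnorm(20, mean = 6)))

# Mapped: fill depends on the group variable, so ggplot2 adds a legend
p_mapped <- ggplot(toy, aes(x = group, y = score)) +
  geom_boxplot(aes(fill = group))

# Set: one constant fill colour for all boxes, no legend
p_fixed <- ggplot(toy, aes(x = group, y = score)) +
  geom_boxplot(fill = "orange")

print(p_mapped)
print(p_fixed)
```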

You can add multiple geoms to the same graph to enhance its visualization. For example, let’s add jittered points to our box plot to show the exact distribution of data points:

ggplot(health_survey_clean, aes(x = sex, y = self_assessed_health)) +
  geom_boxplot(aes(fill = sex)) +
  geom_jitter(color = "black", size = 1, alpha = 0.7, width = 0.2)

The color argument specifies the color of the points, size determines their size, alpha sets the transparency level, and width controls how far the points are spread horizontally around each category's position.

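Then, add the labels. Here we also set alpha = 0.5 in geom_boxplot() to make the boxes semi-transparent, so that the points stand out more clearly: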

ggplot(health_survey_clean, aes(x = sex, y = self_assessed_health)) +
  geom_boxplot(aes(fill = sex), alpha = 0.5) +
  geom_jitter(color = "black", size = 1, alpha = 0.7, width = 0.2) +
  labs(title = "Self-Assessed Health Scores by Sex",
       x = "Sex",
       y = "Self-Assessed Health Score")

Finally, apply a theme:

ggplot(health_survey_clean, aes(x = sex, y = self_assessed_health)) +
  geom_boxplot(aes(fill = sex), alpha = 0.5) +
  geom_jitter(color = "black", size = 1, alpha = 0.7, width = 0.2) +
  labs(title = "Self-Assessed Health Scores by Sex",
       x = "Sex",
       y = "Self-Assessed Health Score") +
  theme_minimal()

There are many themes to choose from, including:

theme_classic()
theme_gray()
theme_dark()
theme_minimal()
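Because a ggplot2 plot is an ordinary R object, one convenient way to compare themes is to store the plot in a variable and add a different theme each time. The sketch below does this with a made-up data frame (toy, with hypothetical columns group and score); base_plot is simply an illustrative name.

```r
library(ggplot2)

# Made-up data: two groups of 20 simulated scores each
set.seed(42)
toy <- data.frame(group = rep(c("A", "B"), each = 20),
                  score = c(rnorm(20, mean = 5), rnorm(20, mean = 6)))

# Build the plot once and keep it in a variable
base_plot <- ggplot(toy, aes(x = group, y = score)) +
  geom_boxplot(aes(fill = group))

# Add a different complete theme each time
print(base_plot + theme_classic())
print(base_plot + theme_dark())

# theme() adjusts individual elements on top of a complete theme
final_plot <- base_plot + theme_minimal() + theme(legend.position = "none")
print(final_plot)
```

Hiding the legend is a reasonable choice here because the x-axis already labels the groups.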

As you can see, using ggplot2 allows for greater creativity and control over the features of your plot.

Summary

In this tutorial, we covered essential operations in R, including project management in RStudio, working with data frames, and creating plots. I hope you found the tutorial both enjoyable and useful. Good luck and happy coding!