The final exercise of this tutorial is on creating plots to visualize data.
R has a number of built-in tools for creating many types of graphs,
such as scatter plots, histograms, bar charts and box plots, as well as
some more advanced plotting functions in the ggplot2 package from tidyverse.
A scatter plot is very useful for visualizing the relationship between two continuous numeric variables, e.g. the relationship between alcohol consumption and mental wellbeing.
We can use the base R function plot(y ~ x, data=data_frame) to create a scatter plot, where:
- x is the independent variable
- y is the dependent variable
- data_frame is the data frame from which to extract the variables x and y
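To make the formula interface concrete, here is a minimal sketch on a small made-up data frame. The names toy, hours_studied and exam_score are assumptions for illustration only and are not part of the health survey data.

```r
# Toy data frame; the column names are invented for this illustration.
toy <- data.frame(
  hours_studied = c(1, 2, 3, 4, 5, 6),
  exam_score    = c(52, 55, 61, 64, 70, 75)
)

# Formula interface: the dependent variable (y) goes on the left of ~,
# the independent variable (x) on the right. Both names are looked up
# as columns of the data frame given in data=.
plot(exam_score ~ hours_studied, data = toy)
```

The variable on the right of the ~ is drawn on the horizontal axis and the variable on the left on the vertical axis.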
Let’s have a go at plotting mental wellbeing against alcohol consumption:
plot(mental_wellbeing ~ alcohol_consumption_mean_weekly_units, data=health_survey_clean)
Each point on the graph represents a sample. The graph reveals a general trend: as alcohol consumption increases, mental wellbeing scores tend to decrease.
We can give our plot a title and rename the axes:
plot(mental_wellbeing ~ alcohol_consumption_mean_weekly_units, data=health_survey_clean,
main="Relationship between alcohol consumption and mental wellbeing",
xlab="Alcohol consumed (mean weekly units)", ylab="Mental wellbeing score")
We can also change the color of the data points by adding the argument col=:
plot(mental_wellbeing ~ alcohol_consumption_mean_weekly_units, data=health_survey_clean,
main="Relationship between alcohol consumption and mental wellbeing",
xlab="Mean weekly units of alcohol consumed", ylab="Mental wellbeing score",
col="blue")
A histogram is a useful way to visualize numerical data if we are interested in the overall distribution, such as the distribution of daily portions of fruits and vegetables consumed.
To plot a histogram, use the hist() function:
hist(health_survey_clean$fruit_vegetable_consumption_mean_daily_portions,
main = "Distribution of Daily Fruit and Vegetable Consumption",
xlab = "Mean Daily Portions of Fruits and Vegetables",
col = "purple")
Using data_frame$variable is another way to extract a variable from a data frame. In RStudio, a list of the variables in the data frame should appear after you type the $ sign.
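As a small sketch of how $ extraction feeds into hist(), the example below uses a made-up data frame. The names toy and portions are hypothetical and stand in for the survey columns.

```r
# Toy data frame; names are invented for this illustration.
toy <- data.frame(
  id       = 1:8,
  portions = c(1.5, 2, 2.5, 3, 3, 3.5, 4, 6)
)

# $ pulls one column out of the data frame as a vector.
# Double square brackets with the column name do the same thing.
portions_dollar  <- toy$portions
portions_bracket <- toy[["portions"]]
identical(portions_dollar, portions_bracket)

# hist() takes that vector directly. It also returns the bin counts,
# which we can capture for inspection.
h <- hist(toy$portions, main = "Toy example", xlab = "Portions")
sum(h$counts)  # equals the number of values, since each falls in one bin
```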
Box plots are useful for visualizing the distribution of numeric variables, comparing differences between factor levels, and identifying outliers.
The box in the box plot represents the interquartile range (IQR), which contains the middle 50% of the data. The line inside the box shows the median, while the whiskers extend to the smallest and largest values within 1.5 times the IQR from the lower and upper quartiles. Points outside this range are considered outliers.
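The whisker and outlier rule described above can be checked numerically. This sketch uses the base R function boxplot.stats() on a made-up vector; note that it computes the box edges as Tukey's hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Made-up values with one unusually large observation.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

s <- boxplot.stats(x)

# s$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker.
s$stats

# Width of the box (upper hinge minus lower hinge).
box_width <- s$stats[4] - s$stats[2]

# Points further than 1.5 box widths above the upper hinge are flagged.
upper_fence <- s$stats[4] + 1.5 * box_width
upper_fence

# s$out lists the values drawn as individual outlier points.
s$out
```

Here the box spans 3 to 8, so the upper fence is 8 + 1.5 * 5 = 15.5. The value 30 lies beyond it and is reported as an outlier, while the upper whisker stops at 9, the largest value inside the fence.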
Let’s create a box plot to compare the distribution of self-assessed health scores by sex:
boxplot(self_assessed_health ~ sex, data = health_survey_clean,
main = "Self-Assessed Health Scores by Sex",
xlab = "Sex",
ylab = "Self-Assessed Health Score",
col = c("maroon", "orange"))
The box plots indicate that females and males have a fairly similar distribution of self-assessed health scores.
An alternative way to generate plots is to use the ggplot2 package from tidyverse. The idea of ggplot2 is to build a plot layer by layer. This can lead to much more detailed plots than those offered in base R.
Let’s return to our box plot comparing self-assessed health scores by sex.
ggplot(health_survey_clean, aes(x = sex, y = self_assessed_health))
We use aes() to describe how we want our variables to be mapped to visual properties (aesthetics) of geoms. Above, we have specified that we would like sex on the x-axis and self_assessed_health on the y-axis. On its own, this call draws only an empty set of axes; the data appear once we add a geom.
Commonly used geoms include:
- geom_point(): Creates a scatter plot.
- geom_smooth(): Adds a smoothed line of best fit.
- geom_histogram(): Generates a histogram to show the distribution of a single continuous variable.
- geom_boxplot(): Draws box plots to summarize the distribution of a continuous variable and identify outliers.
- geom_line(): Creates a line plot.
- geom_jitter(): Adds jittered points to a plot, preventing overplotting by spreading out overlapping data points slightly.
Learn more about the different geoms you can add to your plots by visiting the ggplot2 documentation at https://ggplot2.tidyverse.org/reference/.
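The layer-by-layer idea can be sketched on a small made-up data frame. The names toy, hours_studied and exam_score are assumptions for illustration, and the example assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Toy data frame; the column names are invented for this illustration.
toy <- data.frame(
  hours_studied = c(1, 2, 3, 4, 5, 6),
  exam_score    = c(52, 55, 61, 64, 70, 75)
)

# Step 1: data and aesthetic mappings only.
# Printing this object would draw axes but no data.
p_base <- ggplot(toy, aes(x = hours_studied, y = exam_score))

# Step 2: add a geom layer with + to draw the points.
p_points <- p_base + geom_point()

# Step 3: add a second layer on top, here a line connecting the points.
p_both <- p_points + geom_line()

print(p_both)
```

Storing the intermediate plots in variables is optional; it is done here only to show that each + adds one more layer to the same plot.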
Let’s draw a box plot using geom_boxplot():
ggplot(health_survey_clean, aes(x = sex, y = self_assessed_health)) +
geom_boxplot()
Enhance your plot by adding color coding according to sex. To do this, add the argument aes(fill = sex) to geom_boxplot():
ggplot(health_survey_clean, aes(x = sex, y = self_assessed_health)) +
geom_boxplot(aes(fill = sex))
You can add multiple geoms to the same graph to enhance its visualization. For example, let’s add jittered points to our box plot to show the exact distribution of data points:
ggplot(health_survey_clean, aes(x = sex, y = self_assessed_health)) +
geom_boxplot(aes(fill = sex)) +
geom_jitter(color = "black", size = 1, alpha = 0.7, width = 0.2)
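The color argument specifies the color of the points, size determines their size, alpha sets the transparency level, and width controls how far the points are spread horizontally around each group's position on the x-axis.
We can add a title and axis labels with labs(), and make the boxes semi-transparent by setting alpha in geom_boxplot():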
ggplot(health_survey_clean, aes(x = sex, y = self_assessed_health)) +
geom_boxplot(aes(fill = sex), alpha = 0.5) +
geom_jitter(color = "black", size = 1, alpha = 0.7, width = 0.2) +
labs(title = "Self-Assessed Health Scores by Sex",
x = "Sex",
y = "Self-Assessed Health Score")
We can also change the overall appearance of the plot by adding a theme, such as theme_minimal():
ggplot(health_survey_clean, aes(x = sex, y = self_assessed_health)) +
geom_boxplot(aes(fill = sex), alpha = 0.5) +
geom_jitter(color = "black", size = 1, alpha = 0.7, width = 0.2) +
labs(title = "Self-Assessed Health Scores by Sex",
x = "Sex",
y = "Self-Assessed Health Score") +
theme_minimal()
There are many themes to choose from, including:
theme_classic()
theme_gray()
theme_dark()
theme_minimal()
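One convenient way to compare themes is to store the plot in a variable and add a different theme each time. This is only a sketch on made-up data; the names toy, group and score are hypothetical, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

# Toy data frame with two groups; names are invented for this illustration.
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  score = c(3, 4, 4, 5, 2, 2, 3, 5, 4, 4)
)

# Build the plot once and keep it in a variable.
p <- ggplot(toy, aes(x = group, y = score)) +
  geom_boxplot(aes(fill = group))

# Adding a theme returns a new plot; p itself is left unchanged.
p_classic <- p + theme_classic()
p_dark    <- p + theme_dark()

print(p_classic)
print(p_dark)
```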
As you can see, using ggplot2 allows for greater creativity and control over the features of your plot.
In this tutorial, we covered essential operations in R, including project management in RStudio, working with data frames, and creating plots. I hope you found the tutorial both enjoyable and useful. Good luck and happy coding!