This exercise is designed to familiarize you with the RStudio environment and basic operations in R. For a more comprehensive introduction to R, you can refer to the official R documentation at CRAN R Manuals or the tutorials available on the RStudio Education website.
Go ahead and open RStudio. It should look something like this:
Image source: RStudio User Guide Website
The R console (bottom left) is where the R code is actually executed. You can interact with R directly by typing commands and seeing their output. However, the code you write will not be saved, so it is not recommended to write all your code here.
The source editor (top left) is where you should write and edit your R scripts and markdown documents. Files written here can be saved and returned to later. To run each line of code from the R script, select the line and press Ctrl + Enter (Windows) or Cmd + Enter (macOS). This sends the code to the R console to be executed.
The environment (top right) shows objects that are currently in your workspace (e.g., data frames, variables, functions). These can be objects that you created from scratch or imported.
The output pane (bottom right) contains several tabs. The Plots tab displays plots generated by your R code. The Files tab lets you browse your project directory. The Packages tab allows you to manage R packages. The Help tab provides access to R documentation and help files. If you are unsure how to use a function or package, simply type its name into the Help tab to receive detailed usage information.
It is important to keep all files organised for efficient data management and project workflow. To get started, create a new folder named “data-handling-in-r”. This will be our working directory for this tutorial.
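If you are unsure which folder R currently treats as the working directory, the short sketch below may help. The path in the commented-out line is a hypothetical placeholder, not a location this tutorial defines; replace it with wherever you created the folder.

```r
# Show the folder R currently treats as the working directory.
current_dir <- getwd()
print(current_dir)

# Hypothetical path: replace it with the location of your own folder,
# then remove the leading "#" to run the line.
# setwd("~/data-handling-in-r")
```

Relative paths used in your code are resolved from this working directory.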
As mentioned above, although you can use the R console for writing code, it is better to write code in R Scripts in the source panel. To create a new R Script, go to the first icon in the top left corner of RStudio, then select “R Script”.
Name your R Script “exercise-1.R” and save it to your working directory.
An alternative way of documenting code is by using R Markdown. R Markdown is useful for integrating code, text, and figures within a single document, making it easier to share and reproduce work.
To create a new R Markdown file, go to the same icon in the top left corner, this time selecting “R Markdown”.
You will see this dialog, allowing you to set the title, author, date, and default output format of your R Markdown document. Enter a descriptive title for your project (note that this is different from the file name). Then, click OK. Save the file to your working directory as “exercise-1.Rmd”.
Within your R Markdown document, you can use Markdown syntax for formatting text and LaTeX for mathematical expressions. R code is embedded within code chunks that open with triple backticks followed by {r} and close with triple backticks. For example:
print("Hello world!")
[1] "Hello world!"
You can use the keyboard shortcut Ctrl + Alt + I (Windows) or Cmd + Option + I (macOS) to quickly insert an R code chunk.
Click the play button at the top right of the code chunk to run the code, or select the line of code and press Ctrl + Enter (Windows) or Cmd + Enter (macOS).
You can use R to do basic operations that you would do on a calculator. For example, open up “exercise-1.R” and type the following lines of code:
5 + 5 # addition
[1] 10
4 * 3 # multiplication
[1] 12
10 / 2 # division
[1] 5
2^5 # exponent
[1] 32
13 %% 4 # modulus
[1] 1
10 %/% 3 # integer division
[1] 3
When you click run, you’ll see the outputs of these calculations in the console.
Note: Using # allows you to write comments, which are not interpreted by the R console.
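As a small illustration of how integer division and the modulus fit together, the sketch below rebuilds a number from its quotient and remainder. The object names are chosen here for illustration only; assignment with <- is explained later in this exercise.

```r
dividend <- 10
divisor <- 3

quotient <- dividend %/% divisor    # integer division: 3
remainder <- dividend %% divisor    # modulus (remainder): 1

# Quotient times divisor plus remainder gives back the dividend.
quotient * divisor + remainder == dividend  # TRUE
```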
Try the following examples:
7 * (4 - 2)
[1] 14
sqrt(100)
[1] 10
abs(-6)
[1] 6
abs(7)
[1] 7
In R, you can use objects to store information. We use the assignment operator <- to assign values to objects. Try out the following:
x <- log(2^3)
Here, we assign to the object x the result of the operation \(\log(2^3)\). We can see the actual value by calling the object x:
x
[1] 2.079442
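Note that log() in R computes the natural logarithm by default. The following sketch shows how the result relates to exp() and to logarithms in other bases; it uses only base R functions.

```r
x <- log(2^3)        # natural logarithm of 8

exp(x)               # undoes the natural log, giving approximately 8
log2(2^3)            # base-2 logarithm: 3
log10(1000)          # base-10 logarithm: 3
log(2^3, base = 2)   # the base can also be set explicitly
```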
You can then use the object to do subsequent computations, e.g.,
x*5
[1] 10.39721
If you assign a different value to the same object name, you will replace the original object and its value will be lost. So, be careful in naming your objects!
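A minimal sketch of this overwriting behaviour, using a made-up object name:

```r
score <- 10
score <- score * 2   # the previous value (10) is replaced
score                # 20

# To keep the original value, store the result under a new name instead.
original_score <- 10
doubled_score <- original_score * 2
```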
Note: R is case sensitive, so object x is not the same as object X. You will get an error if you use the wrong case:
X
## Error in eval(expr, envir, enclos): object 'X' not found
You can assign objects a value of any type, not just numbers. For example, you can store a string of characters by enclosing it in quotation marks:
course <- "Data Handling in R"
course
[1] "Data Handling in R"
However, you can’t mix types when performing arithmetic operations:
x + course
## Error in x + course: non-numeric argument to binary operator
You can ask R what type a certain object’s value is by using the class() function:
class(x)
[1] "numeric"
class(course)
[1] "character"
class(sqrt)
[1] "function"
Comparison operators are used to compare values. These are also called conditions.
Try the following examples:
5 > 2 # greater than
[1] TRUE
6 < 4 # less than
[1] FALSE
11 >= 15 # greater than or equal to
[1] FALSE
10 <= 10 # less than or equal to
[1] TRUE
2^3 == 8 # equal to
[1] TRUE
6/2 != 4 # not equal to
[1] TRUE
You can assign conditions to an object:
op <- 2^3 == 8
class(op)
[1] "logical"
As you can see, the outputs of these operations are all TRUE/FALSE (boolean) values. In R, these objects are of class logical.
If you try to perform arithmetic operations on logicals, TRUE becomes 1 and FALSE becomes 0.
TRUE + 10
[1] 11
FALSE - 10
[1] -10
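One practical consequence, sketched below, is that adding up conditions counts how many of them are TRUE. The object name is illustrative only.

```r
# Each condition counts as 1 if TRUE and 0 if FALSE.
conditions_met <- (5 > 2) + (6 < 4) + (10 <= 10)
conditions_met   # 2
```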
You can use logical operators to combine conditional statements.
For example, x & y returns TRUE if both x and y are TRUE. If either x or y is FALSE, the operation will return FALSE. This operator & is called the element-wise logical AND operator.
In contrast, x | y returns TRUE if either x is TRUE or y is TRUE. Therefore, the operation will only return FALSE if both x and y are FALSE. This operator | is called the element-wise logical OR operator.
x <- TRUE
y <- FALSE
x & y # logical AND operator
[1] FALSE
x | y # logical OR operator
[1] TRUE
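The term "element-wise" becomes clearer when the operators are applied to several values at once. The sketch below uses c() to combine values into vectors (vectors are introduced later in this exercise); the names a and b are illustrative.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

# Each position in a is compared with the same position in b.
a & b   # TRUE FALSE FALSE FALSE
a | b   # TRUE  TRUE  TRUE FALSE
```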
Base R contains many useful tools for data analysis (such as those seen so far in this tutorial), but there are many additional functionalities that users might need, such as advanced data visualization, specialized statistical methods, or handling specific types of data. There are packages available in R that contain collections of functions, data, and compiled code that enhance the functionality of base R, making our life a bit easier.
Thousands of packages are available on CRAN (Comprehensive R Archive Network) and other repositories, each designed for a specific task.
In this tutorial, we will be using tidyverse, which is a collection of packages designed for data science. These include:
readr: used for reading rectangular data into R (e.g. csv, tsv and fwf)
tibble: a user-friendly way to use data frames
dplyr: provides functions for data manipulation
ggplot2: used for data visualization plots
We will be coming back to these later on. For now, we need to first install and load the tidyverse package. Write the following code in your R script:
# Install tidyverse
install.packages("tidyverse")
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
# Load tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
When you load tidyverse, it automatically loads the packages within it, including readr, tibble, dplyr, and ggplot2.
An alternative way to install a package is through the output pane (bottom right in RStudio): go to the “Packages” tab, then click “Install” in the top left corner. In the pop-up, type “tidyverse” under “Packages” (it should come up as you are typing), then click Install. Then load the package using library(tidyverse) in the R script as above.
You can follow these steps to install other packages in the future, but we’ll just stick with tidyverse for now.
Note: you only need to install a package onto your local computer once, but you need to load the package every time you want to use it.
Conventionally, you should load all required packages at the top of the R script, before any lines of code.
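Because installation is only needed once, one possible pattern is to install a package only when it is missing. The helper ensure_installed() below is a hypothetical function written for this illustration, not part of R or tidyverse; it relies on the base functions requireNamespace() and install.packages().

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available afterwards.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call does not install anything.
ensure_installed("stats")

# Remove the leading "#" to install tidyverse if it is missing.
# ensure_installed("tidyverse")
```

You would still call library(tidyverse) afterwards to load the package in each session.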
A data frame is a list of vectors, all of the same length. Data frames in R are similar to spreadsheets, where each column can contain different types of data (numeric, character, factor, etc.), and each row represents an instance or observation.
We can create a data frame by first creating vectors then combining them.
In R, vectors are basic data structures that hold elements of the same type. We use the function c() to create vectors, where the “c” stands for combine.
Go ahead and create two vectors called “year” and “hours_sleep_per_night”, each containing a series of ten values:
year <- c(2021, 2012, 2020, 2009, 2010, 2022, 2014, 2023, 2016, 2008)
hours_sleep_per_night <- c(6.5, 8.1, 7.7, 7.9, 7.5, 6.9, 7.8, 7.4, 5.6, 7.1)
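Before combining the vectors, it can help to check them. The sketch below repeats the two vectors so that it runs on its own, then uses the base functions length(), mean(), and class(); the last line illustrates that mixing types in c() converts every element to a single common type.

```r
year <- c(2021, 2012, 2020, 2009, 2010, 2022, 2014, 2023, 2016, 2008)
hours_sleep_per_night <- c(6.5, 8.1, 7.7, 7.9, 7.5, 6.9, 7.8, 7.4, 5.6, 7.1)

length(year)                  # number of elements: 10
mean(hours_sleep_per_night)   # average of the values: 7.25

# A vector holds a single type, so mixed inputs are all converted to character.
class(c(1, "a", TRUE))        # "character"
```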
Next, we can combine these vectors into a data frame using the function data.frame():
sleep_info <- data.frame(year, hours_sleep_per_night)
df [10 x 2] means that we have created a data frame with 10 rows and 2 columns.
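A few base R functions can confirm the shape and structure of the data frame. The sketch below rebuilds sleep_info so that it runs on its own; it is one way to inspect the object, not the only one.

```r
year <- c(2021, 2012, 2020, 2009, 2010, 2022, 2014, 2023, 2016, 2008)
hours_sleep_per_night <- c(6.5, 8.1, 7.7, 7.9, 7.5, 6.9, 7.8, 7.4, 5.6, 7.1)
sleep_info <- data.frame(year, hours_sleep_per_night)

dim(sleep_info)        # rows and columns: 10 2
is.list(sleep_info)    # a data frame is a list of vectors: TRUE
str(sleep_info)        # compact summary of each column and its type

# The $ operator extracts a single column as a vector.
sleep_info$hours_sleep_per_night[1:3]   # 6.5 8.1 7.7
```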
An alternative, more modern way of creating a data frame is to use the tibble package, which is part of tidyverse. Tibbles are a modern re-imagining of the data frame. They offer more user-friendly printing methods, which make them easier to use with large datasets containing complex objects.
Let’s go ahead and convert the sleep_info data frame into a tibble using as_tibble():
sleep_info <- as_tibble(sleep_info)
Print the newly generated tibble to see it displayed:
year | hours_sleep_per_night |
---|---|
2021 | 6.5 |
2012 | 8.1 |
2020 | 7.7 |
2009 | 7.9 |
2010 | 7.5 |
2022 | 6.9 |
2014 | 7.8 |
2023 | 7.4 |
2016 | 5.6 |
2008 | 7.1 |
We will be using sleep_info again in exercise 3. You can also create a new tibble from column vectors with tibble():
eg_tibble <- tibble(x = 1:5, y = 1)
Print eg_tibble to see it displayed:
x | y |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
4 | 1 |
5 | 1 |
Here 1:5 means a sequence of numbers from 1 to 5.
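Notice that y = 1 supplies a single value, which is repeated to fill all five rows. The sketch below shows the same two ideas in base R, using data.frame() rather than tibble() so that it runs without extra packages; the object name recycled is illustrative.

```r
# The colon operator builds the integer sequence 1, 2, 3, 4, 5.
identical(1:5, seq_len(5))   # TRUE

# A single value is repeated (recycled) to match the longer column.
recycled <- data.frame(x = 1:5, y = 1)
recycled$y                   # 1 1 1 1 1
```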
Once a data frame has been created, you can add or transform its columns. This is performed using the mutate() function from the dplyr package (also part of tidyverse).
Try adding a column \(z = x^2 + y\) to eg_tibble:
eg_tibble <- eg_tibble %>%
  mutate(z = x^2 + y)
x | y | z |
---|---|---|
1 | 1 | 2 |
2 | 1 | 5 |
3 | 1 | 10 |
4 | 1 | 17 |
5 | 1 | 26 |
The pipe operator (%>%) takes the value on its left and passes it as the first argument to the function on its right. In this case, our data frame eg_tibble is passed as the first argument to mutate(). We will use the pipe operator more in the following exercises, as it helps make operations more readable and concise.
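To make the equivalence concrete, the sketch below writes the same mutate() step with and without the pipe and checks that the results match. It assumes the dplyr package is installed (it comes with tidyverse) and uses a plain data frame named eg, chosen here for illustration.

```r
library(dplyr)

eg <- data.frame(x = 1:5, y = 1)

# With the pipe: eg becomes the first argument of mutate().
piped <- eg %>%
  mutate(z = x^2 + y)

# Without the pipe: the data frame is passed explicitly.
nested <- mutate(eg, z = x^2 + y)

identical(piped, nested)   # TRUE
```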
We will often have to export data frames to CSV files after working with them. This is easily done in R using write.csv().
Let’s try an example and export sleep_info to a CSV file.
First, create a subdirectory in your working directory called “data”:
dir.create("data")
## Warning in dir.create("data"): 'data' already exists
Then, export the sleep_info data frame to a CSV file, saving it to the “data” subdirectory:
write.csv(sleep_info, file = "data/sleep_info.csv")
You should now be able to see “sleep_info.csv” in the “data-handling-in-r/data” subdirectory.
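By default, write.csv() also writes the row names as an extra first column. If you do not want that column, one option is to set row.names = FALSE. The sketch below uses a shortened copy of the data and a temporary file so that it runs on its own; the object names are illustrative.

```r
# A shortened stand-in for sleep_info, for illustration only.
sleep_small <- data.frame(
  year = c(2021, 2012, 2020),
  hours_sleep_per_night = c(6.5, 8.1, 7.7)
)

# Write to temporary files so nothing in your project is overwritten.
with_names <- file.path(tempdir(), "sleep_with_row_names.csv")
without_names <- file.path(tempdir(), "sleep_without_row_names.csv")

write.csv(sleep_small, file = with_names)                        # default
write.csv(sleep_small, file = without_names, row.names = FALSE)  # no row names

# Reading the files back shows the extra column in the default version.
ncol(read.csv(with_names))       # 3
ncol(read.csv(without_names))    # 2
names(read.csv(without_names))   # "year" "hours_sleep_per_night"
```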