The purpose of this tutorial is to show the very basics of the R language so that participants who have never used R before can complete the first assignment in this workshop. For information on the thousands of other features of R, see the suggested resources below.
In this tutorial, R code that you would enter in your script file or in the command line is preceded by the > character, and by + if the current line of code continues from a previous line. You do not need to type these characters in your own code. Note that copying and pasting code from the PDF version of this tutorial may lead to errors when trying to execute code. Please copy code from the R script used to produce this tutorial; this script can be found here.
The most recent version of R for all operating systems is always located at http://www.r-project.org/index.html. Go directly to https://cloud.r-project.org, and download the latest version of R (4.0.5 as of April 2021) for your operating system. Then, install R.
To operate R, you should rely on writing R scripts. We will write these scripts in RStudio. Download the latest version of RStudio (1.4) from http://www.rstudio.org. Then, install it on your computer. Some text editors also offer integration with R, so that you can send code directly to R. RStudio is generally the best solution for running R and maintaining a reproducible workflow.
Lastly, you may want to install LaTeX in order to generate notebooks and reports in PDF format directly from within RStudio. To do this, please use one of these two options:
Upon opening the first time, RStudio will look similar to the screenshot below.
The window on the left is named “Console”. The spot next to the blue “greater than” sign > is the “command line”. You can tell R to perform actions by typing commands into this command line. We will rarely do this and will operate R through script files instead.
In the following sections, I walk you through some basic R commands. In this tutorial and most other materials you will see in this workshop, R commands and the resulting R output will appear in light grey boxes. Output in this tutorial is always preceded by two hash signs (##).
To begin, see how R responds to commands. If you type a simple mathematical operation, R will return its result(s):
1 + 1
## [1] 2
2 * 3
## [1] 6
10 / 3
## [1] 3.333333
R will return error messages when a command is incorrect or when it cannot execute a command. Often, these error messages are informative. You can often get more information by simply searching for an error message on the web. Here, I try to add 1 and the letter a, which does not (yet) make sense because I haven’t defined an object a yet, and numbers and letters cannot be added:
1 + a
## Error in eval(expr, envir, enclos): object 'a' not found
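A minimal sketch of both halves of that explanation, using only base R: once an object a holds a number, the addition works, whereas adding a number to a character string still fails. I use tryCatch() and conditionMessage() here only so that the error message is stored and printed instead of stopping the script; the object name msg is just an illustration.

```r
# Define a numeric object named a; now the addition is well defined
a <- 2
1 + a

# Adding a number and a character string still fails.
# tryCatch() captures the error so that the script keeps running.
msg <- tryCatch(
  1 + "a",
  error = function(e) conditionMessage(e)
)
msg
```

The first command returns 3, and msg contains the text of the error that R would otherwise have printed to the console.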
As your code becomes more complex, you may forget to complete a particular command. For example, here I want to add 1 and the product of 2 and 4. But unless I add the closing parenthesis at the end of the line, or in the immediately following line, this code won’t execute:
1 + (2 * 4
)
## [1] 9
While executing this command and looking at the console, you will notice that the little > on the left changes into a +. This means that R is offering you a new line to finish the original command. If I type a right parenthesis, R returns the result of my operation.
Many useful and important functions in R are provided via packages that need to be installed separately. You can do this by using the Package Installer in the menu (Packages & Data – Package Installer in R or Tools – Install Packages… in RStudio), or by typing
install.packages("rio")
in the R command line. Next, in every R session or script, you need to load the packages you want to use: type
library("rio")
in the R command line. You only need to install packages once on your (or any) computer, but you need to load them anew in each R session.
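One common way to combine both steps in a script is to install a package only when it is missing and then load it. The sketch below is one possible pattern, not a required one. It uses the “stats” package, which ships with R, as a stand-in so that nothing is downloaded when you try it; the object name pkg is an assumption for illustration, and you would put "rio" (or any other package name) in its place.

```r
# Name of the package to load; "stats" ships with R, so nothing is downloaded.
# Replace it with "rio" (or any other package) in your own script.
pkg <- "stats"

# Install the package only if it is not yet available on this computer
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package for the current session
library(pkg, character.only = TRUE)
```

The argument character.only = TRUE tells library() that pkg is an object holding the package name rather than the name itself.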
Alternatively, if you only want to access one particular function from a package, but do not want to load the whole package, you can use the packagename::function option.
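As a short illustration of this notation, the following lines call one function each from the “stats” and “tools” packages, both of which are installed together with R. The file name "mydata.csv" is only a text string used for the example; no such file needs to exist.

```r
# Call median() from the stats package without loading the package
stats::median(c(1, 5, 9))

# The same pattern works for any installed package, for example tools:
# file_ext() returns the extension of a file name
tools::file_ext("mydata.csv")
```

The first call returns 5 and the second returns "csv".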
In most cases, it is useful to set a project-specific working directory, especially if you work with many files and want to create graphics that you want to have printed to .pdf or .eps files. You can set the working directory with this command:
setwd("/Users/johanneskarreth/Documents/Dropbox/Uni/9 - ICPSR/2021/Bayes/Slides/Lab 1")
You can typically see your current working directory on top of the R console in RStudio, or you can obtain the working directory with this command:
getwd()
## [1] "/Users/johanneskarreth/Documents/Dropbox/Uni/9 - ICPSR/2021/Bayes/Slides/Lab 1"
RStudio also offers a very useful option to set up a whole project (File – New Project…). Projects automatically create a working directory for you. Even though we won’t use projects in this workshop, I recommend them as an easy and failproof way to manage files and directories.
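If you prefer to set the working directory by hand, the sketch below shows one way to do so without typing long paths. It relies on base R functions only. The folder name "Lab1" and the use of tempdir() as a stand-in for your own project folder are assumptions made so that the example runs on any computer.

```r
# Store the current working directory so that we can return to it later
old_wd <- getwd()

# Build a path in a platform-independent way; tempdir() stands in for
# your own project folder, and "Lab1" is a hypothetical folder name
project_dir <- file.path(tempdir(), "Lab1")
dir.create(project_dir, showWarnings = FALSE)

# Switch to the project folder, check where we are, and switch back
setwd(project_dir)
basename(getwd())
setwd(old_wd)
```

file.path() inserts the correct separator between folder names, which avoids problems when a script moves between Windows and other operating systems.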
Within R, you can access the help files for any command that exists by typing ?commandname or, for a list of the commands within a package, by typing help(package = packagename). So, for instance:
?rnorm
help(package = "rio")
There are many resources on how to structure your R workflow (think of routines like the ones suggested by J. Scott Long in The Workflow of Data Analysis Using Stata), and I encourage you to search for and maintain a consistent approach to working with R. It will make your life much, much easier with regard to collaboration, replication, and general efficiency. We recommend following the Project TIER protocol. In addition, here are a few really important points that you might want to consider as you start using R:
attach() command.
As R has become one of the most popular programs for statistical computing, the number of resources in print and online has increased dramatically. Searching for terms like “introduction to R software” will return a huge number of results.
Some (of the many) good resources that I have encountered and found useful are:
R is an object-oriented programming language. This means that you, the user, create objects and work with them. Objects can be of different types. To create an object, first type the object name, then the “assignment character”, a leftward arrow <-, then the content of the object. To display an object, simply type the object’s name, and it will be printed to the console.
You can then apply functions to objects. Most functions have names that are somewhat descriptive of their purpose. For example, mean() calculates the mean of the numbers within the parentheses, and log() calculates the natural logarithm of the number(s) within the parentheses.
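The following lines sketch how these two functions behave when applied to an object. The object name values and its contents are made up for illustration.

```r
# A small numeric vector to work with
values <- c(2, 4, 6, 8)

# mean() returns the arithmetic mean of all elements
mean(values)

# log() is applied to each element and returns natural logarithms
log(values)

# exp() reverses log(), which recovers the original values
exp(log(values))
```

Note the difference: mean() summarizes the whole object in one number (here 5), while log() returns one result per element.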
Functions consist of a function name, the function’s arguments, and specific values passed to the arguments. In symbolic terms:
function_name(argument1 = value,
argument2 = value)
Here is a specific example of the function abbreviate, its first argument names.arg, and the value "Regression" that I provide to this argument:
abbreviate(names.arg = "Regression")
## Regression
## "Rgrs"
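To see how arguments are matched, the sketch below uses seq(), a base R function that also appears later in this tutorial. Arguments that are named can be given in any order; arguments without names are matched by their position. The last line uses args(), which prints the arguments of a function and their default values.

```r
# Arguments matched by name can appear in any order
seq(from = 2, to = 10, by = 2)
seq(by = 2, to = 10, from = 2)

# Arguments without names are matched by position (from, to, by)
seq(2, 10, 2)

# args() prints a function's arguments and their default values
args(abbreviate)
```

All three calls to seq() return the same sequence 2, 4, 6, 8, 10. Naming arguments takes a few more keystrokes but makes scripts easier to read.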
The following are the types of objects you need to be familiar with:
Scalars
Vectors of different types
Matrices
Data frames
Lists
Below, you find some more specific examples of different types of objects.
x <- 1
x
## [1] 1
y <- 2
x + y
## [1] 3
x * y
## [1] 2
x / y
## [1] 0.5
y^2
## [1] 4
log(x)
## [1] 0
exp(x)
## [1] 2.718282
xvec <- c(1, 2, 3, 4, 5)
xvec
## [1] 1 2 3 4 5
xvec2 <- seq(from = 1, to = 5, by = 1)
xvec2
## [1] 1 2 3 4 5
yvec <- rep(1, 5)
yvec
## [1] 1 1 1 1 1
zvec <- xvec + yvec
zvec
## [1] 2 3 4 5 6
mat1 <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE)
mat1
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
mat2 <- matrix(data = seq(from = 6, to = 3.5, by = -0.5),
               nrow = 2, byrow = TRUE)
mat2
##      [,1] [,2] [,3]
## [1,]  6.0  5.5  5.0
## [2,]  4.5  4.0  3.5
mat1 %*% mat2
##      [,1] [,2] [,3]
## [1,]   15 13.5   12
## [2,]   36 32.5   29
## [3,]   57 51.5   46
y <- c(1, 1, 3, 4, 7, 2)
x1 <- c(2, 4, 1, 8, 19, 11)
x2 <- c(-3, 4, -2, 0, 4, 20)
name <- c("Student 1", "Student 2", "Student 3", "Student 4",
          "Student 5", "Student 6")
mydata <- data.frame(name, y, x1, x2)
mydata
##        name y x1 x2
## 1 Student 1 1  2 -3
## 2 Student 2 1  4  4
## 3 Student 3 3  1 -2
## 4 Student 4 4  8  0
## 5 Student 5 7 19  4
## 6 Student 6 2 11 20
You can use R to generate (random) draws from distributions. This will be important in the first problem set. For instance, to generate 1000 draws from a normal distribution with a mean of 5 and standard deviation of 10, you would write:
draws <- rnorm(1000, mean = 5, sd = 10)
summary(draws)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -29.112 -1.378 5.391 5.299 12.136 39.169
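Because these draws are random, your numbers will differ from the ones printed above each time you run the code. The sketch below shows one way to make draws reproducible with set.seed(), and it checks that the sample mean and standard deviation are close to the values passed to rnorm(). The seed value 2021 and the object names draws_a and draws_b are arbitrary choices for this illustration.

```r
# Fix the random seed so that the same draws are produced on every run
set.seed(2021)
draws_a <- rnorm(1000, mean = 5, sd = 10)

# Resetting the seed to the same value reproduces the same draws
set.seed(2021)
draws_b <- rnorm(1000, mean = 5, sd = 10)

# Identical seeds give identical draws
identical(draws_a, draws_b)

# Sample mean and standard deviation should be close to 5 and 10
mean(draws_a)
sd(draws_a)
```

With 1000 draws, the sample mean and standard deviation will not match 5 and 10 exactly, but they should be near those values.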
You can then use a variety of plotting commands (see below for more) to visualize your draws:
draws <- rnorm(1000, mean = 5, sd = 10)
plot(density(draws), main = "This is a plot title",
     xlab = "Label for the X-axis", ylab = "Label for the Y-axis")
draws <- rnorm(1000, mean = 5, sd = 10)
hist(draws)
vec <- c(4, 1, 5, 3)
vec[3]
## [1] 5
mydata$x1
## [1] 2 4 1 8 19 11
mydata$names
## NULL
mat1[ , 1]
## [1] 1 3 5
mat1[1, ]
## [1] 1 2
mylist <- list(x1, x2, y)
mylist[[1]]
## [1] 2 4 1 8 19 11
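The commands above pick out elements by position or by name. As a further sketch, the lines below show three related patterns that are often useful: indexing a data frame by row and column, keeping only the rows that meet a condition, and giving names to the elements of a list. The objects scores and results are made up for this illustration and are not part of the workshop data.

```r
# A small data frame, defined here so that the example is self-contained
scores <- data.frame(name = c("A", "B", "C"), y = c(1, 3, 7))

# Rows and columns of a data frame are indexed like a matrix: [row, column]
scores[2, "y"]

# A logical condition inside the brackets keeps only the matching rows
scores[scores$y > 2, ]

# List elements can be given names and then accessed with $
results <- list(outcome = scores$y, labels = scores$name)
results$outcome
```

The first command returns 3, the second returns the two rows for B and C, and the third returns the vector 1, 3, 7.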
In most cases, you will not type up your data by hand, but use data sets that were created in other formats. You can easily import such data sets into R.
The “rio” package allows you to import data sets in a variety of formats with one single function, import(). You need to first load the package:
library("rio")
The import() function “guesses” the format of the data from the file type extension, so that a file ending in .csv is read in as a comma-separated value file. If the file type extension does not reveal the type of data (e.g., a tab-separated file saved with a .txt extension), you need to provide the format argument, as you see in the first example below. See the help file for import() for more information.
Note that for each command, many options (in R language: arguments) are available; you will most likely need to work with these options at some time, for instance when your source dataset (e.g., in Stata) has value labels. Check the help files for the respective command in that case.
mydata_from_tsv <- import("http://www.jkarreth.net/files/mydata.txt", format = "tsv")
head(mydata_from_tsv)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
Alternatively, use read.table() specifically for tab-separated files:
mydata_from_tsv <- read.table("http://www.jkarreth.net/files/mydata.txt", header = TRUE)
head(mydata_from_tsv)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
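If you want to try read.table() without an internet connection, the following sketch first writes a small tab-separated file to a temporary location and then reads it back in. It uses base R only. The object names toy, toy_file, and toy_in are assumptions for this illustration; the three rows of numbers are simply copied from the output above.

```r
# Create a small data frame and write it to a temporary tab-separated file
toy <- data.frame(y = c(-0.56, -0.23, 1.56), x1 = c(1.22, 0.36, 0.40))
toy_file <- tempfile(fileext = ".txt")
write.table(toy, file = toy_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the file back in; header = TRUE treats the first row as variable names,
# and sep = "\t" states explicitly that columns are separated by tabs
toy_in <- read.table(toy_file, header = TRUE, sep = "\t")
head(toy_in)
```

Stating the separator explicitly is optional for simple files like this one, but it helps when values themselves contain spaces.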
mydata_from_csv <- import("http://www.jkarreth.net/files/mydata.csv")
head(mydata_from_csv)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
Alternatively, use read.csv() specifically for comma-separated files:
mydata_from_csv <- read.csv("http://www.jkarreth.net/files/mydata.csv")
head(mydata_from_csv)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
mydata_from_spss <- import("http://www.jkarreth.net/files/mydata.sav")
head(mydata_from_spss)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
mydata_from_dta <- import("http://www.jkarreth.net/files/mydata.dta")
head(mydata_from_dta)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
Alternatively, use read.dta() from the “foreign” package specifically for Stata files:
library("foreign")
mydata_from_dta <- read.dta("http://www.jkarreth.net/files/mydata.dta")
head(mydata_from_dta)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
To obtain descriptive statistics of a dataset, or a variable, use the summary() command:
summary(mydata_from_dta)
## y x1 x2
## Min. :-1.2700 Min. :-1.970 Min. :-1.6900
## 1st Qu.:-0.5325 1st Qu.:-0.325 1st Qu.:-1.0600
## Median :-0.0800 Median : 0.380 Median :-0.6800
## Mean : 0.0740 Mean : 0.208 Mean :-0.4270
## 3rd Qu.: 0.3775 3rd Qu.: 0.650 3rd Qu.: 0.0575
## Max. : 1.7200 Max. : 1.790 Max. : 1.2500
summary(mydata_from_dta$y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2700 -0.5325 -0.0800 0.0740 0.3775 1.7200
You can access particular quantities, such as standard deviations and quantiles (in this case the 5th and 95th percentiles), with the respective functions:
sd(mydata_from_dta$y)
## [1] 0.9561869
quantile(mydata_from_dta$y, probs = c(0.05, 0.95))
## 5% 95%
## -1.009 1.648
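To obtain one such quantity for every variable in a dataset at once, one option is sapply(), which applies a function to each column of a data frame. The sketch below uses a small made-up data frame named toy so that it runs without downloading any data.

```r
# Toy data frame; the values are made up for illustration
toy <- data.frame(y = c(1, 1, 3, 4, 7, 2), x1 = c(2, 4, 1, 8, 19, 11))

# sapply() applies a function to every column and returns a named vector
sapply(toy, mean)
sapply(toy, sd)

# Median and interquartile range of a single variable
median(toy$y)
IQR(toy$y)
```

Here the column means are 3 for y and 7.5 for x1, and the median of y is 2.5.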
R offers several options to create figures. We will work with the so-called “base graphics”, mostly using the plot() function, and the “ggplot2” package.
R’s base graphics are very versatile and, in our workshop, ideal for creating quick plots to inspect objects. These graphs are built sequentially, beginning with the plot() command applied to an object. So, for instance, to plot the density of 1000 draws from a normal distribution, you would use the following code. I’m using the set.seed() command here before every simulation to ensure that the same values are drawn when you try these commands and make these plots.
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(1000, mean = 0, sd = 2)
plot(density(dist1))
lines(density(dist2), col = "red")
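Building a plot sequentially means that each further command adds a layer to the figure opened by plot(). As a sketch of this idea, the code below adds a reference line and a legend on top of two densities. The axis limits, title, and legend labels are my own choices for this illustration; unlike the code above, I set the seed only once here, so the two sets of draws are independent of each other.

```r
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
dist2 <- rnorm(n = 1000, mean = 0, sd = 2)

# Start the plot with the first density; xlim leaves room for the wider one
plot(density(dist1), xlim = c(-8, 8), main = "Two normal densities",
     xlab = "Value")

# Add the second density and a dashed vertical line at the common mean
lines(density(dist2), col = "red")
abline(v = 0, lty = 2)

# Add a legend that identifies both curves
legend("topright", legend = c("sd = 1", "sd = 2"),
       col = c("black", "red"), lty = 1)
```

Commands such as lines(), abline(), and legend() only work after plot() has opened a figure; running them on their own returns an error.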
The “ggplot2” package has become popular because its language and plotting sequence can be somewhat more convenient (depending on users’ background), especially when working with more complex datasets. For plotting Bayesian model output, ggplot2 offers some useful features. I will mostly use ggplot2 in this workshop because (in my opinion) it offers a quick and scalable way to produce figures that are useful for diagnostics and publication-quality output alike.
ggplot2 first needs to be loaded as an external package. Its key commands are ggplot() and various types of plots, passed to R via geom_ commands. All commands are added via +, either in one line or in a new line to an existing ggplot2 object. The command below contains a couple more data manipulation steps that will come in handy for us later; we will discuss them in the workshop. Here, I use the tidyr::pivot_longer command to reshape the data so they can be plotted in one figure. When trying the code below, have a look at the structure of the dist.long object to see what’s going on.
library("ggplot2"); library("tidyr")
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(1000, mean = 0, sd = 2)
dist.df <- data.frame(dist1, dist2)
dist.long <- pivot_longer(data = dist.df, cols = everything())
head(dist.long)
## # A tibble: 6 x 2
## name value
## <chr> <dbl>
## 1 dist1 -0.560
## 2 dist2 -1.12
## 3 dist1 -0.230
## 4 dist2 -0.460
## 5 dist1 1.56
## 6 dist2 3.12
normal.plot <- ggplot(data = dist.long, aes(x = value, colour = name, fill = name))
normal.plot <- normal.plot + geom_density(alpha = 0.5)
normal.plot
ggplot2 offers plenty of opportunities for customizing plots; we will also encounter these later on in the workshop. You can also have a look at Winston Chang’s R Graphics Cookbook for plenty of examples of ggplot2 customization: http://www.cookbook-r.com/Graphs.
Plots created via base graphics can be printed to a PDF file using the pdf() command. This code:
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(1000, mean = 0, sd = 2)
pdf("normal_plot.pdf", width = 5, height = 5)
plot(density(dist1))
lines(density(dist2), col = "red")
dev.off()
## quartz_off_screen
## 2
will print a plot named normal_plot.pdf of the size 5 by 5 inches to your working directory.
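The order of the three steps matters: pdf() opens the file, the plotting commands draw into it, and dev.off() closes it. A file that was never closed with dev.off() usually cannot be opened. The sketch below repeats the pattern with a temporary file and then checks that the file exists. The file name "example_plot.pdf" and the object name out_file are hypothetical.

```r
# Write the plot to a temporary folder; use a file name of your choice instead
out_file <- file.path(tempdir(), "example_plot.pdf")

pdf(out_file, width = 5, height = 5)  # open the PDF device
plot(density(rnorm(1000)))            # draw into the file
invisible(dev.off())                  # close the device to finish the file

# Confirm that the file was written
file.exists(out_file)
```

Wrapping dev.off() in invisible() only suppresses the message that R prints when the device is closed.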
Plots created with ggplot2 are best saved using the ggsave() command:
ggsave(plot = normal.plot, filename = "normal_ggplot.pdf", width = 5, height = 5, units = "in")
For project management and replication purposes, it is advantageous to combine your data analysis and writing in one framework. RMarkdown, Sweave and knitr are great solutions for this. The RStudio website has a good explanation of these options: http://rmarkdown.rstudio.com and https://support.rstudio.com/hc/en-us/articles/200552056-Using-Sweave-and-knitr. This tutorial and slides are written using knitr. Depending on interest, I may be able to offer another lab that will address RMarkdown as a tool for reproducible research, among other topics.
/Users/thomasbayes/Work/ICPSR/Homework/Lab1.
Go to http://gss.norc.org/ and download the General Social Survey raw data for 2018 in SPSS or Stata format. Save this file in an assignment-specific working directory. Then, create an R script that performs the following operations:
Comments
R scripts contain two types of text: R commands and comments. Commands are executed and perform actions. Comments are part of a script, but they are not executed. Comments begin with the # sign. Anything that follows after a # sign in the same line will be ignored by R. Compare what happens with the following two lines:
You should use comments frequently to annotate your script files in order to explain to yourself what you are doing in a script file.
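As a minimal sketch of how commands and comments can be mixed in a script, consider the following lines. The object name result is made up for this illustration.

```r
# This whole line is a comment, so R does nothing with it
1 + 1   # R computes the sum; the text after the # sign is ignored

# 2 * 3 is not computed here because the line starts with a # sign
result <- 2 * 3  # this assignment is executed
result
```

R returns 2 for the sum and 6 for result; none of the text after a # sign appears in the output.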