The basic tidyverse method for removing observations is the filter function. filter only keeps observations where the condition is met for a given variable.
Basic Conditions
#creating a copy to not change original dataset
tutorial_copy <- tutorial
#piping is used to work within the tutorial dataset
tutorial_copy <- tutorial_copy %>%
#only keeping observations where gender is equal to "F"
filter(gender == "F")
Logical statements can be used for more complicated filters.
More Conditions
#creating a copy to not change original dataset
tutorial_copy <- tutorial
#piping is used to work within the tutorial dataset
tutorial_copy <- tutorial_copy %>%
#only keeping observations where gender is equal to M or F
filter(gender == "F" | gender == "M")
#restoring tutorial copy
tutorial_copy <- tutorial
#piping is used to work within the tutorial dataset
tutorial_copy <- tutorial_copy %>%
#only keeping observations where gender doesn't equal F
filter(gender != "F")
Removing variables from dataset
To choose variables to keep, we use the select function
tutorial_copy <- tutorial
tutorial_copy <- tutorial_copy %>%
select(gender, score)
To remove variables, use the same function but with ‘-’ before the variables we wish to remove
tutorial_copy <- tutorial
tutorial_copy <- tutorial_copy %>%
select(-gender, -score)
New Variables
To create new variables, use the mutate function. Multiple variables can be created at once with this function.
tutorial_copy <- tutorial
#just one variable
tutorial_copy <- tutorial_copy %>%
mutate(score_x_2 = score * 2)
##multiple
tutorial_copy <- tutorial_copy %>%
mutate(score_x_4 = score * 4,
score_x_3 = score * 3)
ifelse is a very useful function for creating categorical variables. It takes three arguments: a condition and two outcomes. It assigns each observation the first outcome if it meets the condition, and the second outcome if not.
tutorial_copy <- tutorial
#this function creates the variable score_greaterthan_10, which is equal to 1 if score > 10, and 0 otherwise
tutorial_copy <- tutorial_copy %>%
mutate(score_greaterthan_10 = ifelse(score > 10, 1, 0))
Nested ifelse statements can create more than two buckets. There are some other functions that can do this more efficiently, depending on the specific task, but that is beyond the scope of this tutorial.
tutorial_copy <- tutorial
tutorial_copy <- tutorial_copy %>%
#The first ifelse states that if score is < 10, score_buckets should equal 0; if score is >= 10, we move into a second ifelse statement.
mutate(score_buckets = ifelse(score < 10, 0,
ifelse(score >= 10 & score < 20, 1, 2)))
#the second ifelse is only evaluated when the first condition isn't met. It sets score_buckets to 1 if score is at least 10 and less than 20, and to 2 otherwise.
Renaming Variables
Renaming variables is simple: use the rename function, with the new name on the left and the old name on the right.
tutorial_copy <- tutorial
tutorial_copy <- tutorial_copy %>%
rename(identification = id)
Reshaping
Sometimes, it will be important to change the format of your data from wide to long or long to wide. A long dataset changes the unit of analysis such that each row contains less data, and as a result, there are more rows. This example should show the difference.
Note that both datasets show the exact same information. However, it is usually easier to work with one type of data, depending on the task at hand.
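Since the side-by-side example is not reproduced here, a minimal sketch of the same idea, using a hypothetical two-person dataset:

```r
# Hypothetical wide data: one row per person, two score columns
wide <- data.frame(id = c(1, 2), score = c(10, 20), score2 = c(15, 25))

# The same data in long form: one score per row, with a new column
# recording which type of score each value is
long <- data.frame(
  id = c(1, 2, 1, 2),
  score_type = c("score", "score", "score2", "score2"),
  score = c(10, 20, 15, 25)
)

nrow(wide)  # 2 rows, two scores per row
nrow(long)  # 4 rows, one score per row
```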

(a) Wide to long
Gather is used to reshape data from wide to long. This transformation tends to be the more common one. Currently, each row contains two scores: score and score2. When we reshape the data to long, we will create a new variable, ‘score_type’, that tells us which type of score each value is. Each row will then contain only one score instead of two, which means that the number of rows will double.
The gather function takes four arguments:
1) The dataset, which in this case is piped
2) A key string, the name of the new column used to distinguish between the different types of values
3) A value string, the name of the new column holding the combined outcome variable, in this case score
4) A vector of the columns that should be combined into one, in this case score and score2
tutorial_long <- tutorial %>%
gather("score_type", "score", c(score, score2))
While this example gathers only two columns, gather can be used to combine many. Gather isn’t the easiest command to use, so make sure you know exactly what you want your data to look like before using it.
(b) Long to wide
Spread is the function used to reshape data from long to wide. The tutorial_long dataset should now contain 12 observations. We will shift it back to its original form using spread.
Spread takes three arguments:
1) The dataset, which in this case is piped
2) A key, the variable that describes which type of data each row holds. In this case, it is score_type, as it distinguishes between score and score2
3) A value, the variable that contains the information, which in this case is score
tutorial_wide <- tutorial_long %>%
spread(score_type, score)
Now, each row contains two scores, one of each score type. As with gather, this command can be tricky to use, and you must have a clear understanding of your data.
Missing Values
In R, is.na(variable) evaluates to TRUE if the variable is missing. !is.na(variable) does the opposite.
tutorial_copy <- tutorial
tutorial_copy <- tutorial_copy %>%
mutate(missing_id = is.na(id))
tutorial_copy <- tutorial_copy %>%
mutate(non_missing_id = !is.na(id))
Datasets: Altering or Creating New
In all of these examples, I have been creating a copy of my dataset and then changing that copy. That is unnecessary in R. For example, to create a tutorial_copy dataset with a missing_id variable, I can simply do:
tutorial_copy <- tutorial %>%
mutate(missing_id = is.na(id))
However, I could also simply adjust my original tutorial dataset. While the ability to hold multiple datasets at once is a convenient feature of R, that doesn’t mean you want to create a new dataset with every single change. It is up to the user to figure out how to create protocols for when to create a new dataset, and when to adjust an old one.
Summary statistics
R has somewhat more challenging syntax than Stata for summary statistics, but it also allows a greater level of customization in the choice of summary statistics. You can also easily create and work with summary statistic datasets, which tends to be a poor option in Stata.
Summarize
Summarize is the main tidyverse tool for both looking at summary statistics and creating summary datasets. Like other tidyverse commands, it works best with piping. A variety of statistics, such as the mean, min/max, number of observations, standard deviation, and more, can be calculated this way.
#First create a dataset without missing values, or summarize returns NA
tutorial_no_NA <- tutorial %>%
filter(!is.na(score))
##We can compute one specific summary statistic
tutorial_no_NA %>%
summarize(avg_score = mean(score))
## avg_score
## 1 25.9
##Or multiple
tutorial_no_NA %>%
summarize(avg_score = mean(score), sd_score = sd(score))
## avg_score sd_score
## 1 25.9 15.82157
##We can use filter to summarize conditionally
tutorial_no_NA %>%
filter(score > 5) %>%
summarize(avg_score = mean(score), sd_score = sd(score))
## avg_score sd_score
## 1 28.66667 13.98213
##We can create a dataset of summary statistics
tutorial_summary <- tutorial_no_NA %>%
summarize(avg_score = mean(score), sd_score = sd(score))
group_by
The ‘group_by’ function is similar to Stata’s ‘by’ prefix. It splits the data into groups, and then applies the next set of code to each group individually. This allows you to get group counts, similar to tab.
##Standard count of gender
tutorial %>%
group_by(gender) %>%
summarize(count = n())
## # A tibble: 2 x 2
## gender count
## <chr> <int>
## 1 F 5
## 2 M 6
##Can do multiple groups
tutorial %>%
group_by(gender, position) %>%
summarize(count = n())
## # A tibble: 4 x 3
## # Groups: gender [2]
## gender position count
## <chr> <chr> <int>
## 1 F Rassistant 2
## 2 F Rassociate 3
## 3 M Intern 2
## 4 M Surveyor 4
##And can filter
tutorial %>%
filter(score > 10) %>%
group_by(gender, position) %>%
summarize(count = n())
## # A tibble: 4 x 3
## # Groups: gender [2]
## gender position count
## <chr> <chr> <int>
## 1 F Rassistant 2
## 2 F Rassociate 3
## 3 M Intern 1
## 4 M Surveyor 3
We can go beyond counts and get group means, maxes, and more.
##Mean Score By Gender
tutorial %>%
group_by(gender) %>%
summarize(mean = mean(score))
## # A tibble: 2 x 2
## gender mean
## <chr> <dbl>
## 1 F 32.4
## 2 M NA
##With Filtering
tutorial %>%
filter(score > 10) %>%
group_by(gender) %>%
summarize(mean = mean(score))
## # A tibble: 2 x 2
## gender mean
## <chr> <dbl>
## 1 F 32.4
## 2 M 24
group_by is useful for creating new summary statistic datasets. Many of the ggplot commands work much better with data in this format.
gender_means <- tutorial %>%
group_by(gender) %>%
summarize(mean = mean(score))
Data Visualization
ggplot
ggplot is an R package designed to be a ‘grammar of graphics’. This means that rather than memorizing syntax for a given output, creating ggplots requires thinking about how to map data onto different visual features. This allows for much greater control of graphics, and for more logically constructed graphics. For more information about the philosophy of ggplot, read this. The ggplot command requires two inputs, data and aesthetics, although you will almost always want to include geoms, and often themes, labels, and scaling. Through this section, we will iteratively build a beautiful plot.
Aesthetics
Aesthetics describe how your data is being mapped onto a physical plot (what this means may be unclear at first, so hang in there). Every ggplot requires at least one aesthetic, but more can be used.
#aes is short for aesthetics
p <- ggplot(data = tutorial, aes(x = score, y = score2))
p
As you can see, the x axis is score, and the y axis is score2. This is what aesthetics does; it takes a variable from the dataset and maps that variable onto a feature of the graph. In this case, we only use two axes, but much more is possible.
Geoms
We currently have an empty graph, which isn’t telling us much of anything. What we now want to do is add a geom, which are different kinds of physical shapes that represent the data that has been mapped to aesthetics.
#to add features to a ggplot, we use +
p <- p + geom_point()
p
You can use different kinds of geoms, still using the same aesthetics
#resetting p
p <- ggplot(data = tutorial, aes(x = score, y = score2))
#don't worry about the syntax, but this adds a regression line
p <- p + geom_smooth(method = 'lm', se = F)
p
You can, and often should, use multiple geoms on the same graph.
#this adds the points on top of the regression line
p <- p + geom_point()
p
Some aesthetics work best with particular geoms. For example, we can now add gender as a third aesthetic, and map it to color.
p <- ggplot(data = tutorial, aes(x = score, y = score2, color = gender)) +
geom_point() +
geom_smooth(method = 'lm', se = F)
p
If you want some geoms to have an aesthetic but not others, you can use aes in the geom itself, rather than in the ggplot function. In this plot, the points are colored by gender, but we only have a single regression line.
p <- ggplot(data = tutorial, aes(x = score, y = score2)) +
geom_point(aes(color = gender)) +
geom_smooth(method = 'lm', se = F)
p
As you can see, with just the basic structure of aesthetics and geoms, you can build more and more complex plots. You are also able to build iteratively, as you can start with single aesthetics and geoms, and add and take away features until your data is shown most clearly.
Some common geoms are: * geom_point (points for scatterplots) * geom_smooth (estimate lines, can be linear or nonlinear) * geom_bar (bar charts) * geom_density (density plots) * geom_histogram (histograms) * geom_abline (unfitted lines, for example to show what a linear relationship would look like) * geom_errorbar (error bars) * geom_text (for labeling bar chart estimates, for example)
Scales
By default, ggplot scales the graph to the data, so p is scaled from 0-50 on the x axis, and 0-40 on the y axis. If we assume that score and score2 are similar in nature, we would want them scaled identically, or we may mislead the audience. To do this, we use the function scale_AESTHETIC_SCALING, where in this case AESTHETIC is y, and SCALING is continuous.
p <- p + scale_y_continuous(limits = c(0, 50))
p
Any aesthetic can be adjusted using a scale command, with continuous, discrete, or manual versions depending on the type of data.
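For instance, a manual scale lets you pick the values yourself. A minimal sketch, using hypothetical data and arbitrary color choices:

```r
library(ggplot2)

# Hypothetical data; the specific colors below are arbitrary choices
df <- data.frame(x = 1:4, y = c(2, 4, 3, 5), g = c("F", "M", "F", "M"))

p_manual <- ggplot(df, aes(x = x, y = y, color = g)) +
  geom_point() +
  # scale_color_manual maps each level of g to a chosen color
  scale_color_manual(values = c("F" = "purple", "M" = "darkgreen"))
```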
Labels and Legends
Our graph, p, now depicts certain aspects of our data. However, it isn’t very clear what exactly it is showing, as there is no title, and the legend and axis labels are taken directly from the data, and are therefore unclear. The main function used for labeling is labs, which can set the title and label any aesthetic.
p <- p + labs(title = "Scores By Gender", x = "Score 1", y = "Score 2", color = "Gender")
p
We can change the legend labels using the scale function from the previous section.
p <- p + scale_color_discrete(labels = c("Female", "Male"))
p
Themes
Now, all that is left to do is make the graph look nicer. To do this, we use the theme command. There are dozens of options that can be used for the theme command, so you will have to figure out for yourself what you think works best. The syntax is straightforward; this is how you would remove the panel grid.
p <- p + theme(panel.grid = element_blank())
p
The best practice is to create a theme and then use it across all charts. This is another situation in which R’s flexibility in storing objects is very useful. For example, we can create a theme like so.
my_theme <- theme_bw() +
theme(text = element_text(size = 10, face = "bold", color = "black"),
panel.grid = element_blank(),
axis.text = element_text(size = 10, color = "gray13"),
axis.title = element_text(size = 10, color = "black"),
legend.text = element_text(colour = "Black", size = 10),
legend.title = element_text(colour = "Black", size = 7),
plot.subtitle = element_text(size = 14, face = "italic", color = "black"))
And then use it for not only this graph, but all later graphs
It is even possible to have your theme be a function, which allows you to customize your theme for different situations, but that is a more advanced topic and unnecessary for creating excellent ggplots.
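As a sketch of that idea, a theme function might take a base text size as an argument (the function and argument names here are hypothetical, not part of ggplot):

```r
library(ggplot2)

# A hypothetical theme function: a base theme plus a caller-chosen text size
my_theme_fn <- function(base_size = 10) {
  theme_bw() +
    theme(text = element_text(size = base_size, face = "bold", color = "black"),
          panel.grid = element_blank())
}

# Usage on any plot p: p + my_theme_fn(base_size = 12)
```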
Data Types
As you try creating ggplots, you may sometimes hit hard-to-understand errors or produce useless graphs. For example,
bad <- ggplot(data = tutorial, aes(x = gender)) + geom_density()
bad
is entirely meaningless, while
#error <- ggplot(data = tutorial, aes(x = gender)) + geom_bar(stat = 'identity')
#error
gives you an error, even though the syntax is accurate (it is commented out here to keep the file from crashing). This is because certain plots only work with certain kinds of data. For example, a bar chart will work best with one continuous variable and one categorical variable, while scatterplots work well with two continuous variables. Whether a misuse of a variable results in an error or a bad graph depends on the specific type of error, but either way, you need to understand your data before trying to visualize it.
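For contrast, a bar chart that does work: with a single categorical variable, geom_bar counts the observations itself. A minimal sketch with hypothetical data:

```r
library(ggplot2)

# Hypothetical categorical data
df <- data.frame(gender = c("F", "F", "M", "M", "M"))

# With no y aesthetic, geom_bar counts the rows in each category
good <- ggplot(df, aes(x = gender)) + geom_bar()
```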
Advanced ggplot
Once you have gotten comfortable with ggplot, there are some more complicated packages that can be used alongside ggplot to create some more advanced graphics.
- shiny – allows you to create interactive dashboards, like this
- ggmap – for creating and customizing maps
- ggraph/dendextend – For higher-dimensional data, can create network plots, dendrograms
Examples and more info
Here are some articles that have more examples of ggplot syntax and technique, along with some more information about the package.
Regressions
Basic Regressions
A standard linear regression uses the lm command, which takes two arguments: a formula and data. The syntax works as follows.
reg <- tutorial %>%
#score is the dependent variable, score2 is the independent variable. Our formula is score = B(score2) + error
#data = , is used because we pass in the tutorial dataset using piping
lm(score ~ score2, data = .)
We can use multiple independent variables/controls using the same syntax, with + separating different regression inputs in the formula.
reg_gend <- tutorial %>%
#In this case, our formula is score = B1(score2) + B2(gender) + error
lm(score ~ score2 + gender, data = .)
And we can measure interaction effects using * in our formula.
reg_gend_interaction <- tutorial %>%
lm(score ~ score2 + gender + score2*gender, data = .)
Output
You may have noticed that there hasn’t been any output from the previous regressions. This is because all we did was create the regression object (which happens to be a list). There are a few ways to see regression results.
The easiest of these is summary. For example, running summary(reg) prints the following:
##
## Call:
## lm(formula = score ~ score2, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.835 -10.674 -5.285 14.339 20.352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7007 10.3570 1.323 0.227
## score2 0.3624 0.3219 1.126 0.297
##
## Residual standard error: 15.12 on 7 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.1533, Adjusted R-squared: 0.03237
## F-statistic: 1.268 on 1 and 7 DF, p-value: 0.2973
tutorial %>%
lm(score ~ score2 + gender + score2*gender, data = .) %>%
summary
##
## Call:
## lm(formula = score ~ score2 + gender + score2 * gender, data = .)
##
## Residuals:
## 1 2 3 4 5 6 8 9 11
## 4.719 9.389 -5.724 16.731 -13.833 18.186 -17.087 -3.348 -9.033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.9046 15.7829 1.451 0.206
## score2 0.2356 0.4826 0.488 0.646
## genderM -17.6895 22.3902 -0.790 0.465
## score2:genderM 0.2675 0.6951 0.385 0.716
##
## Residual standard error: 16.31 on 5 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.2959, Adjusted R-squared: -0.1265
## F-statistic: 0.7005 on 3 and 5 DF, p-value: 0.591
When we create a regression and store it as an object, we have created a list that contains tables. When including regression output in a report, using a table display package such as kableExtra is the best method of outputting the results. The syntax is somewhat complicated, so it will not be included in this document. A quick internet search will provide examples and guides like this one.
Other regressions
Other regressions usually require a package, but use the same syntax as lm. For example, for a fixed effects regression, one must install the lfe package and use a similar command to the above, with ‘felm’ replacing ‘lm’. For a logistic regression, the glm function (included in base R’s stats package) replaces ‘lm’, with the additional argument family = binomial. There are many more regression types, and packages that can be employed as the circumstance requires.
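As a sketch, a logistic regression with glm, using a hypothetical binary outcome (whether score exceeds 10) and made-up data standing in for the tutorial dataset:

```r
library(dplyr)

# Made-up data in place of the tutorial dataset
df <- data.frame(score  = c(5, 12, 30, 8, 22, 3),
                 score2 = c(4, 6, 25, 9, 18, 2))

# Same piping pattern as lm, with family = binomial for a logistic model
logit <- df %>%
  mutate(high_score = ifelse(score > 10, 1, 0)) %>%
  glm(high_score ~ score2, data = ., family = binomial)
```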
Functional programming
It is not necessary to use a functional programming style to use R and the tidyverse effectively, especially for smaller tasks. In fact, we recommend that you learn the basics of R before switching to a functional style of programming, as you will need to be comfortable with R and its syntax. However, once you gain some experience, functions are an excellent tool for reducing errors, writing more readable code, finishing tasks faster, and debugging more easily. For longer, more complex programs, they are incredibly useful.
What are functions?
Functions are similar to commands in Stata – they take in some kind of input, and do something based on that input, like create a new dataset or a plot. To code in R is to use functions created by other people. These functions, of course, are designed to be quite general.
What is functional programming?
Functional programming is a method of programming in which a script with a large, complicated goal (like cleaning a dataset) is broken up into smaller, more modular, hopefully repeatable tasks. For each smaller task, you create a user-defined function, designed for your specific task. Your main script will then run only a few lines of code, where each line is a function you have created. Each function can include other, smaller functions, which can be used repeatedly.
Why use functional programming?
The main reason to use functional coding is to avoid repetition. Without functions, when you repeat the same task, you will either have to:
- create a loop, which can be unfeasible and is often confusing, as well as prone to bugs, OR
- copy and paste a lot of code, which is hard to follow, time-consuming, and creates a lot of opportunities for errors.
Functions allow you to do repetitive tasks with minimal amounts of code. For example, imagine that for a group of variables, you want to get rid of missing values and then use a command to predict the values for those observations. Rather than continuously retyping the same 4 lines, you can create a function containing those 4 lines and run that function on each variable. Furthermore, there is a tidyverse function that allows you to do all of this in only one line.
It may not seem like a big deal when the function is only 4 lines of code; why not just repeat it a few times? However, when tasks get more complicated, repeating them over and over is massively inefficient; retyping 100 lines of code for 12 variables with slightly different syntax is not how you want to spend your time coding, and it is also very difficult for another user to follow.
Syntax
Functions, like many other aspects of R, are stored as objects. Imagine I want to create a simple function called replace_missing_with_number, which, as the name suggests, takes the missing values in a vector and replaces them with a number. I assign the name replace_missing_with_number, and include three things in the function:
- 0, 1, or multiple inputs, that will be used in the function body
- A function body, which can include commands, other function calls, and more
- A return statement, which is the output of the function
#we define replace_missing_with_number as a function requiring two inputs: a variable and a number
#the function won't work unless variable is a vector and number is an integer
replace_missing_with_number <- function(variable, number) {
#the function body is included between { and }. In this case, it is only two lines, but functions can be 10s or 1000s of lines long
variable <- ifelse(is.na(variable), number, variable)
#At the end, a return command makes the function's output explicit (without it, R returns the last evaluated expression).
return(variable)
}
Now that the function has been defined, it can be used at any time.
score <- tutorial$score
id <- tutorial$id
#it works as promised with a vector
score <- replace_missing_with_number(score, 0)
#and can be reused
id <- replace_missing_with_number(id, 999)
#it does nothing with a dataframe
tutorial_copy <- replace_missing_with_number(tutorial, 0)
Apply
The apply family of functions is one of the most convenient features of R, and a great reason to use a functional style of programming. The three most basic commands – apply, lapply, and sapply – allow you to repeat a function over a matrix, list, or vector. This means that by creating a function and running one or two commands, you can apply a cleaning function to an entire dataset, without using any for loops!
The apply functions are somewhat advanced, so this tutorial will only show an example of sapply. Here is an introductory explanation, while here is a more advanced and complete explanation of all of the apply functions.
In this example, we will apply our replace_missing_with_number function to each column of our dataset. It only takes two lines (which could be condensed to one), much easier than a for loop.
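The example itself does not appear above, so here is a sketch of what it might look like, with the function repeated for completeness and hypothetical data in place of the tutorial dataset (note that sapply visits each column of a data frame and returns a matrix):

```r
replace_missing_with_number <- function(variable, number) {
  variable <- ifelse(is.na(variable), number, variable)
  return(variable)
}

# Hypothetical data with missing values in both columns
df <- data.frame(id = c(1, NA, 3), score = c(NA, 10, 20))

# sapply applies the function to every column; as.data.frame converts
# the resulting matrix back into a data frame
tutorial_clean <- as.data.frame(sapply(df, replace_missing_with_number, number = 0))
```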
Examples
This link discusses the reason for using functional programming in greater detail, and includes some more complicated functions.
R Markdown
This document was created entirely in R Markdown, R’s document-creation tool. Using R Markdown is incredibly simple and straightforward. In the RStudio interface, click File -> New File -> R Markdown. A tab will open in which you can input the author, title, and output (these can all be changed later). After clicking OK, a new tab will appear. It should look something like this

The way R Markdown works is that it doesn’t treat text as code unless specifically instructed to. So, if you write a command like read_csv(filepath) in the body text and then try to run it, nothing will happen. To create R code, use the following structure

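The chunk structure, shown here inside comments so the backtick delimiters don't clash with this document's own formatting: a line of three backticks followed by {r}, then the code, then three closing backticks.

```r
# An R Markdown code chunk looks like this:
#
# ```{r}
# print("and code like this")
# ```
```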
This way, you can switch back and forth between plain text like this
print("and code like this")
## [1] "and code like this"
When you want to see the output of your RMD script, click the knit button in the RStudio interface. You can then choose to knit to html, word, or pdf. There is no cost to knitting, so I suggest you knit early and often to make sure that everything is working.
Once you have a document you like, there are various publishing options. Most only work with html, which tends to be the most compatible with R Markdown in general.
There are also a variety of advanced options for R Markdown. These include creating a table of contents, putting different outputs into tabs to reduce the length of your report, hiding code so reports show only output and text, and a host of text formatting tools.
Cheat sheet of useful functions
This is not a complete list. To do other things, you will often need to google, but R is quite well-documented.
Installing packages