Appendix D — R Functions and Packages Reference

This appendix provides a quick reference for all the R functions and packages used throughout this book. Use this as a handy guide when you need to remember the syntax or arguments for a particular function.

D.1 Packages Used in This Book

D.1.1 tidyverse

The tidyverse is a collection of R packages designed for data science. Loading tidyverse attaches its core packages, including ggplot2, dplyr, tidyr, readr, purrr, and tibble.

What it’s used for:

  • Data manipulation and transformation (dplyr)
  • Data visualization (ggplot2)
  • Reshaping data (tidyr)
  • Reading data files (readr)

Loading:

library(tidyverse)

D.1.2 modelsummary

The modelsummary package creates publication-quality tables for regression results and descriptive statistics.

What it’s used for:

  • Creating formatted regression tables comparing multiple models
  • Generating descriptive statistics tables
  • Exporting tables to Word, LaTeX, or HTML

Loading:

library(modelsummary)

D.1.3 palmerpenguins

A dataset package containing measurements of penguins from Palmer Station, Antarctica. Great for learning data visualization and exploration.

What it’s used for:

  • Practice dataset for learning R
  • Data visualization examples
  • Regression examples

Loading:

library(palmerpenguins)
data(penguins)

D.1.4 wooldridge

Contains all datasets from Wooldridge’s Introductory Econometrics textbook.

What it’s used for:

  • Real-world econometric datasets
  • Regression examples (wages, crime, etc.)

Loading:

library(wooldridge)
data(wage1)  # Example: load the wage1 dataset

D.1.5 fixest

A fast and powerful package for fixed effects and panel data regression.

What it’s used for:

  • Panel data regression with fixed effects
  • Clustered standard errors
  • Very large datasets (faster than lm())

Loading:

library(fixest)

D.1.6 lmtest

Provides diagnostic tests for linear regression models.

What it’s used for:

  • Breusch-Pagan test for heteroskedasticity
  • Other specification tests

Loading:

library(lmtest)

D.1.7 sandwich

Provides robust covariance matrix estimators (heteroskedasticity-consistent standard errors).

What it’s used for:

  • Robust standard errors
  • Heteroskedasticity-consistent (HC) standard errors

Loading:

library(sandwich)
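
lmtest and sandwich are typically used together. The following is a minimal sketch, not an example taken from a specific chapter: the model and the choice of the "HC1" estimator are illustrative assumptions. bptest() runs the Breusch-Pagan test, and coeftest() reports the coefficient table using a robust covariance matrix from vcovHC().

```r
library(lmtest)    # bptest(), coeftest()
library(sandwich)  # vcovHC()

reg <- lm(mpg ~ hp + wt, data = mtcars)

# Breusch-Pagan test: the null hypothesis is homoskedasticity
bp <- bptest(reg)
bp$p.value

# Coefficient table with HC1 heteroskedasticity-robust standard errors
robust <- coeftest(reg, vcov. = vcovHC(reg, type = "HC1"))
robust
```

The point estimates are the same as in summary(reg); only the standard errors, t-statistics, and p-values change.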

D.1.8 patchwork

Easily combine multiple ggplot2 plots into a single figure.

What it’s used for:

  • Arranging multiple plots side by side
  • Creating multi-panel figures

Loading:

library(patchwork)
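
As a brief sketch of how plots are combined (the two plots here are illustrative choices), patchwork overloads + to place plots side by side and / to stack them:

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)

side_by_side <- p1 + p2  # two panels in one row
stacked      <- p1 / p2  # two panels in one column
side_by_side
```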

D.2 Data Inspection Functions

D.2.2 glimpse()

Get a compact overview of a data frame’s structure and column types.

Arguments:

  • x: A data frame

Example:

glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

D.2.3 summary()

Generate summary statistics for each variable.

Arguments:

  • object: A data frame, vector, or model object

Example:

summary(mtcars$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 

D.2.4 nrow() and ncol()

Return the number of rows or columns in a data frame.

Arguments:

  • x: A data frame or matrix

Example:

nrow(mtcars)
[1] 32
ncol(mtcars)
[1] 11

D.2.5 colnames()

Return or set the column names of a data frame.

Arguments:

  • x: A data frame or matrix

Example:

colnames(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

D.2.6 class() and typeof()

Return the class or underlying type of an object.

Arguments:

  • x: Any R object

Example:

class(mtcars$mpg)
[1] "numeric"
typeof(mtcars$mpg)
[1] "double"

D.3 Descriptive Statistics Functions

D.3.1 mean()

Calculate the arithmetic mean.

Arguments:

  • x: A numeric vector
  • na.rm: Remove missing values? (default: FALSE)

Example:

x <- c(10, 20, 30, NA, 50)
mean(x, na.rm = TRUE)
[1] 27.5

D.3.2 sd()

Calculate the standard deviation.

Arguments:

  • x: A numeric vector
  • na.rm: Remove missing values? (default: FALSE)

Example:

sd(mtcars$mpg)
[1] 6.026948

D.3.3 median()

Calculate the median (middle value).

Arguments:

  • x: A numeric vector
  • na.rm: Remove missing values? (default: FALSE)

Example:

median(mtcars$mpg)
[1] 19.2

D.3.4 min() and max()

Find the minimum or maximum value.

Arguments:

  • ...: Numeric vectors
  • na.rm: Remove missing values? (default: FALSE)

Example:

min(mtcars$mpg)
[1] 10.4
max(mtcars$mpg)
[1] 33.9

D.3.5 sum()

Calculate the sum of all values.

Arguments:

  • ...: Numeric vectors
  • na.rm: Remove missing values? (default: FALSE)

Example:

sum(mtcars$mpg)
[1] 642.9

D.3.6 var()

Calculate the variance.

Arguments:

  • x: A numeric vector
  • na.rm: Remove missing values? (default: FALSE)

Example:

var(mtcars$mpg)
[1] 36.3241

D.3.7 cor()

Calculate the correlation between two variables.

Arguments:

  • x, y: Numeric vectors
  • use: How to handle missing values (e.g., "complete.obs")

Example:

cor(mtcars$mpg, mtcars$hp)
[1] -0.7761684
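
The use argument matters only when there are missing values. A small sketch with made-up vectors (x and y below are hypothetical, not from a dataset in this book):

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, 8, 10)

r_default  <- cor(x, y)                        # NA: one pair is incomplete
r_complete <- cor(x, y, use = "complete.obs")  # drops the incomplete pair first
r_default
r_complete
```

Unlike mean() or sd(), cor() has no na.rm argument, so use = "complete.obs" plays that role.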

D.4 Data Manipulation Functions (dplyr)

D.4.1 select()

Choose which columns to keep.

Arguments:

  • .data: A data frame
  • ...: Column names or selection helpers

Example:

mtcars |>
  select(mpg, cyl, hp) |>
  head(3)
               mpg cyl  hp
Mazda RX4     21.0   6 110
Mazda RX4 Wag 21.0   6 110
Datsun 710    22.8   4  93

D.4.2 filter()

Keep rows that meet a condition.

Arguments:

  • .data: A data frame
  • ...: Logical conditions

Example:

mtcars |>
  filter(mpg > 25) |>
  head(3)
                mpg cyl disp hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1

D.4.3 mutate()

Create new variables or modify existing ones.

Arguments:

  • .data: A data frame
  • ...: Name-value pairs of expressions

Example:

mtcars |>
  mutate(kpl = mpg * 0.425) |>  # Convert to km per liter
  select(mpg, kpl) |>
  head(3)
               mpg   kpl
Mazda RX4     21.0 8.925
Mazda RX4 Wag 21.0 8.925
Datsun 710    22.8 9.690

D.4.4 group_by()

Group data by one or more variables (usually followed by summarize()).

Arguments:

  • .data: A data frame
  • ...: Variables to group by

Example:

mtcars |>
  group_by(cyl) |>
  summarize(avg_mpg = mean(mpg))
# A tibble: 3 × 2
    cyl avg_mpg
  <dbl>   <dbl>
1     4    26.7
2     6    19.7
3     8    15.1

D.4.5 summarize() / summarise()

Calculate summary statistics for each group.

Arguments:

  • .data: A data frame (usually grouped)
  • ...: Name-value pairs of summary functions

Example:

mtcars |>
  group_by(cyl) |>
  summarize(
    avg_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    n = n()
  )
# A tibble: 3 × 4
    cyl avg_mpg sd_mpg     n
  <dbl>   <dbl>  <dbl> <int>
1     4    26.7   4.51    11
2     6    19.7   1.45     7
3     8    15.1   2.56    14

D.4.6 count()

Count observations by group.

Arguments:

  • x: A data frame
  • ...: Variables to count by
  • sort: Sort by count? (default: FALSE)

Example:

mtcars |>
  count(cyl)
  cyl  n
1   4 11
2   6  7
3   8 14

D.4.7 arrange()

Sort rows by one or more variables.

Arguments:

  • .data: A data frame
  • ...: Variables to sort by (use desc() for descending)

Example:

mtcars |>
  arrange(desc(mpg)) |>
  head(3)
                mpg cyl disp hp drat    wt  qsec vs am gear carb
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2

D.4.8 case_when()

Vectorized if-else for creating categorical variables.

Arguments:

  • ...: A sequence of two-sided formulas: condition ~ value

Example:

mtcars |>
  mutate(
    mpg_category = case_when(
      mpg < 15 ~ "Low",
      mpg < 25 ~ "Medium",
      TRUE ~ "High"
    )
  ) |>
  count(mpg_category)
  mpg_category  n
1         High  6
2          Low  5
3       Medium 21

D.4.9 ifelse()

Simple conditional: if condition is TRUE, return one value; otherwise, another.

Arguments:

  • test: A logical condition
  • yes: Value if TRUE
  • no: Value if FALSE

Example:

mtcars |>
  mutate(efficient = ifelse(mpg > 20, 1, 0)) |>
  count(efficient)
  efficient  n
1         0 18
2         1 14

D.4.10 n()

Count the number of observations in the current group (used inside summarize()).

Arguments: None

Example:

mtcars |>
  group_by(cyl) |>
  summarize(count = n())
# A tibble: 3 × 2
    cyl count
  <dbl> <int>
1     4    11
2     6     7
3     8    14

D.5 Regression Functions

D.5.1 lm()

Fit a linear model using Ordinary Least Squares (OLS).

Arguments:

  • formula: A formula like y ~ x1 + x2
  • data: A data frame

Example:

reg <- lm(mpg ~ hp + wt, data = mtcars)
summary(reg)

Call:
lm(formula = mpg ~ hp + wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,    Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

D.5.2 summary() (for regression)

Display detailed regression results including coefficients, standard errors, t-statistics, and p-values.

Arguments:

  • object: A fitted model object

Example:

reg <- lm(mpg ~ hp, data = mtcars)
summary(reg)

D.5.3 coef()

Extract the estimated coefficients from a model.

Arguments:

  • object: A fitted model object

Example:

reg <- lm(mpg ~ hp, data = mtcars)
coef(reg)
(Intercept)          hp 
30.09886054 -0.06822828 

D.5.4 confint()

Calculate confidence intervals for model coefficients.

Arguments:

  • object: A fitted model object
  • level: Confidence level (default: 0.95)

Example:

reg <- lm(mpg ~ hp, data = mtcars)
confint(reg, level = 0.95)
                  2.5 %     97.5 %
(Intercept) 26.76194879 33.4357723
hp          -0.08889465 -0.0475619

D.5.5 predict()

Generate predicted values from a fitted model.

Arguments:

  • object: A fitted model object
  • newdata: Optional data frame with new predictor values

Example:

reg <- lm(mpg ~ hp, data = mtcars)
# Predict for existing data
head(predict(reg))
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
         22.59375          22.59375          23.75363          22.59375 
Hornet Sportabout           Valiant 
         18.15891          22.93489 
# Predict for new data
predict(reg, newdata = data.frame(hp = c(100, 150, 200)))
       1        2        3 
23.27603 19.86462 16.45320 
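
To see what predict() does for a simple regression, the sketch below rebuilds the same predictions from the estimated intercept and slope. The names b, new_hp, by_hand, and by_predict are introduced here only for illustration.

```r
reg <- lm(mpg ~ hp, data = mtcars)
b <- coef(reg)

new_hp <- c(100, 150, 200)

# Prediction = intercept + slope * hp
by_hand    <- unname(b["(Intercept)"] + b["hp"] * new_hp)
by_predict <- unname(predict(reg, newdata = data.frame(hp = new_hp)))

all.equal(by_hand, by_predict)
```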

D.5.6 residuals()

Extract residuals (prediction errors) from a fitted model.

Arguments:

  • object: A fitted model object

Example:

reg <- lm(mpg ~ hp, data = mtcars)
head(residuals(reg))
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
       -1.5937500        -1.5937500        -0.9536307        -1.1937500 
Hornet Sportabout           Valiant 
        0.5410881        -4.8348913 
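
A residual is the observed value minus the fitted value. The following sketch checks this identity using fitted() from base R, which is not covered elsewhere in this appendix:

```r
reg <- lm(mpg ~ hp, data = mtcars)

# Observed outcome minus the model's fitted value
manual_resid <- mtcars$mpg - fitted(reg)
all.equal(unname(manual_resid), unname(residuals(reg)))

# With an intercept in the model, OLS residuals sum to (numerically) zero
resid_sum <- sum(residuals(reg))
resid_sum
```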

D.5.7 feols() (fixest package)

Fit linear models with fixed effects and robust standard errors.

Arguments:

  • fml: A formula (use | for fixed effects)
  • data: A data frame
  • vcov: Type of standard errors (e.g., "HC1" for heteroskedasticity-robust)

Example:

library(fixest)
reg <- feols(mpg ~ hp + wt, data = mtcars, vcov = "HC1")
summary(reg)
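
The example above has no fixed effects. As a hedged sketch of the | syntax, the model below treats cyl as a fixed effect; using cyl this way is an illustrative assumption, not a recommendation for this dataset. The slope estimates should match lm() with explicit dummy variables.

```r
library(fixest)

# Variables after | are absorbed as fixed effects (one intercept per cyl value)
reg_fe <- feols(mpg ~ hp + wt | cyl, data = mtcars, vcov = "HC1")

# Same slopes from lm() with one dummy per cylinder count
reg_dummy <- lm(mpg ~ hp + wt + factor(cyl), data = mtcars)

coef(reg_fe)
coef(reg_dummy)[c("hp", "wt")]
```

feols() does not report the fixed-effect intercepts themselves, which keeps the output short when there are many groups.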

D.5.8 modelsummary() (modelsummary package)

Create publication-quality regression tables.

Arguments:

  • models: A model or list of models
  • stars: Show significance stars?
  • gof_map: Which goodness-of-fit statistics to show

Example:

library(modelsummary)
reg1 <- lm(mpg ~ hp, data = mtcars)
reg2 <- lm(mpg ~ hp + wt, data = mtcars)
modelsummary(list(reg1, reg2), stars = TRUE)

D.6 Visualization Functions (ggplot2)

D.6.1 ggplot()

Initialize a ggplot object.

Arguments:

  • data: A data frame
  • mapping: Aesthetic mappings created by aes()

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()


D.6.2 aes()

Define aesthetic mappings (which variables map to x, y, color, etc.).

Arguments:

  • x, y: Variables for axes
  • color, fill: Variables for color
  • size, shape: Variables for size and shape

Example:

ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point()


D.6.3 geom_point()

Add points to a plot (scatterplot).

Arguments:

  • size: Point size
  • color: Point color
  • alpha: Transparency (0 to 1)

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(size = 3, color = "steelblue", alpha = 0.7)


D.6.4 geom_line()

Add lines connecting points.

Arguments:

  • linewidth: Line thickness
  • color: Line color
  • linetype: Line type (e.g., "dashed")

Example:

# Line plot (useful for time series)
df <- data.frame(x = 1:10, y = cumsum(rnorm(10)))
ggplot(df, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1)


D.6.5 geom_histogram()

Create a histogram.

Arguments:

  • binwidth: Width of each bin
  • bins: Number of bins
  • fill: Bar fill color
  • color: Bar outline color

Example:

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 3, fill = "steelblue", color = "white")


D.6.6 geom_boxplot()

Create a boxplot.

Arguments:

  • fill: Box fill color
  • outlier.color: Color of outlier points

Example:

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightblue")


D.6.7 geom_bar() and geom_col()

Create bar charts. geom_bar() counts observations; geom_col() uses values directly.

Arguments:

  • fill: Bar fill color
  • stat: For geom_bar(), use "identity" to plot values directly

Example:

# geom_bar counts automatically
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "steelblue")
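
For comparison, here is a sketch of geom_col(), which needs a y value that has already been computed. The summary table avg_by_cyl is a name introduced only for this example.

```r
library(dplyr)
library(ggplot2)

# Compute one value per group first
avg_by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(avg_mpg = mean(mpg))

# geom_col plots those values as bar heights
p <- ggplot(avg_by_cyl, aes(x = factor(cyl), y = avg_mpg)) +
  geom_col(fill = "steelblue")
p
```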


D.6.8 geom_smooth()

Add a smoothed conditional mean (often a regression line).

Arguments:

  • method: Smoothing method (e.g., "lm" for linear)
  • se: Show confidence interval? (default: TRUE)
  • color: Line color

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red")
`geom_smooth()` using formula = 'y ~ x'


D.6.9 geom_hline() and geom_vline()

Add horizontal or vertical reference lines.

Arguments:

  • yintercept / xintercept: Where to draw the line
  • linetype: Line type (e.g., "dashed")
  • color: Line color

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_hline(yintercept = mean(mtcars$mpg), linetype = "dashed", color = "red")


D.6.10 labs()

Add labels to the plot (title, axis labels, etc.).

Arguments:

  • title: Plot title
  • x, y: Axis labels
  • color, fill: Legend titles

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(
    title = "Fuel Efficiency vs. Horsepower",
    x = "Horsepower",
    y = "Miles per Gallon"
  )


D.6.11 facet_wrap()

Create small multiples (separate panels for each level of a variable).

Arguments:

  • facets: A formula like ~ variable
  • nrow, ncol: Number of rows or columns
  • scales: Should scales be fixed or free?

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, nrow = 1)


D.6.12 theme_minimal()

Apply a clean, minimal theme to the plot.

Arguments: None required; base_size optionally sets the base font size (see theme() for further customization)

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  theme_minimal()


D.7 Utility Functions

D.7.1 c()

Combine values into a vector.

Arguments:

  • ...: Values to combine

Example:

x <- c(1, 2, 3, 4, 5)
x
[1] 1 2 3 4 5

D.7.2 seq()

Generate a sequence of numbers.

Arguments:

  • from: Starting value
  • to: Ending value
  • by: Increment
  • length.out: Desired length of sequence

Example:

seq(from = 0, to = 10, by = 2)
[1]  0  2  4  6  8 10
seq(from = 0, to = 1, length.out = 5)
[1] 0.00 0.25 0.50 0.75 1.00

D.7.3 rep()

Repeat values.

Arguments:

  • x: Value(s) to repeat
  • times: Number of times to repeat

Example:

rep(c("A", "B"), times = 3)
[1] "A" "B" "A" "B" "A" "B"

D.7.4 sample()

Take a random sample.

Arguments:

  • x: Vector to sample from
  • size: Number of items to sample
  • replace: Sample with replacement?

Example:

set.seed(123)
sample(1:10, size = 5, replace = FALSE)
[1]  3 10  2  8  6

D.7.5 set.seed()

Set the random number seed for reproducibility.

Arguments:

  • seed: An integer

Example:

set.seed(42)
rnorm(3)  # Will always produce the same values
[1]  1.3709584 -0.5646982  0.3631284
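
A short sketch of what "reproducible" means here: resetting the same seed replays the same draws, while continuing without a reset does not. The variable names are illustrative.

```r
set.seed(42)
first_draw <- rnorm(3)

set.seed(42)             # reset to the same seed
second_draw <- rnorm(3)

identical(first_draw, second_draw)  # same seed, same draws

third_draw <- rnorm(3)   # no reset: the generator has moved on
identical(first_draw, third_draw)
```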

D.7.6 rnorm()

Generate random numbers from a normal distribution.

Arguments:

  • n: Number of values to generate
  • mean: Mean of the distribution (default: 0)
  • sd: Standard deviation (default: 1)

Example:

set.seed(123)
rnorm(5, mean = 100, sd = 15)
[1]  91.59287  96.54734 123.38062 101.05763 101.93932

D.7.7 factor()

Create a factor (categorical variable).

Arguments:

  • x: A vector
  • levels: The allowed levels (in order)
  • labels: Labels for the levels

Example:

education <- c("HS", "College", "HS", "Graduate")
factor(education, levels = c("HS", "College", "Graduate"))
[1] HS       College  HS       Graduate
Levels: HS College Graduate

D.7.8 as.factor(), as.numeric(), as.character()

Convert objects to a different type.

Arguments:

  • x: Object to convert

Example:

x <- c("1", "2", "3")
as.numeric(x)
[1] 1 2 3
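
One conversion deserves care: calling as.numeric() directly on a factor returns its internal level codes, not the values shown as labels. A minimal sketch with a made-up factor:

```r
f <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(f)                # internal level codes
values <- as.numeric(as.character(f))  # the numbers stored as labels

codes
values
```

Converting through as.character() first is the usual way to recover the original numbers.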

D.7.9 is.na()

Check for missing values.

Arguments:

  • x: A vector or data frame

Example:

x <- c(1, 2, NA, 4)
is.na(x)
[1] FALSE FALSE  TRUE FALSE
sum(is.na(x))  # Count missing values
[1] 1

D.7.10 library()

Load an installed package.

Arguments:

  • package: Name of the package (unquoted)

Example:

library(tidyverse)

D.7.11 install.packages()

Install a package from CRAN (run once in the console; do not leave it in scripts).

Arguments:

  • pkgs: Package name (quoted)

Example:

install.packages("tidyverse")

D.8 The Pipe Operator: |>

The pipe operator takes the output from the left side and passes it as the first argument to the function on the right. This allows you to chain operations together in a readable way.

Example without pipe:

# Nested functions are hard to read
summary(select(filter(mtcars, mpg > 20), mpg, hp))

Example with pipe:

# Piped version is much clearer
mtcars |>
  filter(mpg > 20) |>
  select(mpg, hp) |>
  summary()
      mpg              hp       
 Min.   :21.00   Min.   : 52.0  
 1st Qu.:21.43   1st Qu.: 66.0  
 Median :23.60   Median : 94.0  
 Mean   :25.48   Mean   : 88.5  
 3rd Qu.:29.62   3rd Qu.:109.8  
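 Max.   :33.90   Max.   :113.0  

Keyboard Shortcut

In RStudio, press Cmd/Ctrl + Shift + M to insert a pipe. By default this inserts %>%; to insert the native pipe |> instead, enable "Use native pipe operator" under Tools > Global Options > Code.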


D.9 Statistical Distribution Functions

R has a consistent naming convention for distribution functions. Each distribution has four functions, prefixed by a letter:

  • d = density (the height of the PDF at a given value)
  • p = probability (the CDF—cumulative probability up to a given value)
  • q = quantile (the inverse of the CDF—find the value for a given probability)
  • r = random (generate random draws from the distribution)

For example, for the normal distribution: dnorm(), pnorm(), qnorm(), rnorm().
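
The four prefixes can be sketched together for the standard normal distribution. The value 1.96 and the seed are arbitrary choices for illustration.

```r
# d: height of the standard normal density at its peak
dnorm(0)

# p and q are inverses of each other
prob <- pnorm(1.96)  # P(Z <= 1.96), roughly 0.975
z    <- qnorm(prob)  # recovers 1.96

# r: random draws; their sample mean should be close to the true mean of 0
set.seed(1)
draws <- rnorm(10000)
mean(draws)
```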

D.9.1 dnorm() and dt()

Compute the probability density function (PDF) for the normal or t-distribution. Useful for plotting distribution curves.

Arguments:

  • x: Value(s) at which to evaluate the density
  • mean, sd: Parameters of the normal distribution (for dnorm())
  • df: Degrees of freedom (for dt())

Example:

# Height of the standard normal PDF at x = 0
dnorm(0, mean = 0, sd = 1)
[1] 0.3989423
# Height of the t-distribution PDF at x = 2 with 30 df
dt(2, df = 30)
[1] 0.05685228

D.9.2 pt()

Compute the cumulative distribution function (CDF) for the t-distribution. This gives you the probability that a t-distributed random variable is less than or equal to a given value. Essential for computing p-values.

Arguments:

  • q: The t-value(s) to evaluate
  • df: Degrees of freedom
  • lower.tail: If TRUE (default), returns \(P(T \leq q)\); if FALSE, returns \(P(T > q)\)

Example:

# P-value for a two-sided test with t = 2.3 and 100 df
2 * pt(abs(2.3), df = 100, lower.tail = FALSE)
[1] 0.0235262
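
The same formula reproduces the p-values that summary() reports for a regression. This is a sketch under the usual OLS assumptions; coef_table, t_stats, and p_manual are names introduced here for illustration.

```r
reg <- lm(mpg ~ hp + wt, data = mtcars)
coef_table <- summary(reg)$coefficients

# t-statistic = estimate / standard error
t_stats <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-value using the residual degrees of freedom
p_manual <- 2 * pt(abs(t_stats), df = df.residual(reg), lower.tail = FALSE)

cbind(manual = p_manual, reported = coef_table[, "Pr(>|t|)"])
```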

D.9.3 qt()

Compute the quantile function (inverse CDF) for the t-distribution. Given a probability, it returns the corresponding t-value. Used to find critical values for hypothesis tests.

Arguments:

  • p: Probability (between 0 and 1)
  • df: Degrees of freedom
  • lower.tail: If TRUE (default), finds the value where \(P(T \leq q) = p\); if FALSE, finds the value where \(P(T > q) = p\)

Example:

# Critical value for a two-sided 5% test with 100 df
# We want the value that leaves 2.5% in the upper tail
qt(0.025, df = 100, lower.tail = FALSE)
[1] 1.983972
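
Critical values from qt() are what confint() uses behind the scenes for lm objects. The sketch below rebuilds a 95% confidence interval as estimate ± critical value × standard error; vcov() is a base R function not otherwise covered in this appendix.

```r
reg <- lm(mpg ~ hp, data = mtcars)
est <- coef(reg)
se  <- sqrt(diag(vcov(reg)))  # standard errors of the coefficients

# Value that leaves 2.5% in the upper tail
t_crit <- qt(0.025, df = df.residual(reg), lower.tail = FALSE)

ci_manual <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci_manual
```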

D.9.4 qf()

Compute the quantile function for the F-distribution. Used to find critical values for F-tests.

Arguments:

  • p: Probability
  • df1: Numerator degrees of freedom (number of restrictions)
  • df2: Denominator degrees of freedom (from the unrestricted model)
  • lower.tail: If FALSE, returns the value where \(P(F > q) = p\)

Example:

# Critical value for an F-test with 2 and 494 degrees of freedom at 5%
qf(0.05, df1 = 2, df2 = 494, lower.tail = FALSE)
[1] 3.013973

D.9.5 runif()

Generate random draws from a uniform distribution.

Arguments:

  • n: Number of values to generate
  • min: Minimum value (default: 0)
  • max: Maximum value (default: 1)

Example:

set.seed(123)
runif(5, min = 1, max = 10)
[1] 3.588198 8.094746 4.680792 8.947157 9.464206

D.9.6 rexp()

Generate random draws from an exponential distribution.

Arguments:

  • n: Number of values to generate
  • rate: Rate parameter \(\lambda\) (default: 1). The mean of the distribution is \(1/\lambda\).

Example:

set.seed(123)
rexp(5, rate = 0.5)  # Mean = 1/0.5 = 2
[1] 1.68691452 1.15322054 2.65810974 0.06315472 0.11242195

D.9.7 rbinom()

Generate random draws from a binomial distribution. Useful for simulating binary outcomes (e.g., treatment assignment).

Arguments:

  • n: Number of values to generate
  • size: Number of trials
  • prob: Probability of success on each trial

Example:

set.seed(123)
# Simulate 10 coin flips (1 = heads, 0 = tails)
rbinom(10, size = 1, prob = 0.5)
 [1] 0 1 0 1 1 0 1 1 1 0

D.9.8 rt()

Generate random draws from a t-distribution.

Arguments:

  • n: Number of values to generate
  • df: Degrees of freedom

Example:

set.seed(123)
rt(5, df = 30)
[1] -0.5878234 -1.4779045 -0.1125616 -1.4142351  1.6124113

D.10 Model Diagnostic Functions

D.10.1 anova()

Compare nested models using an F-test. Pass the restricted (smaller) model first, then the unrestricted (larger) model. Tests whether the additional variables in the unrestricted model are jointly significant.

Arguments:

  • object: One or more fitted model objects

Example:

reg_small <- lm(mpg ~ hp, data = mtcars)
reg_large <- lm(mpg ~ hp + wt + cyl, data = mtcars)
anova(reg_small, reg_large)
Analysis of Variance Table

Model 1: mpg ~ hp
Model 2: mpg ~ hp + wt + cyl
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     30 447.67                                  
2     28 176.62  2    271.05 21.485 2.214e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
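
The F-statistic in this table can be rebuilt from the two residual sums of squares. The sketch below uses the standard formula F = ((RSS_r - RSS_u) / q) / (RSS_u / df_u), where q is the number of restrictions; pf() is the F-distribution CDF from base R, and the variable names are illustrative.

```r
reg_small <- lm(mpg ~ hp, data = mtcars)
reg_large <- lm(mpg ~ hp + wt + cyl, data = mtcars)

rss_r   <- sum(resid(reg_small)^2)  # restricted model
rss_u   <- sum(resid(reg_large)^2)  # unrestricted model
df_u    <- df.residual(reg_large)
n_restr <- df.residual(reg_small) - df_u  # number of restrictions

f_stat  <- ((rss_r - rss_u) / n_restr) / (rss_u / df_u)
p_value <- pf(f_stat, df1 = n_restr, df2 = df_u, lower.tail = FALSE)
f_crit  <- qf(0.05, df1 = n_restr, df2 = df_u, lower.tail = FALSE)

c(F = f_stat, p = p_value, critical_5pct = f_crit)
```

Because the F-statistic exceeds the 5% critical value, the added variables are jointly significant at that level, matching the anova() output.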

D.10.2 nobs()

Return the number of observations used to fit a model.

Arguments:

  • object: A fitted model object

Example:

reg <- lm(mpg ~ hp, data = mtcars)
nobs(reg)
[1] 32

D.10.3 df.residual()

Return the residual degrees of freedom (\(n - k - 1\)) from a fitted model.

Arguments:

  • object: A fitted model object

Example:

reg <- lm(mpg ~ hp + wt, data = mtcars)
df.residual(reg)  # 32 - 2 - 1 = 29
[1] 29

D.10.4 resid()

Extract residuals from a fitted model. Equivalent to residuals().

Arguments:

  • object: A fitted model object

Example:

reg <- lm(mpg ~ hp, data = mtcars)
head(resid(reg))
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
       -1.5937500        -1.5937500        -0.9536307        -1.1937500 
Hornet Sportabout           Valiant 
        0.5410881        -4.8348913 

D.11 Additional Data Manipulation Functions

D.11.1 tibble()

Create a data frame (modern version). Works like data.frame() but with better defaults: it never converts strings to factors (data.frame() did so by default before R 4.0) and prints more cleanly.

Arguments:

  • ...: Name-value pairs of columns

Example:

tibble(
  name = c("Alice", "Bob", "Carol"),
  age = c(25, 30, 35),
  income = c(50000, 60000, 70000)
)
# A tibble: 3 × 3
  name    age income
  <chr> <dbl>  <dbl>
1 Alice    25  50000
2 Bob      30  60000
3 Carol    35  70000

D.11.2 bind_rows()

Stack data frames on top of each other (by rows).

Arguments:

  • ...: Data frames to bind together

Example:

df1 <- tibble(x = 1:3, y = c("a", "b", "c"))
df2 <- tibble(x = 4:6, y = c("d", "e", "f"))
bind_rows(df1, df2)
# A tibble: 6 × 2
      x y    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
4     4 d    
5     5 e    
6     6 f    

D.11.3 pivot_longer()

Reshape data from wide format to long format. Useful for making data “tidy” for ggplot2.

Arguments:

  • data: A data frame
  • cols: Columns to pivot into longer format
  • names_to: Name of the new column that will contain the old column names
  • values_to: Name of the new column that will contain the values

Example:

wide_data <- tibble(
  id = 1:3,
  score_2020 = c(80, 90, 85),
  score_2021 = c(85, 92, 88)
)

wide_data |>
  pivot_longer(
    cols = starts_with("score"),
    names_to = "year",
    values_to = "score"
  )
# A tibble: 6 × 3
     id year       score
  <int> <chr>      <dbl>
1     1 score_2020    80
2     1 score_2021    85
3     2 score_2020    90
4     2 score_2021    92
5     3 score_2020    85
6     3 score_2021    88

D.11.4 slice_sample()

Randomly sample rows from a data frame.

Arguments:

  • .data: A data frame
  • n: Number of rows to sample
  • replace: Sample with replacement? (default: FALSE)

Example:

set.seed(123)
mtcars |>
  slice_sample(n = 5)
                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Maserati Bora      15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Honda Civic        30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Merc 450SLC        15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Datsun 710         22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1

D.11.5 map_dfr() (purrr)

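Apply a function to each element of a list or vector and combine the results into a single data frame by binding rows. Useful for running simulations. As of purrr 1.0.0, map_dfr() is superseded in favor of map() followed by list_rbind(), but it continues to work.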

Arguments:

  • .x: A list or vector to iterate over
  • .f: A function to apply to each element

Example:

# Run 5 simulations and collect results (no seed is set, so your numbers will differ)
map_dfr(1:5, function(i) {
  x <- rnorm(50)
  tibble(sim = i, mean_x = mean(x))
})
# A tibble: 5 × 2
    sim    mean_x
  <int>     <dbl>
1     1 -0.0600  
2     2  0.0103  
3     3  0.0978  
4     4 -0.000785
5     5 -0.0664  

D.12 Additional ggplot2 Functions

D.12.1 annotate()

Add text, labels, or shapes to a plot at specific coordinates. Unlike geom_text(), annotate() is for adding single annotations rather than mapping data to text.

Arguments:

  • geom: Type of annotation (e.g., "text", "rect", "segment")
  • x, y: Position of the annotation
  • label: Text to display (for "text" geom)
  • parse: If TRUE, interpret the label as a plotmath expression (default: FALSE)
  • color, size: Styling options

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  annotate("text", x = 300, y = 30,
           label = "hat(beta)[1] == -0.068",
           parse = TRUE, color = "red", size = 5)


D.12.2 stat_function()

Overlay a mathematical function on a ggplot. Useful for plotting theoretical distributions on top of histograms.

Arguments:

  • fun: The function to plot (e.g., dnorm, dt)
  • args: A list of additional arguments to pass to the function
  • color, linewidth: Styling options

Example:

ggplot(data.frame(x = rnorm(500)), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30, fill = "lightblue", color = "black") +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1),
                color = "red", linewidth = 1.2)


D.12.3 geom_segment()

Draw line segments between specified start and end points. Useful for adding arrows, error bars, or connecting points.

Arguments:

  • aes(x, y, xend, yend): Start and end coordinates
  • arrow: Add arrowheads with arrow()
  • color, linewidth, linetype: Styling options

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  # Supply a one-row data frame so the segment is drawn once, not once per row of mtcars
  geom_segment(data = data.frame(x = 150, y = 25, xend = 150, yend = 15),
               aes(x = x, y = y, xend = xend, yend = yend),
               arrow = arrow(length = unit(0.2, "cm")),
               color = "red", linewidth = 1)


D.12.4 geom_ribbon() and geom_area()

Shade a region between a ymin and ymax. geom_area() is a special case where ymin = 0. Useful for shading regions under distribution curves (e.g., p-values, rejection regions).

Arguments:

  • aes(x, ymin, ymax): Boundaries of the shaded region
  • fill: Fill color
  • alpha: Transparency

Example:

shade_data <- tibble(
  x = seq(-3, 3, length.out = 200),
  y = dnorm(x)
)

ggplot(shade_data, aes(x = x)) +
  geom_line(aes(y = y)) +
  geom_ribbon(data = shade_data |> filter(x >= 1.96),
              aes(ymin = 0, ymax = y),
              fill = "red", alpha = 0.5)


D.12.5 geom_errorbar()

Add error bars to a plot. Can be vertical (default) or horizontal (with orientation = "y"). Commonly used for confidence interval plots.

Arguments:

  • aes(ymin, ymax): Lower and upper bounds (vertical), or aes(xmin, xmax) with orientation = "y" (horizontal)
  • width: Width of the error bar caps
  • orientation: Set to "y" for horizontal error bars

Example:

ci_data <- tibble(
  variable = c("hp", "wt", "cyl"),
  estimate = c(-0.03, -3.8, -1.5),
  lower = c(-0.05, -5.1, -2.8),
  upper = c(-0.01, -2.5, -0.2)
)

ggplot(ci_data, aes(y = variable, x = estimate)) +
  geom_point(size = 3) +
  geom_errorbar(aes(xmin = lower, xmax = upper),
                width = 0.2, orientation = "y") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_minimal()


D.12.6 geom_density()

Plot a smoothed density estimate. An alternative to histograms for visualizing distributions.

Arguments:

  • fill: Fill color
  • alpha: Transparency
  • color: Line color

Example:

ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "steelblue", alpha = 0.5)


D.12.7 scale_color_manual() and scale_fill_manual()

Manually set colors for categorical variables mapped to color or fill.

Arguments:

  • values: A vector of colors, optionally named by factor level
  • labels: Optional labels for the legend

Example:

ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  scale_color_manual(values = c("4" = "steelblue", "6" = "orange", "8" = "red"))


D.12.8 coord_cartesian()

Zoom into a region of the plot without dropping data points (unlike xlim()/ylim(), which remove data outside the range).

Arguments:

  • xlim: Range for x-axis as c(min, max)
  • ylim: Range for y-axis as c(min, max)

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  coord_cartesian(xlim = c(50, 200), ylim = c(15, 35))


D.13 Quick Reference Tables

D.13.1 Descriptive Statistics Functions

Table D.1: Descriptive statistics functions

Function        Purpose              Key Argument
mean()          Average              na.rm = TRUE
sd()            Standard deviation   na.rm = TRUE
median()        Middle value         na.rm = TRUE
min(), max()    Extremes             na.rm = TRUE
sum()           Total                na.rm = TRUE
var()           Variance             na.rm = TRUE
cor()           Correlation          use = "complete.obs"
summary()       Multiple stats       (none)

D.13.2 Data Manipulation Functions (dplyr)

Table D.2: Data manipulation functions

Function      Purpose                   Example
select()      Choose columns            select(data, col1, col2)
filter()      Choose rows               filter(data, x > 5)
mutate()      Create/modify variables   mutate(data, new = x * 2)
group_by()    Group data                group_by(data, category)
summarize()   Aggregate                 summarize(data, avg = mean(x))
count()       Count rows                count(data, category)
arrange()     Sort rows                 arrange(data, desc(x))

D.13.3 Regression Functions

Table D.3: Regression functions

Function                Purpose                       Package
lm()                    OLS regression                base R
summary()               Model results                 base R
coef()                  Coefficients                  base R
confint()               Confidence intervals          base R
predict()               Fitted values                 base R
residuals() / resid()   Residuals                     base R
anova()                 F-test (compare models)       base R
nobs()                  Number of observations        base R
df.residual()           Residual degrees of freedom   base R
feols()                 Fixed effects + robust SE     fixest
modelsummary()          Formatted tables              modelsummary

D.13.4 Statistical Distribution Functions

Table D.4: Statistical distribution functions

Function        Purpose                     Example
dnorm(), dt()   Density (PDF height)        dnorm(0), dt(2, df=30)
pt()            CDF for t-distribution      pt(2.3, df=100)
qt()            Critical values (t-dist)    qt(0.975, df=100)
qf()            Critical values (F-dist)    qf(0.95, df1=2, df2=494)
rnorm()         Random normal draws         rnorm(100, mean=0, sd=1)
runif()         Random uniform draws        runif(100, min=0, max=1)
rbinom()        Random binomial draws       rbinom(100, size=1, prob=0.5)
rexp()          Random exponential draws    rexp(100, rate=0.5)
rt()            Random t-dist draws         rt(100, df=30)

D.13.5 ggplot2 Geometries

Table D.5: ggplot2 geometries

Function                      Plot Type
geom_point()                  Scatterplot
geom_line()                   Line plot
geom_histogram()              Histogram
geom_density()                Smoothed density
geom_boxplot()                Boxplot
geom_bar()                    Bar chart (counts)
geom_col()                    Bar chart (values)
geom_smooth()                 Smoothed line/regression
geom_segment()                Line segments/arrows
geom_ribbon() / geom_area()   Shaded regions
geom_errorbar()               Error bars / CIs
geom_hline()                  Horizontal line
geom_vline()                  Vertical line
annotate()                    Text/shape annotations
stat_function()               Overlay math functions