Appendix D — R Functions and Packages Reference

This appendix is a quick reference for the R functions and packages used throughout this book. Use it when you need to recall the syntax or arguments of a particular function.

D.1 Packages Used in This Book

D.1.1 tidyverse

The tidyverse is a collection of R packages designed for data science. Loading tidyverse attaches its core packages, including ggplot2, dplyr, tidyr, and readr. Other packages installed with it, such as readxl and haven, must still be loaded separately with library().

What it’s used for:

  • Data manipulation and transformation (dplyr)
  • Data visualization (ggplot2)
  • Reshaping data (tidyr)
  • Reading data files (readr)

Loading:

library(tidyverse)

D.1.2 modelsummary

The modelsummary package creates publication-quality tables for regression results and descriptive statistics.

What it’s used for:

  • Creating formatted regression tables comparing multiple models
  • Generating descriptive statistics tables
  • Exporting tables to Word, LaTeX, or HTML

Loading:

library(modelsummary)

D.1.3 palmerpenguins

A dataset package containing measurements of penguins from Palmer Station, Antarctica. Great for learning data visualization and exploration.

What it’s used for:

  • Practice dataset for learning R
  • Data visualization examples
  • Regression examples

Loading:

library(palmerpenguins)
data(penguins)

D.1.4 wooldridge

Contains all datasets from Wooldridge’s Introductory Econometrics textbook.

What it’s used for:

  • Real-world econometric datasets
  • Regression examples (wages, crime, etc.)

Loading:

library(wooldridge)
data(wage1)  # Example: load the wage1 dataset

D.1.5 fixest

A fast and powerful package for fixed effects and panel data regression.

What it’s used for:

  • Panel data regression with fixed effects
  • Clustered standard errors
  • Very large datasets (faster than lm())

Loading:

library(fixest)

D.1.6 lmtest

Provides diagnostic tests for linear regression models.

What it’s used for:

  • Breusch-Pagan test for heteroskedasticity
  • Other specification tests

Loading:

library(lmtest)

D.1.7 sandwich

Provides robust covariance matrix estimators (heteroskedasticity-consistent standard errors).

What it’s used for:

  • Robust standard errors
  • Heteroskedasticity-consistent (HC) standard errors

Loading:

library(sandwich)

D.1.8 patchwork

Easily combine multiple ggplot2 plots into a single figure.

What it’s used for:

  • Arranging multiple plots side by side
  • Creating multi-panel figures

Loading:

library(patchwork)

D.2 Data Import and Export Functions

D.2.1 read_csv() (readr)

Read a comma-separated values (CSV) file into a tibble.

Arguments:

  • file: Path to the CSV file
  • col_types: Optional column type specification
  • skip: Number of rows to skip before reading
  • na: Character vector of strings to interpret as NA

Example:

library(tidyverse)
df <- read_csv("my_data.csv")
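The optional arguments are easiest to see on a small file. The sketch below writes a throwaway CSV to a temporary path (the file contents and column names are invented for illustration), then uses skip, na, and col_types together.

```r
library(readr)

# Hypothetical file: one junk line, then a header and three data rows
path <- tempfile(fileext = ".csv")
writeLines(
  c("exported by survey tool", "id,score", "1,90", "2,-99", "3,missing"),
  path
)

df <- read_csv(
  path,
  skip = 1,                   # ignore the first line of the file
  na = c("-99", "missing"),   # treat these strings as NA
  col_types = cols(id = col_integer(), score = col_double())
)
df
```

Note that supplying na replaces the default set of missing-value strings rather than adding to it.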

D.2.2 read_excel() (readxl)

Read an Excel file (.xlsx or .xls) into a tibble.

Arguments:

  • path: Path to the Excel file
  • sheet: Sheet to read (name or number)
  • skip: Number of rows to skip

Example:

library(readxl)
df <- read_excel("my_data.xlsx", sheet = 1)

D.2.3 read_dta() (haven)

Read a Stata .dta file into a tibble.

Arguments:

  • file: Path to the .dta file

Example:

library(haven)
df <- read_dta("my_data.dta")

D.2.4 read_rds() (readr)

Read an R data file (.rds) into R. Preserves all R data types.

Arguments:

  • file: Path to the .rds file

Example:

df <- read_rds("my_data.rds")

D.2.5 write_csv() (readr)

Write a data frame to a CSV file.

Arguments:

  • x: A data frame
  • file: Path for the output file

Example:

write_csv(df, "clean_data.csv")

D.2.6 write_rds() (readr)

Save an R object to an .rds file.

Arguments:

  • x: An R object
  • file: Path for the output file

Example:

write_rds(df, "clean_data.rds")
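To see what "preserves all R data types" means in practice, here is a minimal sketch that saves the same small data frame as .rds and as .csv and reads both back. The data frame and the temporary file paths are made up for illustration.

```r
library(readr)

df <- data.frame(
  id = 1:3,
  group = factor(c("low", "high", "low"), levels = c("low", "high"))
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

write_rds(df, rds_path)
write_csv(df, csv_path)

from_rds <- read_rds(rds_path)
from_csv <- read_csv(csv_path, show_col_types = FALSE)

class(from_rds$group)  # "factor": the levels and their order survive
class(from_csv$group)  # "character": a CSV file stores only text
```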

D.2.7 write_dta() (haven)

Write a data frame to a Stata .dta file.

Arguments:

  • data: A data frame
  • path: Path for the output file

Example:

library(haven)
write_dta(df, "clean_data.dta")

D.3 Data Inspection Functions

D.3.2 glimpse()

Get a compact overview of a data frame’s structure and column types.

Arguments:

  • x: A data frame

Example:

glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

D.3.3 summary()

Generate summary statistics for each variable.

Arguments:

  • object: A data frame, vector, or model object

Example:

summary(mtcars$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 

D.3.4 nrow() and ncol()

Return the number of rows or columns in a data frame.

Arguments:

  • x: A data frame or matrix

Example:

nrow(mtcars)
[1] 32
ncol(mtcars)
[1] 11

D.3.5 colnames()

Return or set the column names of a data frame.

Arguments:

  • x: A data frame or matrix

Example:

colnames(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

D.3.6 class() and typeof()

Return the class or underlying type of an object.

Arguments:

  • x: Any R object

Example:

class(mtcars$mpg)
[1] "numeric"
typeof(mtcars$mpg)
[1] "double"

D.4 Descriptive Statistics Functions

D.4.1 mean()

Calculate the arithmetic mean.

Arguments:

  • x: A numeric vector
  • na.rm: Remove missing values? (default: FALSE)

Example:

x <- c(10, 20, 30, NA, 50)
mean(x, na.rm = TRUE)
[1] 27.5
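Because na.rm defaults to FALSE, a single missing value makes the result NA. The short sketch below contrasts the two settings on the same vector; the same pattern applies to sd(), median(), min(), max(), sum(), and var().

```r
x <- c(10, 20, 30, NA, 50)

default_mean <- mean(x)                # NA, because na.rm defaults to FALSE
clean_mean   <- mean(x, na.rm = TRUE)  # 27.5, computed from the 4 non-missing values

default_mean
clean_mean
sum(is.na(x))  # how many values were dropped: 1
```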

D.4.2 sd()

Calculate the standard deviation.

Arguments:

  • x: A numeric vector
  • na.rm: Remove missing values? (default: FALSE)

Example:

sd(mtcars$mpg)
[1] 6.026948

D.4.3 median()

Calculate the median (middle value).

Arguments:

  • x: A numeric vector
  • na.rm: Remove missing values? (default: FALSE)

Example:

median(mtcars$mpg)
[1] 19.2

D.4.4 min() and max()

Find the minimum or maximum value.

Arguments:

  • ...: Numeric vectors
  • na.rm: Remove missing values? (default: FALSE)

Example:

min(mtcars$mpg)
[1] 10.4
max(mtcars$mpg)
[1] 33.9

D.4.5 sum()

Calculate the sum of all values.

Arguments:

  • ...: Numeric vectors
  • na.rm: Remove missing values? (default: FALSE)

Example:

sum(mtcars$mpg)
[1] 642.9

D.4.6 var()

Calculate the variance.

Arguments:

  • x: A numeric vector
  • na.rm: Remove missing values? (default: FALSE)

Example:

var(mtcars$mpg)
[1] 36.3241

D.4.7 cor()

Calculate the correlation between two variables.

Arguments:

  • x, y: Numeric vectors
  • use: How to handle missing values (e.g., "complete.obs")

Example:

cor(mtcars$mpg, mtcars$hp)
[1] -0.7761684
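The use argument matters as soon as either vector contains NA. A small sketch with made-up vectors:

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, 8, 11)

r_default  <- cor(x, y)                        # NA: one pair is incomplete
r_complete <- cor(x, y, use = "complete.obs")  # uses only the 4 complete pairs

r_default
r_complete
```

Here the four complete pairs lie exactly on a line, so the second call returns a correlation of 1.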

D.5 Data Manipulation Functions (dplyr)

D.5.1 select()

Choose which columns to keep.

Arguments:

  • .data: A data frame
  • ...: Column names or selection helpers

Example:

mtcars |>
  select(mpg, cyl, hp) |>
  head(3)
               mpg cyl  hp
Mazda RX4     21.0   6 110
Mazda RX4 Wag 21.0   6 110
Datsun 710    22.8   4  93

D.5.2 filter()

Keep rows that meet a condition.

Arguments:

  • .data: A data frame
  • ...: Logical conditions

Example:

mtcars |>
  filter(mpg > 25) |>
  head(3)
                mpg cyl disp hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1

D.5.3 mutate()

Create new variables or modify existing ones.

Arguments:

  • .data: A data frame
  • ...: Name-value pairs of expressions

Example:

mtcars |>
  mutate(kpl = mpg * 0.425) |>  # Convert to km per liter
  select(mpg, kpl) |>
  head(3)
               mpg   kpl
Mazda RX4     21.0 8.925
Mazda RX4 Wag 21.0 8.925
Datsun 710    22.8 9.690

D.5.4 group_by()

Group data by one or more variables (usually followed by summarize()).

Arguments:

  • .data: A data frame
  • ...: Variables to group by

Example:

mtcars |>
  group_by(cyl) |>
  summarize(avg_mpg = mean(mpg))
# A tibble: 3 × 2
    cyl avg_mpg
  <dbl>   <dbl>
1     4    26.7
2     6    19.7
3     8    15.1

D.5.5 summarize() / summarise()

Calculate summary statistics for each group.

Arguments:

  • .data: A data frame (usually grouped)
  • ...: Name-value pairs of summary functions

Example:

mtcars |>
  group_by(cyl) |>
  summarize(
    avg_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    n = n()
  )
# A tibble: 3 × 4
    cyl avg_mpg sd_mpg     n
  <dbl>   <dbl>  <dbl> <int>
1     4    26.7   4.51    11
2     6    19.7   1.45     7
3     8    15.1   2.56    14

D.5.6 count()

Count observations by group.

Arguments:

  • x: A data frame
  • ...: Variables to count by
  • sort: Sort by count? (default: FALSE)

Example:

mtcars |>
  count(cyl)
  cyl  n
1   4 11
2   6  7
3   8 14

D.5.7 arrange()

Sort rows by one or more variables.

Arguments:

  • .data: A data frame
  • ...: Variables to sort by (use desc() for descending)

Example:

mtcars |>
  arrange(desc(mpg)) |>
  head(3)
                mpg cyl disp hp drat    wt  qsec vs am gear carb
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2

D.5.8 case_when()

Vectorized if-else for creating categorical variables.

Arguments:

  • ...: A sequence of two-sided formulas: condition ~ value

Example:

mtcars |>
  mutate(
    mpg_category = case_when(
      mpg < 15 ~ "Low",
      mpg < 25 ~ "Medium",
      TRUE ~ "High"
    )
  ) |>
  count(mpg_category)
  mpg_category  n
1         High  6
2          Low  5
3       Medium 21
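The order of the conditions matters: case_when() checks them from top to bottom and uses the first one that is TRUE. The sketch below, using a made-up vector, shows how swapping two conditions changes the result.

```r
library(dplyr)

mpg <- c(12, 18, 30)

# Most specific condition first
ordered <- case_when(
  mpg < 15 ~ "Low",
  mpg < 25 ~ "Medium",
  TRUE ~ "High"
)
ordered
# "Low" "Medium" "High"

# Broader condition first: 12 already matches mpg < 25, so "Low" is never reached
swapped <- case_when(
  mpg < 25 ~ "Medium",
  mpg < 15 ~ "Low",
  TRUE ~ "High"
)
swapped
# "Medium" "Medium" "High"
```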

D.5.9 ifelse()

Simple conditional: if condition is TRUE, return one value; otherwise, another.

Arguments:

  • test: A logical condition
  • yes: Value if TRUE
  • no: Value if FALSE

Example:

mtcars |>
  mutate(efficient = ifelse(mpg > 20, 1, 0)) |>
  count(efficient)
  efficient  n
1         0 18
2         1 14

D.5.10 n()

Count the number of observations in the current group (used inside summarize()).

Arguments: None

Example:

mtcars |>
  group_by(cyl) |>
  summarize(count = n())
# A tibble: 3 × 2
    cyl count
  <dbl> <int>
1     4    11
2     6     7
3     8    14

D.5.11 rename()

Rename columns in a data frame.

Arguments:

  • .data: A data frame
  • ...: New name = old name pairs

Example:

mtcars |>
  rename(miles_per_gallon = mpg, horsepower = hp) |>
  head(3)
              miles_per_gallon cyl disp horsepower drat    wt  qsec vs am gear
Mazda RX4                 21.0   6  160        110 3.90 2.620 16.46  0  1    4
Mazda RX4 Wag             21.0   6  160        110 3.90 2.875 17.02  0  1    4
Datsun 710                22.8   4  108         93 3.85 2.320 18.61  1  1    4
              carb
Mazda RX4        4
Mazda RX4 Wag    4
Datsun 710       1

D.5.12 slice()

Select rows by position.

Arguments:

  • .data: A data frame
  • ...: Integer row positions

Example:

mtcars |>
  slice(1:5)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

D.5.13 across()

Apply a function to multiple columns at once (used inside mutate() or summarize()).

Arguments:

  • .cols: Columns to transform (use tidy-select helpers like where(), starts_with())
  • .fns: Function(s) to apply

Example:

mtcars |>
  summarize(across(c(mpg, hp, wt), mean))
       mpg       hp      wt
1 20.09062 146.6875 3.21725
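The .fns argument also accepts a named list of functions. As a brief sketch, each column is then summarized by every function, and the output columns are named column_function:

```r
library(dplyr)

stats <- mtcars |>
  summarize(across(c(mpg, hp), list(mean = mean, sd = sd)))

names(stats)
# "mpg_mean" "mpg_sd" "hp_mean" "hp_sd"
```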

D.5.14 starts_with(), ends_with(), contains()

Tidy-select helpers for choosing columns by name patterns. Used inside select(), across(), pivot_longer(), and other tidyverse functions.

Example:

# Select columns whose names start with "d"
mtcars |> select(starts_with("d"))

# Select columns whose names contain "a"
mtcars |> select(contains("a"))

D.5.15 where()

A tidy-select helper that selects columns based on a function that returns TRUE or FALSE. Commonly used with across().

Example:

# Round all numeric columns to 1 decimal place
mtcars |>
  mutate(across(where(is.numeric), \(x) round(x, 1))) |>
  head(3)
               mpg cyl disp  hp drat  wt qsec vs am gear carb
Mazda RX4     21.0   6  160 110  3.9 2.6 16.5  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110  3.9 2.9 17.0  0  1    4    4
Datsun 710    22.8   4  108  93  3.9 2.3 18.6  1  1    4    1

D.5.16 everything()

A tidy-select helper that selects all columns. Useful for reordering columns.

Example:

# Move "am" to the front, keep everything else after it
mtcars |> select(am, everything())

D.5.17 ungroup()

Remove grouping from a grouped data frame.

Arguments:

  • x: A grouped data frame

Example:

mtcars |>
  group_by(cyl) |>
  mutate(avg_mpg = mean(mpg)) |>
  ungroup() |>
  head(3)
# A tibble: 3 × 12
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb avg_mpg
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4    19.7
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4    19.7
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1    26.7

D.6 Join Functions (dplyr)

D.6.1 left_join()

Merge two data frames, keeping all rows from the left (first) dataset and adding matching columns from the right (second) dataset. Non-matching rows get NA.

Arguments:

  • x: Left data frame (all rows kept)
  • y: Right data frame
  • by: Column(s) to match on

Example:

students <- tibble(id = 1:4, name = c("Alice", "Bob", "Carol", "Dan"))
grades   <- tibble(id = c(1, 2, 3, 5), grade = c("A", "B", "A", "C"))

left_join(students, grades, by = "id")
# A tibble: 4 × 3
     id name  grade
  <dbl> <chr> <chr>
1     1 Alice A    
2     2 Bob   B    
3     3 Carol A    
4     4 Dan   <NA> 
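The key columns do not need to share a name. In the sketch below the right-hand table calls its key student_id (a name invented for this illustration), and by is given as a named vector of the form c("left name" = "right name").

```r
library(dplyr)

students <- tibble(id = 1:4, name = c("Alice", "Bob", "Carol", "Dan"))
grades   <- tibble(student_id = c(1, 2, 3, 5), grade = c("A", "B", "A", "C"))

# Match students$id against grades$student_id
joined <- left_join(students, grades, by = c("id" = "student_id"))
joined
```

The result keeps the key under its left-hand name (id), and Dan again gets NA for grade.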

D.6.2 right_join()

Merge two data frames, keeping all rows from the right (second) dataset.

Arguments:

  • x: Left data frame
  • y: Right data frame (all rows kept)
  • by: Column(s) to match on

Example:

right_join(students, grades, by = "id")
# A tibble: 4 × 3
     id name  grade
  <dbl> <chr> <chr>
1     1 Alice A    
2     2 Bob   B    
3     3 Carol A    
4     5 <NA>  C    

D.6.3 inner_join()

Merge two data frames, keeping only rows that have matches in both datasets.

Arguments:

  • x, y: Data frames
  • by: Column(s) to match on

Example:

inner_join(students, grades, by = "id")
# A tibble: 3 × 3
     id name  grade
  <dbl> <chr> <chr>
1     1 Alice A    
2     2 Bob   B    
3     3 Carol A    

D.6.4 full_join()

Merge two data frames, keeping all rows from both datasets. Missing values filled with NA.

Arguments:

  • x, y: Data frames
  • by: Column(s) to match on

Example:

full_join(students, grades, by = "id")
# A tibble: 5 × 3
     id name  grade
  <dbl> <chr> <chr>
1     1 Alice A    
2     2 Bob   B    
3     3 Carol A    
4     4 Dan   <NA> 
5     5 <NA>  C    

D.7 Reshaping Functions (tidyr)

D.7.1 pivot_longer()

Reshape data from wide format to long format.

Arguments:

  • data: A data frame
  • cols: Columns to pivot into longer format
  • names_to: Name of the new column for the old column names
  • values_to: Name of the new column for the values

Example:

wide_data <- tibble(
  id = 1:3,
  score_2020 = c(80, 90, 85),
  score_2021 = c(85, 92, 88)
)

wide_data |>
  pivot_longer(
    cols = starts_with("score"),
    names_to = "year",
    values_to = "score"
  )
# A tibble: 6 × 3
     id year       score
  <int> <chr>      <dbl>
1     1 score_2020    80
2     1 score_2021    85
3     2 score_2020    90
4     2 score_2021    92
5     3 score_2020    85
6     3 score_2021    88
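In the output above, the year column still carries the score_ prefix and is stored as text. One possible cleanup, sketched here with the same data, uses two further pivot_longer() arguments, names_prefix and names_transform:

```r
library(tidyr)

wide_data <- data.frame(
  id = 1:3,
  score_2020 = c(80, 90, 85),
  score_2021 = c(85, 92, 88)
)

long <- pivot_longer(
  wide_data,
  cols = starts_with("score"),
  names_to = "year",
  names_prefix = "score_",                    # strip the shared prefix
  names_transform = list(year = as.integer),  # store year as an integer
  values_to = "score"
)
long
```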

D.7.2 pivot_wider()

Reshape data from long format to wide format. The reverse of pivot_longer().

Arguments:

  • data: A data frame
  • names_from: Column whose values become new column names
  • values_from: Column whose values fill the new columns

Example:

long_data <- tibble(
  id = c(1, 1, 2, 2),
  year = c(2020, 2021, 2020, 2021),
  score = c(80, 85, 90, 92)
)

long_data |>
  pivot_wider(
    names_from = year,
    values_from = score
  )
# A tibble: 2 × 3
     id `2020` `2021`
  <dbl>  <dbl>  <dbl>
1     1     80     85
2     2     90     92
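The new columns above are named `2020` and `2021`, which must be wrapped in backticks whenever you refer to them. A minimal sketch of one way around this is the names_prefix argument, which prepends text to each new column name:

```r
library(tidyr)

long_data <- data.frame(
  id = c(1, 1, 2, 2),
  year = c(2020, 2021, 2020, 2021),
  score = c(80, 85, 90, 92)
)

wide <- pivot_wider(
  long_data,
  names_from = year,
  values_from = score,
  names_prefix = "score_"   # prepended to each new column name
)

names(wide)
# "id" "score_2020" "score_2021"
```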

D.8 Regression Functions

D.8.1 lm()

Fit a linear model using Ordinary Least Squares (OLS).

Arguments:

  • formula: A formula like y ~ x1 + x2
  • data: A data frame

Example:

reg <- lm(mpg ~ hp + wt, data = mtcars)
summary(reg)

Call:
lm(formula = mpg ~ hp + wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,    Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

D.8.2 summary() (for regression)

Display detailed regression results including coefficients, standard errors, t-statistics, and p-values.

Arguments:

  • object: A fitted model object

Example:

reg <- lm(mpg ~ hp, data = mtcars)
summary(reg)

D.8.3 coef()

Extract the estimated coefficients from a model.

Arguments:

  • object: A fitted model object

Example:

reg <- lm(mpg ~ hp, data = mtcars)
coef(reg)
(Intercept)          hp 
30.09886054 -0.06822828 

D.8.4 confint()

Calculate confidence intervals for model coefficients.

Arguments:

  • object: A fitted model object
  • level: Confidence level (default: 0.95)

Example:

reg <- lm(mpg ~ hp, data = mtcars)
confint(reg, level = 0.95)
                  2.5 %     97.5 %
(Intercept) 26.76194879 33.4357723
hp          -0.08889465 -0.0475619

D.8.5 predict()

Generate predicted values from a fitted model.

Arguments:

  • object: A fitted model object
  • newdata: Optional data frame with new predictor values

Example:

reg <- lm(mpg ~ hp, data = mtcars)
# Predict for existing data
head(predict(reg))
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
         22.59375          22.59375          23.75363          22.59375 
Hornet Sportabout           Valiant 
         18.15891          22.93489 
# Predict for new data
predict(reg, newdata = data.frame(hp = c(100, 150, 200)))
       1        2        3 
23.27603 19.86462 16.45320 
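For lm() models, predict() can also return interval estimates through its interval argument. The sketch below requests both kinds for the same three horsepower values; the object names are arbitrary.

```r
reg <- lm(mpg ~ hp, data = mtcars)
new_cars <- data.frame(hp = c(100, 150, 200))

# Each call returns a matrix with columns fit, lwr, upr (95% level by default)
conf_int <- predict(reg, newdata = new_cars, interval = "confidence")
pred_int <- predict(reg, newdata = new_cars, interval = "prediction")

conf_int
pred_int
```

The confidence interval describes uncertainty about the average mpg at each hp value. The prediction interval describes uncertainty about a single car, so it is wider.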

D.8.6 residuals()

Extract residuals (prediction errors) from a fitted model.

Arguments:

  • object: A fitted model object

Example:

reg <- lm(mpg ~ hp, data = mtcars)
head(residuals(reg))
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
       -1.5937500        -1.5937500        -0.9536307        -1.1937500 
Hornet Sportabout           Valiant 
        0.5410881        -4.8348913 

D.8.7 feols() (fixest package)

Fit linear models with fixed effects and robust standard errors.

Arguments:

  • fml: A formula (use | for fixed effects)
  • data: A data frame
  • vcov: Type of standard errors (e.g., "HC1" for robust)

Example:

library(fixest)
reg <- feols(mpg ~ hp + wt, data = mtcars, vcov = "HC1")
summary(reg)
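The example above does not use the | syntax for fixed effects. As an illustrative sketch (treating cyl as the grouping variable is an assumption made only for demonstration), the model below absorbs a separate intercept for each cylinder count and compares the slopes with an equivalent lm() that uses dummy variables.

```r
library(fixest)

# Fixed effects go after the | in the formula
fe  <- feols(mpg ~ hp + wt | cyl, data = mtcars)

# Equivalent dummy-variable regression in base R
ols <- lm(mpg ~ hp + wt + factor(cyl), data = mtcars)

coef(fe)                  # only hp and wt are reported
coef(ols)[c("hp", "wt")]  # the same slopes, alongside explicit cyl dummies
```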

D.8.8 modelsummary() (modelsummary package)

Create publication-quality regression tables.

Arguments:

  • models: A model or list of models
  • stars: Show significance stars?
  • gof_map: Which goodness-of-fit statistics to show

Example:

library(modelsummary)
reg1 <- lm(mpg ~ hp, data = mtcars)
reg2 <- lm(mpg ~ hp + wt, data = mtcars)
modelsummary(list(reg1, reg2), stars = TRUE)

D.8.9 datasummary_skim() (modelsummary package)

Generate a quick descriptive statistics table for all variables in a data frame.

Arguments:

  • data: A data frame
  • type: "numeric" or "categorical" (recent modelsummary versions default to "all", which shows both)

Example:

library(modelsummary)
datasummary_skim(mtcars)

D.8.10 glm()

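Fit a generalized linear model. Used for logistic regression and other non-linear models. In the example below, the warnings and the very large standard errors arise because hp and wt perfectly separate the outcome, so the coefficient estimates are not meaningful.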

Arguments:

  • formula: A formula like y ~ x1 + x2
  • data: A data frame
  • family: Distribution family (e.g., binomial(link = "logit") for logistic regression)

Example:

mtcars$high_mpg <- ifelse(mtcars$mpg > 20, 1, 0)
logit_reg <- glm(high_mpg ~ hp + wt, data = mtcars, family = binomial(link = "logit"))
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit_reg)

Call:
glm(formula = high_mpg ~ hp + wt, family = binomial(link = "logit"), 
    data = mtcars)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    894.228 365884.162   0.002    0.998
hp              -2.021    858.062  -0.002    0.998
wt            -202.865  84688.218  -0.002    0.998

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4.3860e+01  on 31  degrees of freedom
Residual deviance: 1.1156e-08  on 29  degrees of freedom
AIC: 6

Number of Fisher Scoring iterations: 25
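The warnings in the output above mean that model should not be interpreted. For contrast, here is a sketch of a logistic regression on the same dataset that does converge. The choice of am (transmission type) as the outcome and the two weights used for prediction are illustrative assumptions.

```r
# am: 0 = automatic, 1 = manual; wt: weight in 1000 lbs
logit_am <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

logit_am$converged  # TRUE when the fitting algorithm converged

# Predicted probabilities of a manual transmission at two hypothetical weights
probs <- predict(
  logit_am,
  newdata = data.frame(wt = c(2.5, 3.5)),
  type = "response"
)
probs
```

Setting type = "response" returns probabilities between 0 and 1. The default returns values on the log-odds scale.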

D.8.11 feglm() (fixest package)

Fit generalized linear models with fixed effects. The fixest counterpart to glm().

Arguments:

  • fml: A formula (use | for fixed effects)
  • data: A data frame
  • family: Distribution family (e.g., binomial(link = "logit"))

Example:

library(fixest)
# outcome, treatment, group, and df are placeholders for your own variables and data
logit_fe <- feglm(outcome ~ treatment | group, data = df, family = binomial(link = "logit"))

D.8.12 fixef() (fixest package)

Extract the estimated fixed effects from a feols() or feglm() model.

Arguments:

  • x: A fixest model object

Example:

library(fixest)
reg <- feols(mpg ~ hp | cyl, data = mtcars)
fixef(reg)

D.8.13 bptest() (lmtest package)

Breusch-Pagan test for heteroskedasticity. A significant p-value suggests the residuals have non-constant variance.

Arguments:

  • formula: A fitted model object or formula

Example:

library(lmtest)
reg <- lm(mpg ~ hp + wt, data = mtcars)
bptest(reg)

D.8.14 coeftest() (lmtest package)

Re-test model coefficients with a different covariance matrix. Typically used to report robust standard errors.

Arguments:

  • x: A fitted model object
  • vcov.: A covariance matrix or function that computes one

Example:

library(lmtest)
library(sandwich)
reg <- lm(mpg ~ hp + wt, data = mtcars)
coeftest(reg, vcov. = vcovHC(reg, type = "HC1"))

D.8.15 vcovHC() (sandwich package)

Compute heteroskedasticity-consistent (robust) covariance matrix estimators.

Arguments:

  • x: A fitted model object
  • type: Type of estimator ("HC0", "HC1", "HC2", "HC3")

Example:

library(sandwich)
reg <- lm(mpg ~ hp + wt, data = mtcars)
vcovHC(reg, type = "HC1")
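To make the HC1 estimator less of a black box, the following base-R sketch builds it by hand from the usual sandwich formula, (X'X)^-1 X' diag(e^2) X (X'X)^-1, scaled by n / (n - k). It is meant as an illustration of the definition rather than a replacement for vcovHC(), and the object names are arbitrary.

```r
reg <- lm(mpg ~ hp + wt, data = mtcars)

X <- model.matrix(reg)  # n x k design matrix (includes the intercept)
e <- residuals(reg)     # OLS residuals
n <- nrow(X)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # X' diag(e^2) X: each row of X is scaled by its residual
hc0   <- bread %*% meat %*% bread
hc1   <- hc0 * n / (n - k)     # degrees-of-freedom correction

robust_se <- sqrt(diag(hc1))   # robust standard errors
robust_se
```

If sandwich is installed, the matrix hc1 should agree with vcovHC(reg, type = "HC1") up to rounding.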

D.9 Visualization Functions (ggplot2)

D.9.1 ggplot()

Initialize a ggplot object.

Arguments:

  • data: A data frame
  • mapping: Aesthetic mappings created by aes()

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()


D.9.2 aes()

Define aesthetic mappings (which variables map to x, y, color, etc.).

Arguments:

  • x, y: Variables for axes
  • color, fill: Variables for color
  • size, shape: Variables for size and shape

Example:

ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point()


D.9.3 geom_point()

Add points to a plot (scatterplot).

Arguments:

  • size: Point size
  • color: Point color
  • alpha: Transparency (0 to 1)

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(size = 3, color = "steelblue", alpha = 0.7)


D.9.4 geom_line()

Add lines connecting points.

Arguments:

  • linewidth: Line thickness
  • color: Line color
  • linetype: Line type (e.g., "dashed")

Example:

# Line plot (useful for time series)
set.seed(123)  # make the simulated series reproducible
df <- data.frame(x = 1:10, y = cumsum(rnorm(10)))
ggplot(df, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1)


D.9.5 geom_histogram()

Create a histogram.

Arguments:

  • binwidth: Width of each bin
  • bins: Number of bins
  • fill: Bar fill color
  • color: Bar outline color

Example:

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 3, fill = "steelblue", color = "white")


D.9.6 geom_boxplot()

Create a boxplot.

Arguments:

  • fill: Box fill color
  • outlier.color: Color of outlier points

Example:

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightblue")


D.9.7 geom_bar() and geom_col()

Create bar charts. geom_bar() counts observations; geom_col() uses values directly.

Arguments:

  • fill: Bar fill color
  • stat: For geom_bar(), use "identity" to plot values directly

Example:

# geom_bar counts automatically
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "steelblue")
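The example above covers only geom_bar(). As a companion sketch for geom_col(), the code below first computes one value per group (average mpg by cylinder count, chosen here for illustration) and then plots those values directly.

```r
library(ggplot2)

# One row per cylinder count, with the average mpg already computed
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# geom_col uses the y values as bar heights instead of counting rows
p <- ggplot(avg_mpg, aes(x = factor(cyl), y = mpg)) +
  geom_col(fill = "steelblue")
p
```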


D.9.8 geom_smooth()

Add a smoothed conditional mean (often a regression line).

Arguments:

  • method: Smoothing method (e.g., "lm" for linear)
  • se: Show confidence interval? (default: TRUE)
  • color: Line color

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red")
`geom_smooth()` using formula = 'y ~ x'


D.9.9 geom_hline() and geom_vline()

Add horizontal or vertical reference lines.

Arguments:

  • yintercept / xintercept: Where to draw the line
  • linetype: Line type (e.g., "dashed")
  • color: Line color

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_hline(yintercept = mean(mtcars$mpg), linetype = "dashed", color = "red")


D.9.10 labs()

Add labels to the plot (title, axis labels, etc.).

Arguments:

  • title: Plot title
  • x, y: Axis labels
  • color, fill: Legend titles

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(
    title = "Fuel Efficiency vs. Horsepower",
    x = "Horsepower",
    y = "Miles per Gallon"
  )


D.9.11 facet_wrap()

Create small multiples (separate panels for each level of a variable).

Arguments:

  • facets: A formula like ~ variable
  • nrow, ncol: Number of rows or columns
  • scales: Should scales be fixed or free?

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, nrow = 1)


D.9.12 theme_minimal()

Apply a clean, minimal theme to the plot.

Arguments: None (or see theme() for customization)

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  theme_minimal()


D.10 Utility Functions

D.10.1 c()

Combine values into a vector.

Arguments:

  • ...: Values to combine

Example:

x <- c(1, 2, 3, 4, 5)
x
[1] 1 2 3 4 5

D.10.2 seq()

Generate a sequence of numbers.

Arguments:

  • from: Starting value
  • to: Ending value
  • by: Increment
  • length.out: Desired length of sequence

Example:

seq(from = 0, to = 10, by = 2)
[1]  0  2  4  6  8 10
seq(from = 0, to = 1, length.out = 5)
[1] 0.00 0.25 0.50 0.75 1.00

D.10.3 rep()

Repeat values.

Arguments:

  • x: Value(s) to repeat
  • times: Number of times to repeat

Example:

rep(c("A", "B"), times = 3)
[1] "A" "B" "A" "B" "A" "B"

D.10.4 sample()

Take a random sample.

Arguments:

  • x: Vector to sample from
  • size: Number of items to sample
  • replace: Sample with replacement?

Example:

set.seed(123)
sample(1:10, size = 5, replace = FALSE)
[1]  3 10  2  8  6

D.10.5 set.seed()

Set the random number seed for reproducibility.

Arguments:

  • seed: An integer

Example:

set.seed(42)
rnorm(3)  # Will always produce the same values
[1]  1.3709584 -0.5646982  0.3631284
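A quick way to convince yourself that the seed controls the draws is to reset it and draw again, as in this small sketch:

```r
set.seed(42)
first <- rnorm(3)

set.seed(42)   # reset to the same seed
second <- rnorm(3)

identical(first, second)  # TRUE: same seed, same draws
```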

D.10.6 rnorm()

Generate random numbers from a normal distribution.

Arguments:

  • n: Number of values to generate
  • mean: Mean of the distribution (default: 0)
  • sd: Standard deviation (default: 1)

Example:

set.seed(123)
rnorm(5, mean = 100, sd = 15)
[1]  91.59287  96.54734 123.38062 101.05763 101.93932

D.10.7 factor()

Create a factor (categorical variable).

Arguments:

  • x: A vector
  • levels: The allowed levels (in order)
  • labels: Labels for the levels

Example:

education <- c("HS", "College", "HS", "Graduate")
factor(education, levels = c("HS", "College", "Graduate"))
[1] HS       College  HS       Graduate
Levels: HS College Graduate
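The labels argument is handy when a variable is stored as numeric codes. In this sketch, a made-up 0/1 indicator is turned into a factor with readable labels:

```r
employed <- c(0, 1, 1, 0)

employed_f <- factor(employed, levels = c(0, 1), labels = c("No", "Yes"))
employed_f
table(employed_f)
```

The labels are matched to levels by position, so 0 becomes "No" and 1 becomes "Yes".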

D.10.8 as.factor(), as.numeric(), as.character()

Convert objects to a different type.

Arguments:

  • x: Object to convert

Example:

x <- c("1", "2", "3")
as.numeric(x)
[1] 1 2 3
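One common pitfall is converting a factor whose levels look like numbers. Calling as.numeric() on it returns the internal level codes, not the values, as this short sketch shows:

```r
f <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(f)                # 1 2 1 3: internal level codes
values <- as.numeric(as.character(f))  # 10 20 10 30: the intended numbers

codes
values
```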

D.10.9 is.na()

Check for missing values.

Arguments:

  • x: A vector or data frame

Example:

x <- c(1, 2, NA, 4)
is.na(x)
[1] FALSE FALSE  TRUE FALSE
sum(is.na(x))  # Count missing values
[1] 1

D.10.10 library()

Load an installed package.

Arguments:

  • package: Name of the package (unquoted)

Example:

library(tidyverse)

D.10.11 install.packages()

Install a package from CRAN (run once, not in scripts).

Arguments:

  • pkgs: Package name (quoted)

Example:

install.packages("tidyverse")

D.10.12 data.frame()

Create a data frame (base R version). See also tibble() for the tidyverse alternative.

Arguments:

  • ...: Name-value pairs of columns

Example:

data.frame(
  x = 1:3,
  y = c("a", "b", "c")
)
  x y
1 1 a
2 2 b
3 3 c

D.10.13 names()

Get or set the names of an object (column names for data frames, element names for vectors and lists).

Arguments:

  • x: An R object

Example:

x <- c(a = 1, b = 2, c = 3)
names(x)
[1] "a" "b" "c"

D.10.14 length()

Return the number of elements in a vector or list.

Arguments:

  • x: A vector or list

Example:

x <- c(10, 20, 30, 40)
length(x)
[1] 4

D.10.15 unique()

Return the unique values in a vector, removing duplicates.

Arguments:

  • x: A vector

Example:

x <- c(1, 2, 2, 3, 3, 3)
unique(x)
[1] 1 2 3

D.10.16 which()

Return the indices of elements that satisfy a condition.

Arguments:

  • x: A logical vector

Example:

x <- c(5, 12, 3, 18, 7)
which(x > 10)
[1] 2 4

D.10.17 table()

Build a frequency table (counts of each unique value).

Arguments:

  • ...: One or more vectors

Example:

table(mtcars$cyl)

 4  6  8 
11  7 14 

D.10.18 paste() and paste0()

Concatenate strings. paste() separates with a space by default; paste0() uses no separator.

Arguments:

  • ...: Strings or vectors to concatenate
  • sep: Separator between elements (default " " for paste(), "" for paste0())
  • collapse: Optional string to collapse a vector into a single string

Example:

paste("Year", 2020)
[1] "Year 2020"
paste0("x", 1:3)
[1] "x1" "x2" "x3"
paste(c("a", "b", "c"), collapse = ", ")
[1] "a, b, c"

D.10.19 sprintf()

Format strings with placeholders. Useful for inserting numbers into text with specific formatting.

Arguments:

  • fmt: A format string with %s (string), %d (integer), %f (decimal) placeholders
  • ...: Values to insert

Example:

sprintf("The coefficient is %.3f (p = %.4f)", -0.068, 0.0023)
[1] "The coefficient is -0.068 (p = 0.0023)"

D.10.20 format()

Format numbers for display with control over decimal places, significant digits, and separators.

Arguments:

  • x: A number or vector
  • digits: Number of significant digits
  • nsmall: Minimum number of digits after the decimal
  • big.mark: Thousands separator

Example:

format(123456.789, big.mark = ",", nsmall = 2)
[1] "123,456.79"

D.10.22 log() and exp()

Natural logarithm and exponential functions. log() computes \(\ln(x)\); exp() computes \(e^x\).

Arguments:

  • x: A numeric vector
  • base: Base of the logarithm (default: \(e\), use base = 10 for \(\log_{10}\))

Example:

log(100)        # Natural log
[1] 4.60517
log(100, base = 10)  # Log base 10
[1] 2
exp(1)          # e^1 = 2.718...
[1] 2.718282

D.10.23 sqrt()

Compute the square root.

Arguments:

  • x: A numeric vector

Example:

sqrt(144)
[1] 12

D.10.24 abs()

Compute the absolute value.

Arguments:

  • x: A numeric vector

Example:

abs(c(-3, -1, 0, 2, 5))
[1] 3 1 0 2 5

D.10.25 round()

Round a number to a specified number of decimal places.

Arguments:

  • x: A numeric vector
  • digits: Number of decimal places (default: 0)

Example:

round(3.14159, digits = 2)
[1] 3.14
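
Two behaviors of round() are easy to miss. Values exactly halfway between two candidates are rounded to the even digit (the IEC 60559 convention) rather than always up, and a negative digits value rounds to the left of the decimal point. A short sketch, with illustrative variable names:

```r
# Halfway cases go to the nearest even number
half_cases <- round(c(0.5, 1.5, 2.5, 3.5))
half_cases

# Negative digits round to tens, hundreds, and so on
hundreds <- round(1234.567, digits = -2)
hundreds
```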

D.10.26 cumsum()

Compute the cumulative sum of a vector.

Arguments:

  • x: A numeric vector

Example:

cumsum(c(1, 2, 3, 4, 5))
[1]  1  3  6 10 15

D.11 The Pipe Operator: |>

The pipe operator takes the output from the left side and passes it as the first argument to the function on the right. This allows you to chain operations together in a readable way.

Example without pipe:

# Nested functions are hard to read
summary(select(filter(mtcars, mpg > 20), mpg, hp))

Example with pipe:

# Piped version is much clearer
mtcars |>
  filter(mpg > 20) |>
  select(mpg, hp) |>
  summary()
      mpg              hp       
 Min.   :21.00   Min.   : 52.0  
 1st Qu.:21.43   1st Qu.: 66.0  
 Median :23.60   Median : 94.0  
 Mean   :25.48   Mean   : 88.5  
 3rd Qu.:29.62   3rd Qu.:109.8  
 Max.   :33.90   Max.   :113.0  
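Tip: Keyboard Shortcut

In RStudio, press Cmd/Ctrl + Shift + M to insert a pipe. By default this shortcut inserts %>%; to insert |> instead, enable "Use native pipe operator" under Tools > Global Options > Code.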
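
To make the first-argument rule concrete, here is a minimal base R sketch (no packages needed, but the native pipe requires R 4.1 or later). The names piped, nested, and rounded are illustrative.

```r
# x |> f() is evaluated as f(x), so these two lines compute the same thing
piped  <- c(1, 4, 9) |> sqrt() |> sum()
nested <- sum(sqrt(c(1, 4, 9)))
identical(piped, nested)

# Extra arguments go after the piped value: x |> round(digits = 1) is round(x, digits = 1)
rounded <- 3.14159 |> round(digits = 1)
rounded
```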


D.12 Statistical Distribution Functions

R has a consistent naming convention for distribution functions. Each distribution has four functions, prefixed by a letter:

  • d = density (the height of the PDF at a given value)
  • p = probability (the CDF—cumulative probability up to a given value)
  • q = quantile (the inverse of the CDF—find the value for a given probability)
  • r = random (generate random draws from the distribution)

For example, for the normal distribution: dnorm(), pnorm(), qnorm(), rnorm().
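
The four prefixes can be seen side by side for the standard normal distribution. This is a small sketch with illustrative variable names; the simulated share depends on the seed and will only be close to the exact probability.

```r
p <- pnorm(1.96)   # CDF: P(Z <= 1.96), roughly 0.975
q <- qnorm(p)      # quantile: inverts the CDF and recovers 1.96
d <- dnorm(1.96)   # density: height of the PDF at 1.96

set.seed(123)
draws <- rnorm(1000)             # random: 1,000 simulated values
share <- mean(draws <= 1.96)     # simulated counterpart of p
c(p = p, q = q, d = d, share = share)
```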

D.12.1 dnorm() and dt()

Compute the probability density function (PDF) for the normal or t-distribution. Useful for plotting distribution curves.

Arguments:

  • x: Value(s) at which to evaluate the density
  • mean, sd: Parameters of the normal distribution (for dnorm())
  • df: Degrees of freedom (for dt())

Example:

# Height of the standard normal PDF at x = 0
dnorm(0, mean = 0, sd = 1)
[1] 0.3989423
# Height of the t-distribution PDF at x = 2 with 30 df
dt(2, df = 30)
[1] 0.05685228

D.12.2 pt()

Compute the cumulative distribution function (CDF) for the t-distribution. This gives you the probability that a t-distributed random variable is less than or equal to a given value. Essential for computing p-values.

Arguments:

  • q: The t-value(s) to evaluate
  • df: Degrees of freedom
  • lower.tail: If TRUE (default), returns \(P(T \leq q)\); if FALSE, returns \(P(T > q)\)

Example:

# P-value for a two-sided test with t = 2.3 and 100 df
2 * pt(abs(2.3), df = 100, lower.tail = FALSE)
[1] 0.0235262
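
The same formula reproduces the p-values that summary() reports for a regression. The sketch below assumes a simple mtcars model chosen for illustration; est, t_stat, and p_manual are hypothetical names.

```r
reg <- lm(mpg ~ hp + wt, data = mtcars)
est <- coef(summary(reg))  # columns: Estimate, Std. Error, t value, Pr(>|t|)

# t-statistic for H0: coefficient = 0, then the two-sided p-value
t_stat   <- est[, "Estimate"] / est[, "Std. Error"]
p_manual <- 2 * pt(abs(t_stat), df = df.residual(reg), lower.tail = FALSE)

# Should agree with the p-values computed by summary()
all.equal(p_manual, est[, "Pr(>|t|)"])
```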

D.12.3 qt()

Compute the quantile function (inverse CDF) for the t-distribution. Given a probability, it returns the corresponding t-value. Used to find critical values for hypothesis tests.

Arguments:

  • p: Probability (between 0 and 1)
  • df: Degrees of freedom
  • lower.tail: If TRUE (default), finds the value where \(P(T \leq q) = p\); if FALSE, finds the value where \(P(T > q) = p\)

Example:

# Critical value for a two-sided 5% test with 100 df
# We want the value that leaves 2.5% in the upper tail
qt(0.025, df = 100, lower.tail = FALSE)
[1] 1.983972
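
A critical value from qt() is what turns a standard error into a confidence interval. As a sketch (the model and the names b, se, t_crit, and ci_manual are chosen here for illustration), a 95% interval built by hand should match confint():

```r
reg <- lm(mpg ~ hp, data = mtcars)
b  <- coef(reg)["hp"]
se <- coef(summary(reg))["hp", "Std. Error"]

# Critical value that leaves 2.5% in the upper tail
t_crit <- qt(0.025, df = df.residual(reg), lower.tail = FALSE)

# Estimate plus or minus critical value times standard error
ci_manual <- unname(c(b - t_crit * se, b + t_crit * se))
ci_manual
confint(reg)["hp", ]
```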

D.12.4 qf()

Compute the quantile function for the F-distribution. Used to find critical values for F-tests.

Arguments:

  • p: Probability
  • df1: Numerator degrees of freedom (number of restrictions)
  • df2: Denominator degrees of freedom (from the unrestricted model)
  • lower.tail: If FALSE, returns the value where \(P(F > q) = p\)

Example:

# Critical value for an F-test with 2 and 494 degrees of freedom at 5%
qf(0.05, df1 = 2, df2 = 494, lower.tail = FALSE)
[1] 3.013973

D.12.5 runif()

Generate random draws from a uniform distribution.

Arguments:

  • n: Number of values to generate
  • min: Minimum value (default: 0)
  • max: Maximum value (default: 1)

Example:

set.seed(123)
runif(5, min = 1, max = 10)
[1] 3.588198 8.094746 4.680792 8.947157 9.464206

D.12.6 rexp()

Generate random draws from an exponential distribution.

Arguments:

  • n: Number of values to generate
  • rate: Rate parameter \(\lambda\) (default: 1). The mean of the distribution is \(1/\lambda\).

Example:

set.seed(123)
rexp(5, rate = 0.5)  # Mean = 1/0.5 = 2
[1] 1.68691452 1.15322054 2.65810974 0.06315472 0.11242195

D.12.7 rbinom()

Generate random draws from a binomial distribution. Useful for simulating binary outcomes (e.g., treatment assignment).

Arguments:

  • n: Number of values to generate
  • size: Number of trials
  • prob: Probability of success on each trial

Example:

set.seed(123)
# Simulate 10 coin flips (1 = heads, 0 = tails)
rbinom(10, size = 1, prob = 0.5)
 [1] 0 1 0 1 1 0 1 1 1 0

D.12.8 rt()

Generate random draws from a t-distribution.

Arguments:

  • n: Number of values to generate
  • df: Degrees of freedom

Example:

set.seed(123)
rt(5, df = 30)
[1] -0.5878234 -1.4779045 -0.1125616 -1.4142351  1.6124113

D.13 Model Diagnostic Functions

D.13.1 anova()

Compare nested models using an F-test. Pass the restricted (smaller) model first, then the unrestricted (larger) model. The test checks whether the additional variables in the unrestricted model are jointly significant.

Arguments:

  • object: One or more fitted model objects

Example:

reg_small <- lm(mpg ~ hp, data = mtcars)
reg_large <- lm(mpg ~ hp + wt + cyl, data = mtcars)
anova(reg_small, reg_large)
Analysis of Variance Table

Model 1: mpg ~ hp
Model 2: mpg ~ hp + wt + cyl
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     30 447.67                                  
2     28 176.62  2    271.05 21.485 2.214e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
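
The F statistic in this table can be rebuilt from the two residual sums of squares. The sketch below refits the same two models; rss_r, rss_u, q, f_stat, and f_crit are illustrative names.

```r
reg_small <- lm(mpg ~ hp, data = mtcars)
reg_large <- lm(mpg ~ hp + wt + cyl, data = mtcars)

rss_r <- sum(resid(reg_small)^2)   # restricted RSS
rss_u <- sum(resid(reg_large)^2)   # unrestricted RSS
q     <- df.residual(reg_small) - df.residual(reg_large)  # number of restrictions

# F = [(RSS_r - RSS_u) / q] / [RSS_u / residual df of the unrestricted model]
f_stat <- ((rss_r - rss_u) / q) / (rss_u / df.residual(reg_large))

# 5% critical value; reject the null if f_stat exceeds it
f_crit <- qf(0.05, df1 = q, df2 = df.residual(reg_large), lower.tail = FALSE)
c(f_stat = f_stat, f_crit = f_crit)
```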

D.13.2 nobs()

Return the number of observations used to fit a model.

Arguments:

  • object: A fitted model object

Example:

reg <- lm(mpg ~ hp, data = mtcars)
nobs(reg)
[1] 32

D.13.3 df.residual()

Return the residual degrees of freedom (\(n - k - 1\)) from a fitted model.

Arguments:

  • object: A fitted model object

Example:

reg <- lm(mpg ~ hp + wt, data = mtcars)
df.residual(reg)  # 32 - 2 - 1 = 29
[1] 29

D.13.4 resid()

Extract residuals from a fitted model. Equivalent to residuals().

Arguments:

  • object: A fitted model object

Example:

reg <- lm(mpg ~ hp, data = mtcars)
head(resid(reg))
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
       -1.5937500        -1.5937500        -0.9536307        -1.1937500 
Hornet Sportabout           Valiant 
        0.5410881        -4.8348913 
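
Residuals are simply observed values minus fitted values, and with an intercept in the model they sum to zero up to floating-point error. A quick check, using illustrative names:

```r
reg <- lm(mpg ~ hp, data = mtcars)

# Residual = observed outcome minus fitted value
manual_resid <- mtcars$mpg - fitted(reg)
all.equal(resid(reg), manual_resid)

# OLS residuals sum to (numerically) zero when an intercept is included
resid_sum <- sum(resid(reg))
resid_sum
```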

D.14 Additional Data Manipulation Functions

D.14.1 tibble()

Create a data frame (modern version). Works like data.frame() but with stricter, more predictable defaults: it never converts strings to factors (data.frame() did so by default before R 4.0) and it prints more cleanly.

Arguments:

  • ...: Name-value pairs of columns

Example:

tibble(
  name = c("Alice", "Bob", "Carol"),
  age = c(25, 30, 35),
  income = c(50000, 60000, 70000)
)
# A tibble: 3 × 3
  name    age income
  <chr> <dbl>  <dbl>
1 Alice    25  50000
2 Bob      30  60000
3 Carol    35  70000

D.14.2 bind_rows()

Stack data frames on top of each other (by rows).

Arguments:

  • ...: Data frames to bind together

Example:

df1 <- tibble(x = 1:3, y = c("a", "b", "c"))
df2 <- tibble(x = 4:6, y = c("d", "e", "f"))
bind_rows(df1, df2)
# A tibble: 6 × 2
      x y    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
4     4 d    
5     5 e    
6     6 f    
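
bind_rows() also copes with data frames whose columns do not match, and it can record where each row came from. The sketch below assumes dplyr is installed; the data frames and the "source" column name are made up for illustration.

```r
library(dplyr)

df1 <- tibble(x = 1:2, y = c("a", "b"))
df2 <- tibble(x = 3L, z = TRUE)   # has z but no y

# Named inputs plus .id add a column identifying each row's origin;
# columns missing from one input are filled with NA
combined <- bind_rows(first = df1, second = df2, .id = "source")
combined
```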

D.14.3 slice_sample()

Randomly sample rows from a data frame.

Arguments:

  • .data: A data frame
  • n: Number of rows to sample
  • replace: Sample with replacement? (default: FALSE)

Example:

set.seed(123)
mtcars |>
  slice_sample(n = 5)
                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Maserati Bora      15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Honda Civic        30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Merc 450SLC        15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Datsun 710         22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
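
The replace argument matters most for bootstrapping, where the resample is as large as the original data. A minimal sketch, assuming dplyr is installed; it uses the prop argument of slice_sample() (not listed above) to sample a share of rows instead of a fixed count, and boot is an illustrative name.

```r
library(dplyr)

set.seed(123)
# Draw a bootstrap sample: same size as mtcars, sampled with replacement
boot <- mtcars |>
  slice_sample(prop = 1, replace = TRUE)

nrow(boot)       # same number of rows as the original data
mean(boot$mpg)   # one bootstrap replicate of the sample mean
```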

D.14.4 map_dfr() (purrr)

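Apply a function to each element of a list or vector and combine the results into a single data frame by binding rows. Useful for running simulations. In purrr 1.0.0 and later, map_dfr() is superseded: it still works, but new code is encouraged to use map() followed by list_rbind().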

Arguments:

  • .x: A list or vector to iterate over
  • .f: A function to apply to each element

Example:

# Run 5 simulations and collect results
map_dfr(1:5, function(i) {
  x <- rnorm(50)
  tibble(sim = i, mean_x = mean(x))
})
# A tibble: 5 × 2
    sim    mean_x
  <int>     <dbl>
1     1 -0.0600  
2     2  0.0103  
3     3  0.0978  
4     4 -0.000785
5     5 -0.0664  
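
The same simulation can be written with map() and list_rbind(), which purrr 1.0.0 and later recommend in place of map_dfr(). This is a sketch under that version assumption; the seed and the name sims are added here so the result is reproducible.

```r
library(purrr)
library(tibble)

set.seed(123)
# map() returns a list of one-row tibbles; list_rbind() stacks them
sims <- map(1:5, function(i) {
  x <- rnorm(50)
  tibble(sim = i, mean_x = mean(x))
}) |>
  list_rbind()
sims
```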

D.15 Additional ggplot2 Functions

D.15.1 annotate()

Add text, labels, or shapes to a plot at specific coordinates. Unlike geom_text(), annotate() is for adding single annotations rather than mapping data to text.

Arguments:

  • geom: Type of annotation (e.g., "text", "rect", "segment")
  • x, y: Position of the annotation
  • label: Text to display (for "text" geom)
  • parse: If TRUE, interpret the label as a plotmath expression (default: FALSE)
  • color, size: Styling options

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  annotate("text", x = 300, y = 30,
           label = "hat(beta)[1] == -0.068",
           parse = TRUE, color = "red", size = 5)


D.15.2 stat_function()

Overlay a mathematical function on a ggplot. Useful for plotting theoretical distributions on top of histograms.

Arguments:

  • fun: The function to plot (e.g., dnorm, dt)
  • args: A list of additional arguments to pass to the function
  • color, linewidth: Styling options

Example:

ggplot(data.frame(x = rnorm(500)), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30, fill = "lightblue", color = "black") +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1),
                color = "red", linewidth = 1.2)


D.15.3 geom_segment()

Draw line segments between specified start and end points. Useful for adding arrows, error bars, or connecting points.

Arguments:

  • aes(x, y, xend, yend): Start and end coordinates
  • arrow: Add arrowheads with arrow()
  • color, linewidth, linetype: Styling options

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  # A one-row data frame draws the arrow once and avoids the
  # "All aesthetics have length 1" warning
  geom_segment(data = tibble(x = 150, y = 25, xend = 150, yend = 15),
               aes(x = x, y = y, xend = xend, yend = yend),
               arrow = arrow(length = unit(0.2, "cm")),
               color = "red", linewidth = 1)


D.15.4 geom_ribbon() and geom_area()

Shade a region between a ymin and ymax. geom_area() is a special case where ymin = 0. Useful for shading regions under distribution curves (e.g., p-values, rejection regions).

Arguments:

  • aes(x, ymin, ymax): Boundaries of the shaded region
  • fill: Fill color
  • alpha: Transparency

Example:

shade_data <- tibble(
  x = seq(-3, 3, length.out = 200),
  y = dnorm(x)
)

ggplot(shade_data, aes(x = x)) +
  geom_line(aes(y = y)) +
  geom_ribbon(data = shade_data |> filter(x >= 1.96),
              aes(ymin = 0, ymax = y),
              fill = "red", alpha = 0.5)
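
Since the shaded region here starts at zero, geom_area() can draw the same picture with only a y aesthetic. A sketch assuming ggplot2 and dplyr are installed; p_area is an illustrative name.

```r
library(ggplot2)
library(dplyr)

shade_data <- tibble(
  x = seq(-3, 3, length.out = 200),
  y = dnorm(x)
)

# geom_area() fills from 0 up to y, so no ymin/ymax is needed
p_area <- ggplot(shade_data, aes(x = x, y = y)) +
  geom_line() +
  geom_area(data = shade_data |> filter(x >= 1.96),
            fill = "red", alpha = 0.5)
p_area
```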


D.15.5 geom_errorbar()

Add error bars to a plot. Can be vertical (default) or horizontal (with orientation = "y"). Commonly used for confidence interval plots.

Arguments:

  • aes(ymin, ymax): Lower and upper bounds (vertical), or aes(xmin, xmax) with orientation = "y" (horizontal)
  • width: Width of the error bar caps
  • orientation: Set to "y" for horizontal error bars

Example:

ci_data <- tibble(
  variable = c("hp", "wt", "cyl"),
  estimate = c(-0.03, -3.8, -1.5),
  lower = c(-0.05, -5.1, -2.8),
  upper = c(-0.01, -2.5, -0.2)
)

ggplot(ci_data, aes(y = variable, x = estimate)) +
  geom_point(size = 3) +
  geom_errorbar(aes(xmin = lower, xmax = upper),
                width = 0.2, orientation = "y") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_minimal()


D.15.6 geom_density()

Plot a smoothed density estimate. An alternative to histograms for visualizing distributions.

Arguments:

  • fill: Fill color
  • alpha: Transparency
  • color: Line color

Example:

ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "steelblue", alpha = 0.5)


D.15.7 scale_color_manual() and scale_fill_manual()

Manually set colors for categorical variables mapped to color or fill.

Arguments:

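  • values: A vector of colors. Name the elements by level (as in the example) to pin each color to a specific level; unnamed values are assigned in level order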
  • labels: Optional labels for the legend

Example:

ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  scale_color_manual(values = c("4" = "steelblue", "6" = "orange", "8" = "red"))


D.15.8 coord_cartesian()

Zoom into a region of the plot without dropping data points (unlike xlim()/ylim(), which remove data outside the range).

Arguments:

  • xlim: Range for x-axis as c(min, max)
  • ylim: Range for y-axis as c(min, max)

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  coord_cartesian(xlim = c(50, 200), ylim = c(15, 35))


D.15.9 coord_flip()

Flip the x and y axes. Useful for making horizontal bar charts or coefficient plots.

Example:

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightblue") +
  coord_flip()


D.15.10 theme()

Customize individual elements of a plot’s appearance (axis text, legend position, grid lines, etc.). Add it to a plot with + like any other ggplot2 component.

Arguments:

  • axis.text: Control axis tick label appearance with element_text()
  • axis.title: Control axis title appearance
  • legend.position: Position of legend ("top", "bottom", "left", "right", "none")
  • panel.grid.minor: Control minor grid lines with element_blank() to remove

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  theme(
    axis.text = element_text(size = 12),
    legend.position = "none",
    panel.grid.minor = element_blank()
  )


D.15.11 element_text() and element_blank()

Helper functions used inside theme(). element_text() styles text elements; element_blank() removes elements entirely.

Arguments (element_text()):

  • size: Font size
  • face: Font face ("bold", "italic")
  • angle: Rotation angle
  • hjust, vjust: Horizontal and vertical justification

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "MPG vs Horsepower") +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    panel.grid.minor = element_blank()
  )


D.15.12 geom_text() and geom_label()

Add text labels to data points. geom_label() draws a rectangle behind the text for readability.

Arguments:

  • aes(label): The text to display
  • nudge_x, nudge_y: Offset the label from the point
  • size: Text size
  • hjust, vjust: Horizontal and vertical justification

Example:

top_cars <- mtcars |>
  mutate(car = rownames(mtcars)) |>
  slice(1:5)

ggplot(top_cars, aes(x = hp, y = mpg, label = car)) +
  geom_point() +
  geom_text(nudge_y = 1, size = 3)


D.15.13 geom_rug()

Add small tick marks along the axes showing the marginal distribution of the data. Useful for showing where observations are concentrated.

Arguments:

  • sides: Which sides to draw rugs on: any combination of "t" (top), "r" (right), "b" (bottom), and "l" (left); the default is "bl" (bottom and left)
  • alpha: Transparency

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_rug(alpha = 0.5)


D.15.14 scale_x_continuous() and scale_y_continuous()

Customize the x or y axis for continuous variables: set limits, breaks, and labels.

Arguments:

  • breaks: Where to place tick marks
  • labels: Labels for the tick marks
  • limits: Range of the axis
  • expand: Expansion around the data range

Example:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  scale_x_continuous(breaks = seq(50, 350, by = 50)) +
  scale_y_continuous(limits = c(10, 35))


D.15.15 guides() and guide_legend()

Customize or remove legends. guides() controls which aesthetics get legends; guide_legend() customizes legend appearance.

Arguments (guides()):

  • Aesthetic names (e.g., color, fill, size) set to "none" to remove or guide_legend() to customize

Example:

ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  guides(color = guide_legend(title = "Cylinders"))
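
The example above customizes a legend. To remove one instead, set the aesthetic to "none", as in this short sketch (assuming ggplot2 is installed; p_nolegend is an illustrative name):

```r
library(ggplot2)

# The points stay colored by cylinder count, but the legend is dropped
p_nolegend <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  guides(color = "none")
p_nolegend
```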


D.16 Quick Reference Tables

D.16.1 Data Import/Export Functions

Table D.1: Data import/export functions
Function        Purpose               Package
read_csv()      Read CSV files        readr (tidyverse)
read_excel()    Read Excel files      readxl
read_dta()      Read Stata files      haven
read_rds()      Read R data files     readr (tidyverse)
write_csv()     Write CSV files       readr (tidyverse)
write_rds()     Write R data files    readr (tidyverse)
write_dta()     Write Stata files     haven

D.16.2 Descriptive Statistics Functions

Table D.2: Descriptive statistics functions
Function              Purpose                Key Argument
mean()                Average                na.rm = TRUE
sd()                  Standard deviation     na.rm = TRUE
median()              Middle value           na.rm = TRUE
min(), max()          Extremes               na.rm = TRUE
sum()                 Total                  na.rm = TRUE
var()                 Variance               na.rm = TRUE
cor()                 Correlation            use = "complete.obs"
summary()             Multiple stats
datasummary_skim()    Quick summary table    (from the modelsummary package)
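
The key arguments in this table all deal with missing values. The sketch below uses made-up vectors x and y to show what happens with and without them.

```r
x <- c(10, 20, NA, 40)
y <- c(1, 2, 3, 5)

mean(x)                            # NA: one missing value poisons the result
mean_ok <- mean(x, na.rm = TRUE)   # drops the NA first
mean_ok

cor(x, y)                                  # NA for the same reason
cor_ok <- cor(x, y, use = "complete.obs")  # uses only rows where both are observed
cor_ok
```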

D.16.3 Data Manipulation Functions (dplyr)

Table D.3: Data manipulation functions
Function       Purpose                               Example
select()       Choose columns                        select(data, col1, col2)
filter()       Choose rows                           filter(data, x > 5)
mutate()       Create/modify variables               mutate(data, new = x * 2)
group_by()     Group data                            group_by(data, category)
summarize()    Aggregate                             summarize(data, avg = mean(x))
count()        Count rows                            count(data, category)
arrange()      Sort rows                             arrange(data, desc(x))
rename()       Rename columns                        rename(data, new = old)
slice()        Select rows by position               slice(data, 1:10)
across()       Apply function to multiple columns    across(where(is.numeric), mean)
ungroup()      Remove grouping                       ungroup(data)
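
Several of these verbs are usually chained together, and across() only works inside a verb such as summarize() or mutate(). A sketch assuming dplyr is installed; by_cyl is an illustrative name.

```r
library(dplyr)

# Mean mpg and hp by cylinder count, plus group sizes, sorted by mean mpg
by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(across(c(mpg, hp), mean), n = n()) |>
  arrange(desc(mpg))
by_cyl
```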

D.16.4 Join Functions (dplyr)

Table D.4: Join functions
Function            Keeps
left_join(x, y)     All rows from x, matching rows from y
right_join(x, y)    All rows from y, matching rows from x
inner_join(x, y)    Only rows with matches in both
full_join(x, y)     All rows from both
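
The difference between the joins shows up in the row counts. In this sketch (assuming dplyr is installed), x and y are made-up tables that share ids 2 and 3 only.

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3), wage = c(10, 15, 20))
y <- tibble(id = c(2, 3, 4), educ = c(12, 16, 18))

n_left  <- nrow(left_join(x, y, by = "id"))   # ids 1, 2, 3 (educ is NA for id 1)
n_right <- nrow(right_join(x, y, by = "id"))  # ids 2, 3, 4 (wage is NA for id 4)
n_inner <- nrow(inner_join(x, y, by = "id"))  # ids 2, 3 only
n_full  <- nrow(full_join(x, y, by = "id"))   # ids 1, 2, 3, 4
c(left = n_left, right = n_right, inner = n_inner, full = n_full)
```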

D.16.5 Reshaping Functions (tidyr)

Table D.5: Reshaping functions
Function          Purpose         Key Arguments
pivot_longer()    Wide to long    cols, names_to, values_to
pivot_wider()     Long to wide    names_from, values_from
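
The two functions are inverses of each other. The round trip below is a sketch on a small made-up table (assuming tidyr and tibble are installed); wide, long, and back are illustrative names.

```r
library(tidyr)
library(tibble)

wide <- tibble(country = c("A", "B"),
               `2019` = c(1.0, 2.0),
               `2020` = c(1.5, 2.5))

# Wide to long: one row per country-year
long <- wide |>
  pivot_longer(cols = c(`2019`, `2020`), names_to = "year", values_to = "gdp")
long

# Long to wide: back to one column per year
back <- long |>
  pivot_wider(names_from = year, values_from = gdp)
back
```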

D.16.6 Regression and Diagnostic Functions

Table D.6: Regression and diagnostic functions
Function                 Purpose                        Package
lm()                     OLS regression                 base R
glm()                    Generalized linear models      base R
summary()                Model results                  base R
coef()                   Coefficients                   base R
confint()                Confidence intervals           base R
predict()                Fitted values                  base R
residuals() / resid()    Residuals                      base R
anova()                  F-test (compare models)        base R
nobs()                   Number of observations         base R
df.residual()            Residual degrees of freedom    base R
feols()                  Fixed effects + robust SE      fixest
feglm()                  GLM with fixed effects         fixest
fixef()                  Extract fixed effects          fixest
bptest()                 Breusch-Pagan test             lmtest
coeftest()               Test with robust SEs           lmtest
vcovHC()                 Robust covariance matrix       sandwich
modelsummary()           Formatted tables               modelsummary

D.16.7 Statistical Distribution Functions

Table D.7: Statistical distribution functions
Function         Purpose                      Example
dnorm(), dt()    Density (PDF height)         dnorm(0), dt(2, df=30)
pt()             CDF for t-distribution       pt(2.3, df=100)
qt()             Critical values (t-dist)     qt(0.025, df=100, lower.tail=FALSE)
qf()             Critical values (F-dist)     qf(0.05, df1=2, df2=494, lower.tail=FALSE)
rnorm()          Random normal draws          rnorm(100, mean=0, sd=1)
runif()          Random uniform draws         runif(100, min=0, max=1)
rbinom()         Random binomial draws        rbinom(100, size=1, prob=0.5)
rexp()           Random exponential draws     rexp(100, rate=0.5)
rt()             Random t-dist draws          rt(100, df=30)

D.16.8 ggplot2 Geometries and Customization

Table D.8: ggplot2 geometries and customization
Function                                       Plot Type / Purpose
geom_point()                                   Scatterplot
geom_line()                                    Line plot
geom_histogram()                               Histogram
geom_density()                                 Smoothed density
geom_boxplot()                                 Boxplot
geom_bar()                                     Bar chart (counts)
geom_col()                                     Bar chart (values)
geom_smooth()                                  Smoothed line/regression
geom_segment()                                 Line segments/arrows
geom_ribbon() / geom_area()                    Shaded regions
geom_errorbar()                                Error bars / CIs
geom_hline() / geom_vline()                    Reference lines
geom_text() / geom_label()                     Data labels
geom_rug()                                     Marginal tick marks
annotate()                                     Text/shape annotations
stat_function()                                Overlay math functions
theme()                                        Customize plot appearance
scale_x_continuous() / scale_y_continuous()    Customize axes
scale_color_manual() / scale_fill_manual()     Custom colors
coord_cartesian()                              Zoom without dropping data
coord_flip()                                   Flip axes
guides() / guide_legend()                      Customize legends

D.16.9 Utility Functions

Table D.9: Utility functions
Function                   Purpose
c()                        Combine values into a vector
data.frame() / tibble()    Create data frames
paste() / paste0()         Concatenate strings
sprintf() / format()       Format strings and numbers
log() / exp()              Natural log and exponential
sqrt() / abs()             Square root and absolute value
round()                    Round to decimal places
cumsum()                   Cumulative sum
length() / unique()        Vector length / unique values
which() / table()          Find indices / frequency table
names()                    Get/set names
is.na()                    Check for missing values
factor()                   Create categorical variable