Appendix C — Importing and Working with Data

Key Questions

How do I import data files into R?
What are README files and codebooks, and why do they matter?
What are the first things I should do after loading a dataset?
What is the difference between wide and long data?
How do I reshape data between wide and long formats?
How do I export data from R?

Suggested Readings

Wickham et al. (2023), Ch. 7-8, 11

Every empirical analysis starts with the same step: getting data into R. Whether you’re working with a dataset from a published paper, government statistics, or data you collected yourself, you need to know how to import it, understand its structure, and get it into the right shape for your analysis. This appendix walks you through the entire process, from reading in a data file to reshaping and exporting your results.

C.1 README Files and Codebooks

Before you ever open a dataset in R, you should look for its documentation. Two types of documentation are especially important: the README file and the codebook.

C.1.1 README Files

A README file provides a high-level overview of a dataset or project. It typically contains information about where the data came from, when it was collected, who collected it, and any important notes about how the data should be used or cited. README files are usually plain text (.txt) or Markdown (.md) files.

A good README for a research dataset might look something like this:

================================================================
README: County-Level Birth Outcomes Data
================================================================

Source:       National Vital Statistics System (NVSS)
Years:        2015-2020
Unit:         County-year
Coverage:     All U.S. counties with 100+ births per year

Description:
  This dataset contains county-level measures of birth outcomes
  including low birth weight rates, preterm birth rates, and
  infant mortality rates. Data are derived from individual birth
  certificate records aggregated to the county level.

Files:
  birth_outcomes.csv     - Main analysis dataset
  county_covariates.csv  - County demographic characteristics
  codebook.txt           - Variable descriptions and coding

Notes:
  - Counties with fewer than 100 births in a given year are
    suppressed for privacy.
  - Infant mortality rates are per 1,000 live births.
  - FIPS codes follow 2015 county definitions.

Citation:
  National Center for Health Statistics. National Vital
  Statistics System, Natality Data, 2015-2020.
================================================================

Even if a dataset doesn’t come with a README, you should create one for your own projects. When you come back to an analysis six months later, you will thank yourself for writing down where the data came from and what the key variables mean.

C.1.2 Codebooks

A codebook (sometimes called a data dictionary) is more detailed than a README. It describes every variable in the dataset: what it measures, what values it can take, how missing values are coded, and any important notes about interpretation.

Here’s an example of what a codebook might look like for the birth outcomes dataset described above:

================================================================
CODEBOOK: birth_outcomes.csv
================================================================

Variable        Type      Description
----------------------------------------------------------------
fips            character 5-digit county FIPS code
state           character 2-letter state abbreviation
county_name     character County name
year            integer   Calendar year (2015-2020)
total_births    integer   Total number of live births
lbw_rate        numeric   Low birth weight rate (% of births
                          under 2,500 grams)
preterm_rate    numeric   Preterm birth rate (% of births
                          before 37 weeks gestation)
imr             numeric   Infant mortality rate
                          (deaths per 1,000 live births)
median_income   numeric   County median household income
                          (in 2020 dollars)
pct_uninsured   numeric   Percent of population uninsured
urban           integer   1 = metropolitan county,
                          0 = non-metropolitan county

Missing Values: Coded as NA
================================================================

Always Look for Documentation First

Before you start analyzing any dataset, look for the README and codebook. Understanding what your variables actually measure—and how they are coded—is essential for avoiding errors in your analysis. For example, if you don’t know that imr is measured per 1,000 live births, you might mistakenly interpret a coefficient as a percentage change.

C.2 Importing Data

The tidyverse includes a package called readr that provides fast, consistent functions for reading data files into R. When you load the tidyverse with library(tidyverse), readr is automatically loaded for you.

C.2.1 CSV Files

The most common data format you’ll encounter is the CSV (comma-separated values) file. A CSV file is essentially a plain text file where each row is an observation and each value is separated by commas. To read a CSV file, use read_csv():

# Read a CSV file from your working directory
birth_data <- read_csv("birth_outcomes.csv")

When you run read_csv(), it will print a message showing you what type it assigned to each column. This is helpful for catching issues—for example, if a column you expected to be numeric shows up as character, that might indicate there are non-numeric entries (like “N/A” or “-”) in the data that need to be addressed.

You can also read CSV files directly from a URL, which is convenient for datasets hosted online:

# Read a CSV file from a URL
storms_data <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv")

head(storms_data)

# A tibble: 6 × 14
  rownames name   year month   day  hour   lat  long status       category  wind
     <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>           <dbl> <dbl>
1        1 Amy    1975     6    27     0  27.5 -79   tropical de…       NA    25
2        2 Amy    1975     6    27     6  28.5 -79   tropical de…       NA    25
3        3 Amy    1975     6    27    12  29.5 -79   tropical de…       NA    25
4        4 Amy    1975     6    27    18  30.5 -79   tropical de…       NA    25
5        5 Amy    1975     6    28     0  31.5 -78.8 tropical de…       NA    25
6        6 Amy    1975     6    28     6  32.4 -78.7 tropical de…       NA    25
# ℹ 3 more variables: pressure <dbl>, tropicalstorm_force_diameter <dbl>,
#   hurricane_force_diameter <dbl>

C.2.2 Other Delimited Files

Not all text-based data files use commas as separators. Some use tabs, semicolons, or other characters. The readr package (included when loading the tidyverse) has functions for common variants:

# Tab-separated files
my_data <- read_tsv("data_file.tsv")

# Semicolon-separated files (common in some European data)
my_data <- read_delim("data_file.csv", delim = ";")

C.2.3 Excel Files

Many datasets, especially those from government agencies or organizations, come in Excel format (.xlsx or .xls). To read Excel files, you need the readxl package, which is installed as part of the tidyverse but not loaded automatically:

library(readxl)

# Read the first sheet of an Excel file
my_data <- read_excel("data_file.xlsx")

# Read a specific sheet
my_data <- read_excel("data_file.xlsx", sheet = "Sheet2")

# Read a specific sheet by number
my_data <- read_excel("data_file.xlsx", sheet = 2)

C.2.4 Stata and Other Statistical Software Files

If you’re working with data from other statistical software, the haven package (also installed with tidyverse but not loaded automatically) can read files from Stata (.dta), SPSS (.sav), and SAS (.sas7bdat):

library(haven)

# Read a Stata file
stata_data <- read_dta("data_file.dta")

# Read an SPSS file
spss_data <- read_sav("data_file.sav")

Stata files are particularly common in economics research, so read_dta() is a function you will use often.

C.2.5 R Data Files

R has its own file formats for saving data. The most common is .rds, which stores a single R object:

# Read an RDS file
my_data <- read_rds("data_file.rds")

You may also encounter .RData or .rda files, which can store multiple objects at once. These are loaded using load():

# Load an RData file (objects are loaded into your environment
# with whatever names they were saved with)
load("data_file.RData")

Which Function Do I Use?

Here’s a quick reference for which import function to use based on your file type:

Table C.1: Import functions by file type

File Extension	Function	Package
`.csv`	`read_csv()`	readr (tidyverse)
`.tsv`	`read_tsv()`	readr (tidyverse)
`.xlsx` / `.xls`	`read_excel()`	readxl
`.dta`	`read_dta()`	haven
`.sav`	`read_sav()`	haven
`.rds`	`read_rds()`	readr (tidyverse)
`.RData`	`load()`	base R

C.3 First Things to Do with Your Data

You’ve imported your data—now what? Before jumping into any analysis, you should take a few minutes to explore the dataset and make sure everything looks right. For a deeper discussion of why this matters—and how descriptive statistics and visualization can help you catch errors—see Chapter 2. Here we’ll focus on the three functions you should run every time you load a new dataset.

Let’s demonstrate using the penguins dataset from the palmerpenguins package:

library(palmerpenguins)
data(penguins)

C.3.1 `head()`: Look at the First Few Rows

The head() function shows you the first few rows of your data, giving you an immediate sense of what variables are present and what the values look like:

head(penguins)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

By default, head() shows 6 rows. You can change this with the n argument (e.g., head(penguins, n = 10)).

C.3.2 `glimpse()`: Get a Compact Overview

The glimpse() function shows every column’s name, data type, and first several values. This is one of the most efficient ways to quickly understand a dataset’s structure:

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

From this output, we can immediately see that the dataset has 344 rows and 8 columns, and we can verify that each variable has the type we’d expect (e.g., species is a factor, bill_length_mm is a double).

C.3.3 `summary()`: Check Key Statistics

The summary() function provides basic descriptive statistics for every variable—minimums, maximums, means, medians, and counts of missing values:

summary(penguins)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2

This lets you quickly verify that variable ranges make sense and spot any missing data. For example, we can see that sex has 11 NA's—something to keep in mind for our analysis.

C.4 Data Structure: Wide vs. Long

One of the most important concepts in data management is the distinction between wide and long data formats. The same information can be arranged in fundamentally different ways, and the format you need depends on what you’re trying to do with the data. Understanding this distinction—and knowing how to switch between formats—is a crucial skill for applied econometrics.

C.4.1 Wide Data

In wide format, each row represents a single observational unit (like a country, state, or person), and each time period or category gets its own column. This is how data often appears in spreadsheets and published tables.

Here’s an example of wide-format data showing GDP per capita for several countries across three years:

gdp_wide <- tibble(
  country = c("United States", "United Kingdom", "Germany", "Japan"),
  gdp_2018 = c(62887, 42943, 47603, 39287),
  gdp_2019 = c(65298, 42330, 46468, 40247),
  gdp_2020 = c(63544, 40285, 45724, 39539)
)

gdp_wide

# A tibble: 4 × 4
  country        gdp_2018 gdp_2019 gdp_2020
  <chr>             <dbl>    <dbl>    <dbl>
1 United States     62887    65298    63544
2 United Kingdom    42943    42330    40285
3 Germany           47603    46468    45724
4 Japan             39287    40247    39539

Wide data is easy to read as a table. You can quickly compare across years for a given country by scanning across a row, or compare across countries for a given year by scanning down a column.

C.4.2 Long Data

In long format, each row represents a single observation for a single unit at a single point in time. Instead of having separate columns for each year, there is one column that identifies the time period and another column that holds the value.

Here is the same GDP data in long format:

gdp_long <- tibble(
  country = rep(c("United States", "United Kingdom", "Germany", "Japan"), each = 3),
  year = rep(c(2018, 2019, 2020), times = 4),
  gdp_per_capita = c(62887, 65298, 63544,
                     42943, 42330, 40285,
                     47603, 46468, 45724,
                     39287, 40247, 39539)
)

gdp_long

# A tibble: 12 × 3
   country         year gdp_per_capita
   <chr>          <dbl>          <dbl>
 1 United States   2018          62887
 2 United States   2019          65298
 3 United States   2020          63544
 4 United Kingdom  2018          42943
 5 United Kingdom  2019          42330
 6 United Kingdom  2020          40285
 7 Germany         2018          47603
 8 Germany         2019          46468
 9 Germany         2020          45724
10 Japan           2018          39287
11 Japan           2019          40247
12 Japan           2020          39539

Long data has more rows but fewer columns. Each row is a single country-year observation. This format might seem less intuitive at first, but it is often the format that R functions—especially ggplot2 for visualization and regression functions—expect.

C.4.3 When to Use Which Format

Wide format is natural for presenting data in tables, reading data quickly by eye, and working with data where you want to compare values across time periods within a row.

Long format is necessary for most data analysis in R. Specifically, ggplot2 works best with long data because you can map variables like year to aesthetics (like color or facets). Regression functions like lm() and feols() require long data when you want to include time as a variable. The dplyr verbs like group_by() and summarize() are designed for long data.

As a general rule: if your data has column names that contain variable values (like gdp_2018, gdp_2019, gdp_2020), it is probably in wide format and may need to be converted to long format for analysis.

C.5 Reshaping Data with `pivot_longer()` and `pivot_wider()`

The tidyr package (part of the tidyverse) provides two functions for reshaping data: pivot_longer() converts wide data to long, and pivot_wider() converts long data to wide.

C.5.1 Wide to Long with `pivot_longer()`

Let’s convert our wide GDP data to long format:

gdp_long <- gdp_wide |>
  pivot_longer(
    cols = starts_with("gdp_"),
    names_to = "year",
    values_to = "gdp_per_capita"
  )

gdp_long

# A tibble: 12 × 3
   country        year     gdp_per_capita
   <chr>          <chr>             <dbl>
 1 United States  gdp_2018          62887
 2 United States  gdp_2019          65298
 3 United States  gdp_2020          63544
 4 United Kingdom gdp_2018          42943
 5 United Kingdom gdp_2019          42330
 6 United Kingdom gdp_2020          40285
 7 Germany        gdp_2018          47603
 8 Germany        gdp_2019          46468
 9 Germany        gdp_2020          45724
10 Japan          gdp_2018          39287
11 Japan          gdp_2019          40247
12 Japan          gdp_2020          39539

Let’s break down the three key arguments:

cols tells R which columns to “pivot” or stack. Here we use starts_with("gdp_") to select all columns whose names begin with "gdp_". You could also list the columns explicitly as cols = c(gdp_2018, gdp_2019, gdp_2020).
names_to is the name of the new column that will store the old column names. We call it "year" because the old column names (gdp_2018, gdp_2019, gdp_2020) represent years.
values_to is the name of the new column that will store the values. We call it "gdp_per_capita".

Notice that the year column still has the gdp_ prefix attached. We can clean that up by adding names_prefix to strip the prefix:

gdp_long <- gdp_wide |>
  pivot_longer(
    cols = starts_with("gdp_"),
    names_to = "year",
    names_prefix = "gdp_",
    values_to = "gdp_per_capita"
  )

gdp_long

# A tibble: 12 × 3
   country        year  gdp_per_capita
   <chr>          <chr>          <dbl>
 1 United States  2018           62887
 2 United States  2019           65298
 3 United States  2020           63544
 4 United Kingdom 2018           42943
 5 United Kingdom 2019           42330
 6 United Kingdom 2020           40285
 7 Germany        2018           47603
 8 Germany        2019           46468
 9 Germany        2020           45724
10 Japan          2018           39287
11 Japan          2019           40247
12 Japan          2020           39539

Now the year column contains just the year numbers. However, notice that year is still a character type (because it was extracted from column names). We can convert it to numeric by adding names_transform:

gdp_long <- gdp_wide |>
  pivot_longer(
    cols = starts_with("gdp_"),
    names_to = "year",
    names_prefix = "gdp_",
    names_transform = list(year = as.integer),
    values_to = "gdp_per_capita"
  )

gdp_long

# A tibble: 12 × 3
   country         year gdp_per_capita
   <chr>          <int>          <dbl>
 1 United States   2018          62887
 2 United States   2019          65298
 3 United States   2020          63544
 4 United Kingdom  2018          42943
 5 United Kingdom  2019          42330
 6 United Kingdom  2020          40285
 7 Germany         2018          47603
 8 Germany         2019          46468
 9 Germany         2020          45724
10 Japan           2018          39287
11 Japan           2019          40247
12 Japan           2020          39539

Now year is a proper integer column, ready for use in regressions or plots.

C.5.2 A More Detailed Example

Let’s work through a slightly more complex example. Suppose you have state-level data on unemployment rates and poverty rates across several years:

state_wide <- tibble(
  state = c("Maine", "Vermont", "New Hampshire"),
  unemp_2018 = c(3.4, 2.7, 2.5),
  unemp_2019 = c(3.0, 2.4, 2.6),
  unemp_2020 = c(5.4, 5.0, 6.2),
  poverty_2018 = c(12.9, 11.3, 7.6),
  poverty_2019 = c(11.6, 10.2, 7.3),
  poverty_2020 = c(11.4, 10.8, 7.1)
)

state_wide

# A tibble: 3 × 7
  state  unemp_2018 unemp_2019 unemp_2020 poverty_2018 poverty_2019 poverty_2020
  <chr>       <dbl>      <dbl>      <dbl>        <dbl>        <dbl>        <dbl>
1 Maine         3.4        3          5.4         12.9         11.6         11.4
2 Vermo…        2.7        2.4        5           11.3         10.2         10.8
3 New H…        2.5        2.6        6.2          7.6          7.3          7.1

This dataset has two variables (unemployment and poverty) each measured across three years. To pivot this to long format, we need to handle the fact that column names encode two pieces of information: the variable name and the year. We use names_sep to tell pivot_longer() how to split the column names:

state_long <- state_wide |>
  pivot_longer(
    cols = -state,
    names_to = c("variable", "year"),
    names_sep = "_",
    values_to = "value"
  )

state_long

# A tibble: 18 × 4
   state         variable year  value
   <chr>         <chr>    <chr> <dbl>
 1 Maine         unemp    2018    3.4
 2 Maine         unemp    2019    3  
 3 Maine         unemp    2020    5.4
 4 Maine         poverty  2018   12.9
 5 Maine         poverty  2019   11.6
 6 Maine         poverty  2020   11.4
 7 Vermont       unemp    2018    2.7
 8 Vermont       unemp    2019    2.4
 9 Vermont       unemp    2020    5  
10 Vermont       poverty  2018   11.3
11 Vermont       poverty  2019   10.2
12 Vermont       poverty  2020   10.8
13 New Hampshire unemp    2018    2.5
14 New Hampshire unemp    2019    2.6
15 New Hampshire unemp    2020    6.2
16 New Hampshire poverty  2018    7.6
17 New Hampshire poverty  2019    7.3
18 New Hampshire poverty  2020    7.1

Here we used cols = -state to pivot all columns except state. The names_sep = "_" tells R to split each column name at the underscore, putting the first part into a column called variable and the second part into a column called year.

If you’d prefer each variable to have its own column (which is usually more useful for analysis), you can add one more step using pivot_wider():

state_tidy <- state_long |>
  pivot_wider(
    names_from = variable,
    values_from = value
  )

state_tidy

# A tibble: 9 × 4
  state         year  unemp poverty
  <chr>         <chr> <dbl>   <dbl>
1 Maine         2018    3.4    12.9
2 Maine         2019    3      11.6
3 Maine         2020    5.4    11.4
4 Vermont       2018    2.7    11.3
5 Vermont       2019    2.4    10.2
6 Vermont       2020    5      10.8
7 New Hampshire 2018    2.5     7.6
8 New Hampshire 2019    2.6     7.3
9 New Hampshire 2020    6.2     7.1

Now each row is a state-year observation with separate columns for unemployment and poverty—exactly the format you’d want for a regression or a plot.

C.5.3 Long to Wide with `pivot_wider()`

Sometimes you need to go in the other direction—converting long data to wide. This is common when creating summary tables or when a dataset arrived in long format but you need it wide for a specific purpose.

pivot_wider() is essentially the reverse of pivot_longer():

gdp_wide_again <- gdp_long |>
  pivot_wider(
    names_from = year,
    values_from = gdp_per_capita,
    names_prefix = "gdp_"
  )

gdp_wide_again

# A tibble: 4 × 4
  country        gdp_2018 gdp_2019 gdp_2020
  <chr>             <dbl>    <dbl>    <dbl>
1 United States     62887    65298    63544
2 United Kingdom    42943    42330    40285
3 Germany           47603    46468    45724
4 Japan             39287    40247    39539

The key arguments for pivot_wider() are:

names_from specifies which column contains the values that will become the new column names.
values_from specifies which column contains the values that will fill those new columns.
names_prefix (optional) adds a prefix to the new column names. Without it, the columns would just be named 2018, 2019, 2020—which are valid R column names but harder to work with because they start with numbers.

C.5.4 Another `pivot_wider()` Example

Suppose you have panel data on birth outcomes and want to create a table comparing states:

birth_long <- tibble(
  state = rep(c("Maine", "Vermont", "New Hampshire"), each = 3),
  year = rep(2018:2020, times = 3),
  lbw_rate = c(7.1, 7.3, 7.0,
               6.8, 6.5, 6.9,
               6.2, 6.4, 6.1)
)

birth_long

# A tibble: 9 × 3
  state          year lbw_rate
  <chr>         <int>    <dbl>
1 Maine          2018      7.1
2 Maine          2019      7.3
3 Maine          2020      7  
4 Vermont        2018      6.8
5 Vermont        2019      6.5
6 Vermont        2020      6.9
7 New Hampshire  2018      6.2
8 New Hampshire  2019      6.4
9 New Hampshire  2020      6.1

To create a wide table comparing states across years:

birth_wide <- birth_long |>
  pivot_wider(
    names_from = year,
    values_from = lbw_rate
  )

birth_wide

# A tibble: 3 × 4
  state         `2018` `2019` `2020`
  <chr>          <dbl>  <dbl>  <dbl>
1 Maine            7.1    7.3    7  
2 Vermont          6.8    6.5    6.9
3 New Hampshire    6.2    6.4    6.1

This wide format is convenient for a presentation table where you want to compare across years at a glance.

Remembering the Difference

A helpful way to remember: pivot_longer() makes your data longer (more rows, fewer columns), and pivot_wider() makes your data wider (fewer rows, more columns). If your column names contain data values (like years), you probably need pivot_longer(). If you want to spread values across columns, you need pivot_wider().

C.6 Merging Data with Joins

In applied economics, the data you need for an analysis rarely comes in a single file. You might have county-level health outcomes in one dataset and county-level demographic characteristics in another, or state policy variables in one file and state economic indicators in a second. To bring these together for regression analysis, you need to merge (or join) the datasets.

The tidyverse provides a family of _join() functions that merge two data frames based on one or more shared key variables. The key variable is simply a column that appears in both datasets and identifies which rows should be matched—a county FIPS code, a state abbreviation, a year, or some combination of these.

C.6.1 A Simple Example

Suppose you have two small datasets about states. One contains population data and the other contains policy information:

library(tidyverse)

# Population data
pop_data <- tibble(
    state = c("NC", "VA", "SC", "GA"),
    population = c(10.7, 8.6, 5.2, 10.9)
)

# Policy data
policy_data <- tibble(
    state = c("NC", "VA", "SC", "TN"),
    medicaid_expanded = c(TRUE, TRUE, FALSE, TRUE)
)

pop_data

# A tibble: 4 × 2
  state population
  <chr>      <dbl>
1 NC          10.7
2 VA           8.6
3 SC           5.2
4 GA          10.9

policy_data

# A tibble: 4 × 2
  state medicaid_expanded
  <chr> <lgl>            
1 NC    TRUE             
2 VA    TRUE             
3 SC    FALSE            
4 TN    TRUE

Notice that the two datasets overlap but are not identical: both contain NC, VA, and SC, but pop_data has GA while policy_data has TN. How the join handles these non-overlapping rows is what distinguishes the different join types.

C.6.2 `left_join()`

A left join keeps all rows from the first (left) dataset and adds matching columns from the second (right) dataset. Rows in the left dataset that have no match in the right dataset get NA for the new columns:

left_join(pop_data, policy_data, by = "state")

# A tibble: 4 × 3
  state population medicaid_expanded
  <chr>      <dbl> <lgl>            
1 NC          10.7 TRUE             
2 VA           8.6 TRUE             
3 SC           5.2 FALSE            
4 GA          10.9 NA

All four states from pop_data are retained. GA has no match in policy_data, so its medicaid_expanded value is NA. TN, which appears only in policy_data, is dropped because it is not in the left dataset.

This is the most commonly used join in applied work. When you have a “main” analysis dataset and want to add variables from a supplementary source, left_join() ensures you don’t lose any of your primary observations.

C.6.3 `right_join()`

A right join is the mirror image: it keeps all rows from the second (right) dataset and fills in NA where the left dataset has no match:

right_join(pop_data, policy_data, by = "state")

# A tibble: 4 × 3
  state population medicaid_expanded
  <chr>      <dbl> <lgl>            
1 NC          10.7 TRUE             
2 VA           8.6 TRUE             
3 SC           5.2 FALSE            
4 TN          NA   TRUE

Now TN is retained (with NA for population) and GA is dropped. In practice, right_join() is less common because you can always achieve the same result by switching the order of the datasets in a left_join().

C.6.4 `inner_join()`

An inner join keeps only rows that have a match in both datasets:

inner_join(pop_data, policy_data, by = "state")

# A tibble: 3 × 3
  state population medicaid_expanded
  <chr>      <dbl> <lgl>            
1 NC          10.7 TRUE             
2 VA           8.6 TRUE             
3 SC           5.2 FALSE

Only NC, VA, and SC appear—they are the states present in both datasets. GA and TN are both dropped.

Use inner_join() when you only want observations for which you have complete data from both sources. The downside is that you may silently lose observations, so always check how many rows you started with and how many the join returns.

C.6.5 `full_join()`

A full join keeps all rows from both datasets, filling in NA wherever a match is missing:

full_join(pop_data, policy_data, by = "state")

# A tibble: 5 × 3
  state population medicaid_expanded
  <chr>      <dbl> <lgl>            
1 NC          10.7 TRUE             
2 VA           8.6 TRUE             
3 SC           5.2 FALSE            
4 GA          10.9 NA               
5 TN          NA   TRUE

All five states appear. GA has NA for medicaid_expanded and TN has NA for population. This is useful when you want to see the complete universe of observations across both sources and deal with missing values later.

C.6.6 Joining on Multiple Key Variables

Many economics datasets require matching on more than one variable. Panel data, for instance, typically needs to be joined on both a unit identifier and a time variable. You can pass a vector of column names to the by argument:

# State-year economic data
econ_data <- tibble(
    state = c("NC", "NC", "VA", "VA"),
    year  = c(2020, 2021, 2020, 2021),
    unem_rate = c(6.8, 4.1, 6.2, 3.8)
)

# State-year health data
health_data <- tibble(
    state = c("NC", "NC", "VA", "VA"),
    year  = c(2020, 2021, 2020, 2021),
    infant_mort = c(7.1, 6.9, 5.8, 5.6)
)

left_join(econ_data, health_data, by = c("state", "year"))

# A tibble: 4 × 4
  state  year unem_rate infant_mort
  <chr> <dbl>     <dbl>       <dbl>
1 NC     2020       6.8         7.1
2 NC     2021       4.1         6.9
3 VA     2020       6.2         5.8
4 VA     2021       3.8         5.6

Each row is matched on the combination of state and year, so North Carolina in 2020 gets paired with the correct infant mortality rate for that state and year. If you only joined on state, each NC row would match both NC-2020 and NC-2021 from the other dataset, creating duplicate rows—a common mistake that inflates your sample size and produces incorrect results.

C.6.7 Joining When Key Variables Have Different Names

Sometimes the key variable is named differently in the two datasets. Rather than renaming a column before the merge, you can specify the mapping directly:

# This dataset uses "st" instead of "state"
policy_data_alt <- tibble(
    st = c("NC", "VA", "SC", "TN"),
    medicaid_expanded = c(TRUE, TRUE, FALSE, TRUE)
)

left_join(pop_data, policy_data_alt, by = c("state" = "st"))

# A tibble: 4 × 3
  state population medicaid_expanded
  <chr>      <dbl> <lgl>            
1 NC          10.7 TRUE             
2 VA           8.6 TRUE             
3 SC           5.2 FALSE            
4 GA          10.9 NA

The by = c("state" = "st") syntax tells R to match state in the left dataset to st in the right dataset. The output keeps the column name from the left dataset.

C.6.8 Diagnosing Join Problems

Merges can go wrong in ways that are easy to miss if you don’t check. Here are two things to watch for after every join:

Check the row count. If your merged dataset has more rows than your original dataset, you likely have duplicate keys in the right-hand dataset—meaning one row in your main data matched multiple rows in the supplementary data. This is the most common join mistake in applied work:

# Accidental duplicates in the lookup table
bad_lookup <- tibble(
    state = c("NC", "NC", "VA", "SC"),
    region = c("South", "Southeast", "South", "South")
)

# This produces MORE rows than we started with
result <- left_join(pop_data, bad_lookup, by = "state")
result

# A tibble: 5 × 3
  state population region   
  <chr>      <dbl> <chr>    
1 NC          10.7 South    
2 NC          10.7 Southeast
3 VA           8.6 South    
4 SC           5.2 South    
5 GA          10.9 <NA>

nrow(pop_data)

[1] 4

nrow(result)

[1] 5

NC appeared twice in bad_lookup, so the join created two rows for NC. If this goes unnoticed, every subsequent analysis is wrong. Always compare nrow() before and after a join.

Check for unexpected NA values. After a left_join(), scan the new columns for NA values. A handful of missing values might be expected (not every observation will have a match), but a large number suggests a problem with the key variable—perhaps different formatting, spelling differences, or leading zeros in FIPS codes that were stripped when the data was read in.

Join Quick Reference

Table C.2: Join functions and what they retain

Function	Keeps
`left_join(x, y)`	All rows from `x`, matching rows from `y`
`right_join(x, y)`	All rows from `y`, matching rows from `x`
`inner_join(x, y)`	Only rows with matches in both
`full_join(x, y)`	All rows from both

C.7 Exporting Data

After cleaning, reshaping, or creating a new dataset, you’ll often want to save it for future use or to share with others. The tidyverse provides straightforward functions for exporting data.

C.7.1 Writing CSV Files

The most common export format is CSV, because it can be opened by virtually any software:

# Write a data frame to a CSV file
write_csv(gdp_long, "gdp_data_clean.csv")

This creates a file called gdp_data_clean.csv in your working directory. The file will contain the data in plain text with commas separating the values—no special formatting, no formulas, just clean data.

C.7.2 Writing R Data Files

If you’re saving data that will only be used in R, the .rds format is preferable because it preserves all of R’s data types (factors, dates, etc.) and is more compact than CSV:

# Save as an RDS file
write_rds(gdp_long, "gdp_data_clean.rds")

To read it back in later:

gdp_long <- read_rds("gdp_data_clean.rds")

C.7.3 Writing Stata Files

If you need to share data with someone using Stata, the haven package can write .dta files:

library(haven)

write_dta(gdp_long, "gdp_data_clean.dta")

C.7.4 Writing Excel Files

To export to Excel format, you can use the writexl package:

library(writexl)

write_xlsx(gdp_long, "gdp_data_clean.xlsx")

Export Function Quick Reference

Table C.3: Export functions by file type

Format	Function	Package
CSV	`write_csv()`	readr (tidyverse)
TSV	`write_tsv()`	readr (tidyverse)
RDS	`write_rds()`	readr (tidyverse)
Stata (.dta)	`write_dta()`	haven
Excel (.xlsx)	`write_xlsx()`	writexl

C.8 Summary and Conclusion

This appendix covered the essential workflow for getting data into and out of R and understanding its structure along the way.

C.8.1 Key Takeaways

Documentation first: Always look for README files and codebooks before analyzing a dataset. Understanding what your variables measure and how they’re coded prevents errors downstream.
Import with the right function: Use read_csv() for CSV files, read_excel() for Excel files, read_dta() for Stata files, and read_rds() for R data files. The tidyverse and its companion packages cover virtually every common file format.
Explore before you analyze: Run head(), glimpse(), and summary() on every new dataset. Check for missing values and verify that variable types and ranges make sense. This quick health check catches problems early.
Know wide vs. long: Wide data has one row per unit with separate columns for time periods; long data has one row per unit-time observation. Most R analysis functions expect long data.
Reshape with pivot functions: Use pivot_longer() to go from wide to long and pivot_wider() to go from long to wide. The key arguments are cols, names_to/names_from, and values_to/values_from.
Merge with joins: Use left_join() to add variables from a supplementary dataset while keeping all rows in your main data. Always join on the correct key variables (including both unit and time identifiers for panel data), and check your row count before and after to catch duplicate-key problems.
Export in the right format: Use write_csv() for universal compatibility, write_rds() for R-only workflows, and format-specific functions like write_dta() when sharing with colleagues who use other software.

C.9 Check Your Understanding

For each question below, select the best answer from the dropdown menu. The dropdown will turn green if correct and red if incorrect. Click the “Show Explanation” toggle to see a full explanation of the answer after attempting each question.

i. Which function should you use to read a CSV file using the tidyverse?

ii. What does the glimpse() function show you?

iii. A dataset has columns named state, pop_2018, pop_2019, and pop_2020. What format is this data in?

iv. Which function converts wide data to long data?

v. In pivot_longer(), what does the values_to argument specify?

vi. What is the main advantage of saving data as an .rds file rather than a .csv file?

vii. (Challenge) You have a long dataset with columns country, year, and gdp. You want to create a table where each year is its own column. Which arguments would you use in pivot_wider()?

Show Explanation

The tidyverse function for reading CSV files is read_csv() (with an underscore), from the readr package. Note that base R has a similar function called read.csv() (with a period), but read_csv() is faster, handles column types more intelligently, and produces tibbles rather than basic data frames.
glimpse() provides a compact overview of a data frame showing every column’s name, its data type (e.g., <dbl>, <chr>, <fct>), and the first several values. This makes it one of the most efficient ways to quickly understand a dataset’s structure. head() shows rows, and summary() shows statistics.
This is wide format data. The giveaway is that the column names contain data values (the years 2018, 2019, 2020). In long format, there would be a single year column and a single population column, with each state-year combination as its own row.
pivot_longer() converts wide data to long data by stacking multiple columns into two new columns (one for the names, one for the values). pivot_wider() does the opposite. The older functions gather() and spread() did similar things but have been superseded by the pivot functions.
The values_to argument specifies the name of the new column that will contain the data values from the pivoted columns. For example, if you’re pivoting columns gdp_2018, gdp_2019, and gdp_2020, the values in those columns will all go into the column you name in values_to. The names_to argument handles the column names, and cols specifies which columns to pivot.
RDS files preserve all of R’s data types, including factors, dates, and other special types. When you save a factor as CSV and read it back, it becomes a character variable and you have to convert it again. RDS files also tend to be smaller because they use compression. However, RDS files can only be read by R, which is a trade-off compared to the universally readable CSV format.
In pivot_wider(), names_from specifies the column whose values will become the new column names (here, year values like 2018, 2019, 2020 become column names), and values_from specifies the column whose values will fill those new columns (here, the gdp values). Note that pivot_wider() uses names_from/values_from, while pivot_longer() uses names_to/values_to—the “from” and “to” suffixes indicate the direction of the transformation.