Appendix C — Importing and Working with Data
- How do I import data files into R?
- What are README files and codebooks, and why do they matter?
- What are the first things I should do after loading a dataset?
- What is the difference between wide and long data?
- How do I reshape data between wide and long formats?
- How do I export data from R?
- Wickham, Çetinkaya-Rundel, and Grolemund (2023), Ch. 7-8, 11
Every empirical analysis starts with the same step: getting data into R. Whether you’re working with a dataset from a published paper, government statistics, or data you collected yourself, you need to know how to import it, understand its structure, and get it into the right shape for your analysis. This appendix walks you through the entire process, from reading in a data file to reshaping and exporting your results.
C.1 README Files and Codebooks
Before you ever open a dataset in R, you should look for its documentation. Two types of documentation are especially important: the README file and the codebook.
C.1.1 README Files
A README file provides a high-level overview of a dataset or project. It typically contains information about where the data came from, when it was collected, who collected it, and any important notes about how the data should be used or cited. README files are usually plain text (.txt) or Markdown (.md) files.
A good README for a research dataset might look something like this:
================================================================
README: County-Level Birth Outcomes Data
================================================================
Source: National Vital Statistics System (NVSS)
Years: 2015-2020
Unit: County-year
Coverage: All U.S. counties with 100+ births per year
Description:
This dataset contains county-level measures of birth outcomes
including low birth weight rates, preterm birth rates, and
infant mortality rates. Data are derived from individual birth
certificate records aggregated to the county level.
Files:
birth_outcomes.csv - Main analysis dataset
county_covariates.csv - County demographic characteristics
codebook.txt - Variable descriptions and coding
Notes:
- Counties with fewer than 100 births in a given year are
suppressed for privacy.
- Infant mortality rates are per 1,000 live births.
- FIPS codes follow 2015 county definitions.
Citation:
National Center for Health Statistics. National Vital
Statistics System, Natality Data, 2015-2020.
================================================================
Even if a dataset doesn’t come with a README, you should create one for your own projects. When you come back to an analysis six months later, you will thank yourself for writing down where the data came from and what the key variables mean.
C.1.2 Codebooks
A codebook (sometimes called a data dictionary) is more detailed than a README. It describes every variable in the dataset: what it measures, what values it can take, how missing values are coded, and any important notes about interpretation.
Here’s an example of what a codebook might look like for the birth outcomes dataset described above:
================================================================
CODEBOOK: birth_outcomes.csv
================================================================
Variable Type Description
----------------------------------------------------------------
fips character 5-digit county FIPS code
state character 2-letter state abbreviation
county_name character County name
year integer Calendar year (2015-2020)
total_births integer Total number of live births
lbw_rate numeric Low birth weight rate (% of births
under 2,500 grams)
preterm_rate numeric Preterm birth rate (% of births
before 37 weeks gestation)
imr numeric Infant mortality rate
(deaths per 1,000 live births)
median_income numeric County median household income
(in 2020 dollars)
pct_uninsured numeric Percent of population uninsured
urban integer 1 = metropolitan county,
0 = non-metropolitan county
Missing Values: Coded as NA
================================================================
Before you start analyzing any dataset, look for the README and codebook. Understanding what your variables actually measure—and how they are coded—is essential for avoiding errors in your analysis. For example, if you don’t know that imr is measured per 1,000 live births, you might mistakenly interpret a coefficient as a percentage change.
C.2 Importing Data
The tidyverse includes a package called readr that provides fast, consistent functions for reading data files into R. When you load the tidyverse with library(tidyverse), readr is automatically loaded for you.
C.2.1 CSV Files
The most common data format you’ll encounter is the CSV (comma-separated values) file. A CSV file is essentially a plain text file where each row is an observation and each value is separated by commas. To read a CSV file, use read_csv():
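For example, assuming a file named birth_outcomes.csv sits in your working directory (the file name here is just a placeholder):

```r
library(tidyverse)  # loads readr, which provides read_csv()

# Read a CSV file from your working directory
birth_data <- read_csv("birth_outcomes.csv")
```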
When you run read_csv(), it will print a message showing you what type it assigned to each column. This is helpful for catching issues—for example, if a column you expected to be numeric shows up as character, that might indicate there are non-numeric entries (like “N/A” or “-”) in the data that need to be addressed.
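If you know in advance which strings the file uses to mark missing data, you can pass them to read_csv() via its na argument so the affected columns parse as numeric instead of character. A sketch, again with a placeholder file name:

```r
library(tidyverse)

# Treat empty strings, "NA", "N/A", and "-" as missing values
birth_data <- read_csv(
  "birth_outcomes.csv",
  na = c("", "NA", "N/A", "-")
)
```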
You can also read CSV files directly from a URL, which is convenient for datasets hosted online:
# Read a CSV file from a URL
storms_data <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv")
head(storms_data)
# A tibble: 6 × 14
rownames name year month day hour lat long status category wind
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 1 Amy 1975 6 27 0 27.5 -79 tropical de… NA 25
2 2 Amy 1975 6 27 6 28.5 -79 tropical de… NA 25
3 3 Amy 1975 6 27 12 29.5 -79 tropical de… NA 25
4 4 Amy 1975 6 27 18 30.5 -79 tropical de… NA 25
5 5 Amy 1975 6 28 0 31.5 -78.8 tropical de… NA 25
6 6 Amy 1975 6 28 6 32.4 -78.7 tropical de… NA 25
# ℹ 3 more variables: pressure <dbl>, tropicalstorm_force_diameter <dbl>,
# hurricane_force_diameter <dbl>
C.2.2 Other Delimited Files
Not all text-based data files use commas as separators. Some use tabs, semicolons, or other characters. The readr package (included when loading the tidyverse) has functions for common variants:
# Tab-separated files
my_data <- read_tsv("data_file.tsv")
# Semicolon-separated files (common in some European data)
my_data <- read_delim("data_file.csv", delim = ";")
C.2.3 Excel Files
Many datasets, especially those from government agencies or organizations, come in Excel format (.xlsx or .xls). To read Excel files, you need the readxl package, which is installed as part of the tidyverse but not loaded automatically:
library(readxl)
# Read the first sheet of an Excel file
my_data <- read_excel("data_file.xlsx")
# Read a specific sheet
my_data <- read_excel("data_file.xlsx", sheet = "Sheet2")
# Read a specific sheet by number
my_data <- read_excel("data_file.xlsx", sheet = 2)
C.2.4 Stata and Other Statistical Software Files
If you’re working with data from other statistical software, the haven package (also installed with tidyverse but not loaded automatically) can read files from Stata (.dta), SPSS (.sav), and SAS (.sas7bdat):
library(haven)
# Read a Stata file
stata_data <- read_dta("data_file.dta")
# Read an SPSS file
spss_data <- read_sav("data_file.sav")
Stata files are particularly common in economics research, so read_dta() is a function you will use often.
C.2.5 R Data Files
R has its own file formats for saving data. The most common is .rds, which stores a single R object:
# Read an RDS file
my_data <- read_rds("data_file.rds")
You may also encounter .RData or .rda files, which can store multiple objects at once. These are loaded using load():
# Load an RData file (objects are loaded into your environment
# with whatever names they were saved with)
load("data_file.RData")
Here’s a quick reference for which import function to use based on your file type:
| File Extension | Function | Package |
|---|---|---|
| .csv | read_csv() | readr (tidyverse) |
| .tsv | read_tsv() | readr (tidyverse) |
| .xlsx / .xls | read_excel() | readxl |
| .dta | read_dta() | haven |
| .sav | read_sav() | haven |
| .rds | read_rds() | readr (tidyverse) |
| .RData | load() | base R |
C.3 First Things to Do with Your Data
You’ve imported your data—now what? Before jumping into any analysis, you should take a few minutes to explore the dataset and make sure everything looks right. For a deeper discussion of why this matters—and how descriptive statistics and visualization can help you catch errors—see Chapter 2. Here we’ll focus on the three functions you should run every time you load a new dataset.
Let’s demonstrate using the penguins dataset from the palmerpenguins package:
library(palmerpenguins)
data(penguins)
C.3.1 head(): Look at the First Few Rows
The head() function shows you the first few rows of your data, giving you an immediate sense of what variables are present and what the values look like:
head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
By default, head() shows 6 rows. You can change this with the n argument (e.g., head(penguins, n = 10)).
C.3.2 glimpse(): Get a Compact Overview
The glimpse() function shows every column’s name, data type, and first several values. This is one of the most efficient ways to quickly understand a dataset’s structure:
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
From this output, we can immediately see that the dataset has 344 rows and 8 columns, and we can verify that each variable has the type we’d expect (e.g., species is a factor, bill_length_mm is a double).
C.3.3 summary(): Check Key Statistics
The summary() function provides basic descriptive statistics for every variable—minimums, maximums, means, medians, and counts of missing values:
summary(penguins)
    species          island    bill_length_mm  bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
This lets you quickly verify that variable ranges make sense and spot any missing data. For example, we can see that sex has 11 NA's—something to keep in mind for our analysis.
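When you want exact missing-value counts for every column at once, a short dplyr sketch (one approach among several) gets you there:

```r
library(tidyverse)
library(palmerpenguins)

# Count the NAs in each column of penguins
penguins |>
  summarise(across(everything(), \(x) sum(is.na(x))))
```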
C.4 Data Structure: Wide vs. Long
One of the most important concepts in data management is the distinction between wide and long data formats. The same information can be arranged in fundamentally different ways, and the format you need depends on what you’re trying to do with the data. Understanding this distinction—and knowing how to switch between formats—is a crucial skill for applied econometrics.
C.4.1 Wide Data
In wide format, each row represents a single observational unit (like a country, state, or person), and each time period or category gets its own column. This is how data often appears in spreadsheets and published tables.
Here’s an example of wide-format data showing GDP per capita for several countries across three years:
gdp_wide <- tibble(
country = c("United States", "United Kingdom", "Germany", "Japan"),
gdp_2018 = c(62887, 42943, 47603, 39287),
gdp_2019 = c(65298, 42330, 46468, 40247),
gdp_2020 = c(63544, 40285, 45724, 39539)
)
gdp_wide
# A tibble: 4 × 4
country gdp_2018 gdp_2019 gdp_2020
<chr> <dbl> <dbl> <dbl>
1 United States 62887 65298 63544
2 United Kingdom 42943 42330 40285
3 Germany 47603 46468 45724
4 Japan 39287 40247 39539
Wide data is easy to read as a table. You can quickly compare across years for a given country by scanning across a row, or compare across countries for a given year by scanning down a column.
C.4.2 Long Data
In long format, each row represents a single observation for a single unit at a single point in time. Instead of having separate columns for each year, there is one column that identifies the time period and another column that holds the value.
Here is the same GDP data in long format:
gdp_long <- tibble(
country = rep(c("United States", "United Kingdom", "Germany", "Japan"), each = 3),
year = rep(c(2018, 2019, 2020), times = 4),
gdp_per_capita = c(62887, 65298, 63544,
42943, 42330, 40285,
47603, 46468, 45724,
39287, 40247, 39539)
)
gdp_long
# A tibble: 12 × 3
country year gdp_per_capita
<chr> <dbl> <dbl>
1 United States 2018 62887
2 United States 2019 65298
3 United States 2020 63544
4 United Kingdom 2018 42943
5 United Kingdom 2019 42330
6 United Kingdom 2020 40285
7 Germany 2018 47603
8 Germany 2019 46468
9 Germany 2020 45724
10 Japan 2018 39287
11 Japan 2019 40247
12 Japan 2020 39539
Long data has more rows but fewer columns. Each row is a single country-year observation. This format might seem less intuitive at first, but it is often the format that R functions—especially ggplot2 for visualization and regression functions—expect.
C.4.3 When to Use Which Format
Wide format is natural for presenting data in tables, reading data quickly by eye, and working with data where you want to compare values across time periods within a row.
Long format is necessary for most data analysis in R. Specifically, ggplot2 works best with long data because you can map variables like year to aesthetics (like color or facets). Regression functions like lm() and feols() require long data when you want to include time as a variable. The dplyr verbs like group_by() and summarize() are designed for long data.
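To see the payoff concretely, here is a sketch that plots the gdp_long tibble from above: year maps straight to the x-axis and country to color, something that would be awkward with the wide version.

```r
library(tidyverse)

# One line per country; this works because each row of gdp_long
# is a single country-year observation
gdp_long |>
  ggplot(aes(x = year, y = gdp_per_capita, color = country)) +
  geom_line() +
  labs(y = "GDP per capita (USD)")
```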
As a general rule: if your data has column names that contain variable values (like gdp_2018, gdp_2019, gdp_2020), it is probably in wide format and may need to be converted to long format for analysis.
C.5 Reshaping Data with pivot_longer() and pivot_wider()
The tidyr package (part of the tidyverse) provides two functions for reshaping data: pivot_longer() converts wide data to long, and pivot_wider() converts long data to wide.
C.5.1 Wide to Long with pivot_longer()
Let’s convert our wide GDP data to long format:
gdp_long <- gdp_wide |>
pivot_longer(
cols = starts_with("gdp_"),
names_to = "year",
values_to = "gdp_per_capita"
)
gdp_long
# A tibble: 12 × 3
country year gdp_per_capita
<chr> <chr> <dbl>
1 United States gdp_2018 62887
2 United States gdp_2019 65298
3 United States gdp_2020 63544
4 United Kingdom gdp_2018 42943
5 United Kingdom gdp_2019 42330
6 United Kingdom gdp_2020 40285
7 Germany gdp_2018 47603
8 Germany gdp_2019 46468
9 Germany gdp_2020 45724
10 Japan gdp_2018 39287
11 Japan gdp_2019 40247
12 Japan gdp_2020 39539
Let’s break down the three key arguments:
- cols tells R which columns to “pivot” or stack. Here we use starts_with("gdp_") to select all columns whose names begin with "gdp_". You could also list the columns explicitly as cols = c(gdp_2018, gdp_2019, gdp_2020).
- names_to is the name of the new column that will store the old column names. We call it "year" because the old column names (gdp_2018, gdp_2019, gdp_2020) represent years.
- values_to is the name of the new column that will store the values. We call it "gdp_per_capita".
Notice that the year column still has the gdp_ prefix attached. We can clean that up by adding names_prefix to strip the prefix:
gdp_long <- gdp_wide |>
pivot_longer(
cols = starts_with("gdp_"),
names_to = "year",
names_prefix = "gdp_",
values_to = "gdp_per_capita"
)
gdp_long
# A tibble: 12 × 3
country year gdp_per_capita
<chr> <chr> <dbl>
1 United States 2018 62887
2 United States 2019 65298
3 United States 2020 63544
4 United Kingdom 2018 42943
5 United Kingdom 2019 42330
6 United Kingdom 2020 40285
7 Germany 2018 47603
8 Germany 2019 46468
9 Germany 2020 45724
10 Japan 2018 39287
11 Japan 2019 40247
12 Japan 2020 39539
Now the year column contains just the year numbers. However, notice that year is still a character type (because it was extracted from column names). We can convert it to numeric by adding names_transform:
gdp_long <- gdp_wide |>
pivot_longer(
cols = starts_with("gdp_"),
names_to = "year",
names_prefix = "gdp_",
names_transform = list(year = as.integer),
values_to = "gdp_per_capita"
)
gdp_long
# A tibble: 12 × 3
country year gdp_per_capita
<chr> <int> <dbl>
1 United States 2018 62887
2 United States 2019 65298
3 United States 2020 63544
4 United Kingdom 2018 42943
5 United Kingdom 2019 42330
6 United Kingdom 2020 40285
7 Germany 2018 47603
8 Germany 2019 46468
9 Germany 2020 45724
10 Japan 2018 39287
11 Japan 2019 40247
12 Japan 2020 39539
Now year is a proper integer column, ready for use in regressions or plots.
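As a quick illustration, once year is numeric it can enter a model formula directly. A sketch only: with twelve observations there is nothing substantive to estimate here.

```r
# Linear trend in GDP per capita, pooling all four countries
fit <- lm(gdp_per_capita ~ year, data = gdp_long)
summary(fit)
```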
C.5.2 A More Detailed Example
Let’s work through a slightly more complex example. Suppose you have state-level data on unemployment rates and poverty rates across several years:
state_wide <- tibble(
state = c("Maine", "Vermont", "New Hampshire"),
unemp_2018 = c(3.4, 2.7, 2.5),
unemp_2019 = c(3.0, 2.4, 2.6),
unemp_2020 = c(5.4, 5.0, 6.2),
poverty_2018 = c(12.9, 11.3, 7.6),
poverty_2019 = c(11.6, 10.2, 7.3),
poverty_2020 = c(11.4, 10.8, 7.1)
)
state_wide
# A tibble: 3 × 7
state unemp_2018 unemp_2019 unemp_2020 poverty_2018 poverty_2019 poverty_2020
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Maine 3.4 3 5.4 12.9 11.6 11.4
2 Vermo… 2.7 2.4 5 11.3 10.2 10.8
3 New H… 2.5 2.6 6.2 7.6 7.3 7.1
This dataset has two variables (unemployment and poverty) each measured across three years. To pivot this to long format, we need to handle the fact that column names encode two pieces of information: the variable name and the year. We use names_sep to tell pivot_longer() how to split the column names:
state_long <- state_wide |>
pivot_longer(
cols = -state,
names_to = c("variable", "year"),
names_sep = "_",
values_to = "value"
)
state_long
# A tibble: 18 × 4
state variable year value
<chr> <chr> <chr> <dbl>
1 Maine unemp 2018 3.4
2 Maine unemp 2019 3
3 Maine unemp 2020 5.4
4 Maine poverty 2018 12.9
5 Maine poverty 2019 11.6
6 Maine poverty 2020 11.4
7 Vermont unemp 2018 2.7
8 Vermont unemp 2019 2.4
9 Vermont unemp 2020 5
10 Vermont poverty 2018 11.3
11 Vermont poverty 2019 10.2
12 Vermont poverty 2020 10.8
13 New Hampshire unemp 2018 2.5
14 New Hampshire unemp 2019 2.6
15 New Hampshire unemp 2020 6.2
16 New Hampshire poverty 2018 7.6
17 New Hampshire poverty 2019 7.3
18 New Hampshire poverty 2020 7.1
Here we used cols = -state to pivot all columns except state. The names_sep = "_" tells R to split each column name at the underscore, putting the first part into a column called variable and the second part into a column called year.
If you’d prefer each variable to have its own column (which is usually more useful for analysis), you can add one more step using pivot_wider():
state_tidy <- state_long |>
pivot_wider(
names_from = variable,
values_from = value
)
state_tidy
# A tibble: 9 × 4
state year unemp poverty
<chr> <chr> <dbl> <dbl>
1 Maine 2018 3.4 12.9
2 Maine 2019 3 11.6
3 Maine 2020 5.4 11.4
4 Vermont 2018 2.7 11.3
5 Vermont 2019 2.4 10.2
6 Vermont 2020 5 10.8
7 New Hampshire 2018 2.5 7.6
8 New Hampshire 2019 2.6 7.3
9 New Hampshire 2020 6.2 7.1
Now each row is a state-year observation with separate columns for unemployment and poverty—exactly the format you’d want for a regression or a plot.
C.5.3 Long to Wide with pivot_wider()
Sometimes you need to go in the other direction—converting long data to wide. This is common when creating summary tables or when a dataset arrived in long format but you need it wide for a specific purpose.
pivot_wider() is essentially the reverse of pivot_longer():
gdp_wide_again <- gdp_long |>
pivot_wider(
names_from = year,
values_from = gdp_per_capita,
names_prefix = "gdp_"
)
gdp_wide_again
# A tibble: 4 × 4
country gdp_2018 gdp_2019 gdp_2020
<chr> <dbl> <dbl> <dbl>
1 United States 62887 65298 63544
2 United Kingdom 42943 42330 40285
3 Germany 47603 46468 45724
4 Japan 39287 40247 39539
The key arguments for pivot_wider() are:
- names_from specifies which column contains the values that will become the new column names.
- values_from specifies which column contains the values that will fill those new columns.
- names_prefix (optional) adds a prefix to the new column names. Without it, the columns would just be named 2018, 2019, 2020. Tibbles allow such names, but because they start with digits they are not syntactically valid R names, so you would have to wrap them in backticks (e.g., `2018`) to refer to them.
C.5.4 Another pivot_wider() Example
Suppose you have panel data on birth outcomes and want to create a table comparing states:
birth_long <- tibble(
state = rep(c("Maine", "Vermont", "New Hampshire"), each = 3),
year = rep(2018:2020, times = 3),
lbw_rate = c(7.1, 7.3, 7.0,
6.8, 6.5, 6.9,
6.2, 6.4, 6.1)
)
birth_long
# A tibble: 9 × 3
state year lbw_rate
<chr> <int> <dbl>
1 Maine 2018 7.1
2 Maine 2019 7.3
3 Maine 2020 7
4 Vermont 2018 6.8
5 Vermont 2019 6.5
6 Vermont 2020 6.9
7 New Hampshire 2018 6.2
8 New Hampshire 2019 6.4
9 New Hampshire 2020 6.1
To create a wide table comparing states across years:
birth_wide <- birth_long |>
pivot_wider(
names_from = year,
values_from = lbw_rate
)
birth_wide
# A tibble: 3 × 4
state `2018` `2019` `2020`
<chr> <dbl> <dbl> <dbl>
1 Maine 7.1 7.3 7
2 Vermont 6.8 6.5 6.9
3 New Hampshire 6.2 6.4 6.1
This wide format is convenient for a presentation table where you want to compare across years at a glance.
A helpful way to remember: pivot_longer() makes your data longer (more rows, fewer columns), and pivot_wider() makes your data wider (fewer rows, more columns). If your column names contain data values (like years), you probably need pivot_longer(). If you want to spread values across columns, you need pivot_wider().
C.6 Exporting Data
After cleaning, reshaping, or creating a new dataset, you’ll often want to save it for future use or to share with others. The tidyverse provides straightforward functions for exporting data.
C.6.1 Writing CSV Files
The most common export format is CSV, because it can be opened by virtually any software:
# Write a data frame to a CSV file
write_csv(gdp_long, "gdp_data_clean.csv")
This creates a file called gdp_data_clean.csv in your working directory. The file will contain the data in plain text with commas separating the values—no special formatting, no formulas, just clean data.
C.6.2 Writing R Data Files
If you’re saving data that will only be used in R, the .rds format is preferable because it preserves all of R’s data types (factors, dates, etc.) and is more compact than CSV:
# Save as an RDS file
write_rds(gdp_long, "gdp_data_clean.rds")
To read it back in later:
gdp_long <- read_rds("gdp_data_clean.rds")
C.6.3 Writing Stata Files
If you need to share data with someone using Stata, the haven package can write .dta files:
library(haven)
write_dta(gdp_long, "gdp_data_clean.dta")
C.6.4 Writing Excel Files
To export to Excel format, you can use the writexl package:
library(writexl)
write_xlsx(gdp_long, "gdp_data_clean.xlsx")
Here’s a quick reference for the export functions:

| Format | Function | Package |
|---|---|---|
| CSV | write_csv() | readr (tidyverse) |
| TSV | write_tsv() | readr (tidyverse) |
| RDS | write_rds() | readr (tidyverse) |
| Stata (.dta) | write_dta() | haven |
| Excel (.xlsx) | write_xlsx() | writexl |
C.7 Summary and Conclusion
This appendix covered the essential workflow for getting data into and out of R and understanding its structure along the way.
C.7.1 Key Takeaways
- Documentation first: Always look for README files and codebooks before analyzing a dataset. Understanding what your variables measure and how they’re coded prevents errors downstream.
- Import with the right function: Use read_csv() for CSV files, read_excel() for Excel files, read_dta() for Stata files, and read_rds() for R data files. The tidyverse and its companion packages cover virtually every common file format.
- Explore before you analyze: Run head(), glimpse(), and summary() on every new dataset. Check for missing values and verify that variable types and ranges make sense. This quick health check catches problems early.
- Know wide vs. long: Wide data has one row per unit with separate columns for time periods; long data has one row per unit-time observation. Most R analysis functions expect long data.
- Reshape with pivot functions: Use pivot_longer() to go from wide to long and pivot_wider() to go from long to wide. The key arguments are cols, names_to/names_from, and values_to/values_from.
- Export in the right format: Use write_csv() for universal compatibility, write_rds() for R-only workflows, and format-specific functions like write_dta() when sharing with colleagues who use other software.
C.8 Check Your Understanding
For each question below, select the best answer from the dropdown menu. The dropdown will turn green if correct and red if incorrect. Click the “Show Explanation” toggle to see a full explanation of the answer after attempting each question.
- The tidyverse function for reading CSV files is read_csv() (with an underscore), from the readr package. Note that base R has a similar function called read.csv() (with a period), but read_csv() is faster, handles column types more intelligently, and produces tibbles rather than basic data frames.
- glimpse() provides a compact overview of a data frame showing every column’s name, its data type (e.g., <dbl>, <chr>, <fct>), and the first several values. This makes it one of the most efficient ways to quickly understand a dataset’s structure. head() shows rows, and summary() shows statistics.
- This is wide format data. The giveaway is that the column names contain data values (the years 2018, 2019, 2020). In long format, there would be a single year column and a single population column, with each state-year combination as its own row.
- pivot_longer() converts wide data to long data by stacking multiple columns into two new columns (one for the names, one for the values). pivot_wider() does the opposite. The older functions gather() and spread() did similar things but have been superseded by the pivot functions.
- The values_to argument specifies the name of the new column that will contain the data values from the pivoted columns. For example, if you’re pivoting columns gdp_2018, gdp_2019, and gdp_2020, the values in those columns will all go into the column you name in values_to. The names_to argument handles the column names, and cols specifies which columns to pivot.
- RDS files preserve all of R’s data types, including factors, dates, and other special types. When you save a factor as CSV and read it back, it becomes a character variable and you have to convert it again. RDS files also tend to be smaller because they use compression. However, RDS files can only be read by R, which is a trade-off compared to the universally readable CSV format.
- In pivot_wider(), names_from specifies the column whose values will become the new column names (here, year values like 2018, 2019, 2020 become column names), and values_from specifies the column whose values will fill those new columns (here, the gdp values). Note that pivot_wider() uses names_from/values_from, while pivot_longer() uses names_to/values_to—the “from” and “to” suffixes indicate the direction of the transformation.