1  Introduction

  • What is the difference between correlation and causation?
  • What does an econometrics research project entail?
  • What are the main types of data formats we will work with?
  • What is statistical software and why is it useful?
  • Wooldridge (2019), Ch. 1
  • Wickham, Çetinkaya-Rundel, and Grolemund (2023), Ch. 1

1.1 What is this Class About?

This class is obviously dedicated to the study of econometrics (endearingly shortened `metrics). But what exactly does econometrics entail? What differentiates it from broader terms such as “statistical analysis” or “data analysis”?

From Wooldridge (2019), we get a classic definition of econometrics:

“…statistical methods for estimating economic relationships, testing economic theories, and evaluating and implementing government and business policy.”

However, my view of econometrics is at the same time, broader and more narrow:

Definition of Econometrics

The practice of using statistical, quantitative methods to study social phenomena and provide insights into public policy. More specifically, modern econometrics has focused on identifying the causal effects of social phenomena/policy as opposed to correlations.

In other words, I view the practice of econometrics as the use of quantitative methods to understand the social world around us, with a special (though not exclusive) lens focusing on separating correlation from causation.

1.2 Does Having Health Insurance Cause You to be Healthier?

Let’s think about a critical question in public policy: does having health insurance make someone healthier? This is especially an important question in the context of the United States, which does not have a public health insurance system. Instead, it is a predominantly private system entangled with a complex net of social safety nets meant to provide a patchwork of coverage to the most marginalized.

However, as of 2023, 9.5% of U.S. adults do not have any health insurance coverage.1 Would the US as a whole be better off guarantee health insurance coverage for everyone? If health insurance makes people healthier, then there may be a strong case to do so. If not, the benefits of such a system would become, at the very least, more opaque.

Let’s take a look at life expectancy in the US (a course but important measure of population health), relative to similar high-income, developed countries. Figure 1.1 shows the average life expectancy from 1990 to 2021 for the US, UK, Canada, Australia, and a number of other high-income countries. As can be seen, the US performs considerably worse in terms of life expectancy, and this divergence between peer countries has gotten larger since the 1990s, and especially during the COVID-19 pandemic.

Figure 1.1: Life expectancy in the US vs. peer countries.

Critically—with the exception of the US—all of the countries in the peer group have some form of universal health care. In other words, all developed countries which are healthier than the US have policies which guarantee the uninsured rate is zero.

This then leads to a critical question: did universal health insurance cause these countries to have better health outcomes? Or is this a correlation between policy and outcomes? Certainly, one could make an argument that having health insurance does cause better health outcomes. On the other hand, there are lots of differences between the US and other high-income countries which may also explain the pattern seen in Figure 1.1. For example, maybe these countries have different regulations regarding the food quality, which may explain part of the difference in life expectancy. Or maybe because relatively more people commute to work via bicycle or walking, which may have health benefits compared to driving to work.

These other differences between groups which may confound our comparison between countries with universal health insurance and those without also extends to individuals. In the US at least, those with health insurance may be more likely to have a high-paying job which includes insurance as a benefit. Therefore, a comparison between those who have insurance and those who do not will be plagued with the fact that those who have insurance also are more likely to have a high income job, who may have had better health outcomes regardless of if they had insurance or not. In other words, we may be confusing correlation for **causation*.

Correlation vs. Causation

The correlation between two variables describes how they move together. It means two things are associated. Causation describes a direct relationship where a change in one variable causes a change in another. It means one thing makes the other happen.

A key part of the content in this course is to think about when our comparisons are describing a correlative relationship or a causal relationship. We will study techniques which—if one is convinced that the requisite assumptions hold—will actually be able to recover the causal effect of social phenomena, and bring important insights into essential public policy questions like: what is the effect of health insurance on health outcomes? Does going to college cause you to have higher wages? Do immigrants reduce native wages? and does policing cause decreases in crime?

1.3 Broad Steps of Econometric Analysis

Let’s say we want to know what the effect of on-the-job training has on your wage. What steps would we need to take to answer this question?

1.3.1 Step 1: Write the Model

Our first step would be to write an econometric model, with the goal of using data and econometric tools to estimate the parameters of the model. This model can sometimes be derived from a formal model in economics (i.e., the profit maximization of firms), but it is at least based on some intuitive logic which mathematically relates some variable (the dependent variable) to some independent variable(s).

An example of the type of econometric model you may want to run to study the effect of on-the-job training is:

\[ wage = \beta_0 + \beta_1(training) + \mu, \tag{1.1}\]

where \(wage\) is the outcome, left-hand side (LHS), predicted, or dependent variable (these are synonyms for the same thing), and \(training\) is the explanatory, right-hand side (RHS), predictor, or independent variable (again, synonyms for the same thing). The \(\beta_0\) and \(\beta_1\) variables are the parameters that we want to estimate the value of using econometric methods, and they tell us how our independent variable is related mathematically to our dependent variable.

1.3.2 Step 2: Find and Explore the Data

The next thing you have to do—after we figure out the model we want to estimate, such as in Equation 1.1—is to get the data. There is a huge range of datasets out there, and we are especially lucky that the data-gathering infrastructure in the United States is incredibly robust. As we go through the semester, we will work with as many different datasets that we can, which cover a range of topics: from sports to health, crime, and economic development. While different datasets usually can their own quirks, broadly speaking there are three different types of data we will work with in this class:

  1. Cross-sectional data: This is data which contains information on units (e.g., individuals, countries, firms, households, etc.) at a single point in time. This data is the most common, and also is generally the simplest to work with, so we will start out by working mostly with cross-sectional data.

  2. Time-series data: This data consists of observations on a single unit (e.g., a country’s Gross Domestic Product (GDP), a specific company’s stock price, or the national unemployment rate) across multiple time periods (e.g., years, quarters, months, or even minutes). In time series data, a crucial characteristic is the temporal ordering of the observations. This structure is often used for forecasting and modeling how a variable’s past values influence its present and future values.

  3. Panel (or Longitudinal) data: Panel data combines both cross-sectional and time-series dimensions. It consists of observations on multiple units (e.g., individuals, firms, or countries) over multiple time periods. It essentially tracks the same set of cross-sectional units over time. This gives it more information, more variability, and allows researchers to use a rich set of methods to really hone in on the causal effects of whatever social phenomena being studied.

In all of these cases, econometric data is organized in what is called a matrix or a dataframe. It should look roughly similar to an Excel Spreadsheet, where each column represents a different variable (which is usually labelled, sometimes not), and each row represents the values that a specific unit has across these variables.

For example, the data we use to estimate the model in Equation 1.1 may look like Table 1.1,

Table 1.1: Relationship between training and wages for 10 workers
person_name person_id wage yrs_training
Paul 1 18.50 1.00
Ringo 2 22.00 3.00
George 3 17.65 2.50
John 4 27.95 8.20
Linda 5 19.20 0.50
Yoko 6 24.80 6.80
Pattie 7 20.50 5.50
Maureen 8 23.75 4.20
Cynthia 9 16.40 0.80
Barbara 10 28.50 9.10

where the person_name column is (obviously) each person’s name, person_id column corresponds the unique numerical identifier associated with each person (this will be more common to see in datasets since names are usually anonymous), the wage column shows each person’s hourly wage in dollars, and the yrs_training column shows the number of years each person received on-the-job training. This row-column structure of organizing data is so fundamental that the word column is often interchanged with the word variable.

When you first look at some data, it is really important to figure out what each row represents. In the simple example shown in Table 1.1, this is straightforward since each row represents a different person. However, in more complex datasets, it may not be immediately obvious what each row represents. For example, in panel data, each row may represent a specific person at a specific point in time. So row 1 may be Paul in year 2020, row 2 may be Paul in year 2021, row 3 may be Ringo in year 2020, and so on. Figuring out what each row represents is critical to understanding the data, as well as how to appropriately analyze it.

Once we have the data in hand, then we may need to do some data cleaning and organization. This may involve: - Removing missing values - Creating new variables - Changing the format of variables (e.g., from string to numeric) - Merging datasets together - Filtering the data to only include certain observations - And many other tasks.

These are all tasks that in many cases take up the bulk of the time in an econometric research project, and while not the most glamorous part of econometrics, it is absolutely essential to get right.

Before moving on to estimation, it is also important to do some exploratory data analysis, (which should be considered step 2.5 in the econometric research process). This involves looking at summary statistics of the data (means, medians, standard deviations, etc.), as well as visualizations (histograms, scatter plots, etc.). Exploratory data analysis is important for understanding the data, as well as for identifying potential issues with the data (e.g., outliers, missing values, etc.).

One of the most important parts of exploratory data analysis is to look at the relationship between the dependent variable and independent variable(s) in your model. One way this can be done is using a scatterplot, which plots each observation in the dataset using the independent and dependent variables as coordinates, with the independent variable on the x-axis and the dependent variable on the y-axis.

Figure 1.2: Relationship between training and wages for 10 workers

In Figure 1.2, we see a scatterplot of the relationship between years of training and wages for the 10 workers shown in Table 1.1. From this plot, we can see that there is a positive relationship between years of training and wages: as years of training increases, wages also tend to increase. While not good enough to draw any strong conclusions, one can see how this type of exploratory data analysis is useful for understanding the data and the relationship between variables.

1.3.3 Step 3: Estimate the Model

Once we have a model in mind as well as the appropriate data, we can then use our econometric methods to estimate the relationship. The bulk of this class is dedicated to this step in the econometric research process, but the other stages are no less important! The most common way that econometric models are estimated is with Ordinary Least Squares (OLS), which you may remember from your pre-req. In addition to estimating the actually value of the parameters in the model, we also want some idea of how uncertain we are about the estimated parameter value. This is where notions of statistical significance and confidence intervals come in.

1.3.4 Step 4: Interpret the Estimate

Once we have an estimated number for the parameter in our model, we then have to interpret the number. Here, the phrase “interpret the estimate” means two parallel things. First, there is the literal, mathematical interpretation of the model. This will change depending on the functional form we use to estimate the model, and we will spend lots of time learning about this. The second part of “interpreting” the model is about what the results mean in the context of your broader research question. Is your estimate big or small? Is it economically important? Do you actually think it reflects a causal effect? These are all types of interpretation which are a critical part of econometrics research, but for which there is not usually a formula we can use. A large part of the “art” working on an econometrics project is exactly this type of interpretation, and how you form it into a compelling narrative. While there are some tools we will cover which will help in this regard, in my opinion, nothing will help you more with this than just reading lots of good, well-written econometrics papers.

1.4 Statistical Software

Much of the econometric process is done using specialized statistical software. We will use this software to: “clean” our data and organize in the appropriate row-column format to do our econometric estimations; explore our data using descriptive statistics (means, medians, etc.) and visualizations (an increasingly important part of a compelling econometric project); estimate our model and conduct analyses of statistical uncertainty; and output our results to easy-to-interpret tables and figures.

There are a few commonly used econometrics software:

  • Stata: Proprietary software purpose-built for econometrics
  • R: Open-source programming language with a focus on statistical analysis
  • Python: Open-source programming language with a focus on data science more broadly
  • EViews: Proprietary software with a focus on forecasting and other time series applications

Each one of these software of their own pros and cons, and it is worth knowing several quite well. For the purposes of this class, we will learn the R programming language to do econometrics. If you haven’t take the time to review Appendix A to learn more about R, how to install it, and to run your first chunk of code!

1.5 Summary and Conclusion

This chapter introduced the fundamental framework that will guide our study of econometrics throughout this course. We defined econometrics as the use of quantitative methods to understand social phenomena and inform public policy, with a particular focus on distinguishing between correlation and causation. Through the example of health insurance and health outcomes, we saw how patterns in data can be misleading. While the United States lags behind peer countries in life expectancy, and those peer countries all have universal health insurance, we cannot immediately conclude that universal coverage causes better health outcomes. The tools you will learn in this course are designed to help you identify causal relationships rather than mere associations.

1.5.1 Key Takeaways:

  • Econometrics vs. correlation: Modern econometrics focuses on identifying causal effects, not just correlations between variables

  • Four steps of econometric research:

    1. Write an econometric model
    2. Find and explore the appropriate data
    3. Estimate the model parameters
    4. Interpret the results
  • Three main data types: Cross-sectional (single point in time), time-series (one unit over time), and panel data (multiple units over time)

  • Statistical software: We will use R to clean data, estimate models, and present results

As we move forward, remember that econometrics is as much an art as it is a science. The mathematical tools we will learn are powerful, but they are only as good as the questions we ask and the care with which we interpret our results.

1.6 Check Your Understanding

For each question below, select the best answer from the dropdown menu. The dropdown will turn green if correct and red if incorrect. Click the “Show Explanation” toggle to see a full explanation of the answer after attempting each question.

  1. Cities may hire more police because they have high crime (reverse causality), or other factors like population density may drive both police presence and crime rates. This illustrates why correlation does not imply causation.

  2. This tracks a single unit (France) across multiple time periods (1990–2023), which defines time-series data. Cross-sectional data would be multiple countries at one point in time; panel data would be multiple countries over multiple years.

  3. The dependent variable (wage) is what we’re trying to explain—it goes on the left-hand side. The independent variable (training) is what we think affects it—it goes on the right-hand side.

  4. The four steps of econometric research are: (1) write the model, (2) find and explore data, (3) estimate the model, (4) interpret results. Exploratory data analysis—looking at summary statistics and visualizations—is critical before estimation.

  5. Those with insurance often have higher incomes, more education, and better access to other resources—factors that independently improve health. Without accounting for these confounders, we cannot separate the effect of insurance from the effect of these other characteristics. This is exactly why econometric methods for causal inference exist.


  1. Source: KFF↩︎