8 Heteroskedasticity
- What is homoskedasticity and why is it a Gauss-Markov assumption?
- What happens to OLS estimates when heteroskedasticity is present?
- How does heteroskedasticity affect our hypothesis tests?
- What are heteroskedasticity-robust standard errors and when should we use them?
- How can we detect heteroskedasticity in our data?
- Wooldridge (2019), Ch. 8
One of the key assumptions underlying OLS is that the variance of the error term is constant across all values of the explanatory variables. When this assumption fails, we have heteroskedasticity—and while our coefficient estimates remain unbiased, our standard errors become unreliable. This chapter explores what heteroskedasticity is, why it matters, and how to fix it.
8.1 The Homoskedasticity Assumption
8.1.1 Gauss-Markov Assumption 5
Recall the Gauss-Markov assumptions that make OLS the Best Linear Unbiased Estimator (BLUE). The fifth assumption is homoskedasticity:
The error term \(u\) has the same variance given any value of the explanatory variables: \[Var(u | x_1, x_2, ..., x_k) = \sigma^2\] The variance is constant: it doesn’t depend on \(x\).
This assumption is crucial because it allows us to derive the correct formula for the variance of \(\hat{\beta}\), which in turn gives us valid standard errors, t-statistics, and p-values.
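To see the assumption in numbers, here is a small base-R sketch (an illustration of my own, not the chapter's code): simulate errors whose standard deviation does not depend on \(x\), then check that the spread is about the same at small and large \(x\).

```r
# Simulate data satisfying Var(u | x) = sigma^2 (homoskedasticity)
set.seed(1)
n <- 10000
x <- runif(n, 1, 50)
u <- rnorm(n, mean = 0, sd = 10)   # sd = 10 regardless of x
y <- 5 + 2 * x + u                 # beta0 = 5, beta1 = 2 chosen arbitrarily

# The error spread is roughly the same in both halves of the x range
sd_low  <- sd(u[x < 25])
sd_high <- sd(u[x >= 25])
c(sd_low, sd_high)   # both close to 10
```

With homoskedastic errors, splitting the sample by \(x\) changes nothing about the error spread; that is exactly what fails in the next section.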
8.1.2 What Does Homoskedasticity Look Like?
In a homoskedastic regression, the spread of observations around the regression line is constant across all values of \(x\).
Notice how the “cone” of data points has roughly the same width whether \(x\) is small or large. This is homoskedasticity in action.
8.2 Heteroskedasticity: When Variance Changes
8.2.1 What Is Heteroskedasticity?
Heteroskedasticity occurs when the variance of the error term changes with the level of the explanatory variable: \[Var(u_i | x_i) = \sigma_i^2\]
The subscript \(i\) indicates that variance depends on observation \(i\)’s value of \(x\).
8.2.2 What Does Heteroskedasticity Look Like?
The classic pattern is a “fan” or “cone” shape, where the spread of data increases (or decreases) as \(x\) changes:
This pattern is extremely common in economic data. For example:
- Income and consumption: Higher-income households have more variable consumption patterns
- Firm size and profits: Larger firms have more variable profit margins
- Experience and wages: More experienced workers have more variable wages
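The fan pattern is easy to reproduce. In this base-R sketch (my own illustration, not the chapter's code), the error standard deviation grows linearly with \(x\), so the spread at the wide end of the fan is much larger than at the narrow end:

```r
# Heteroskedastic errors: sd grows with x, producing a fan/cone shape
set.seed(2)
n <- 10000
x <- runif(n, 1, 50)
u <- rnorm(n, mean = 0, sd = 0.5 * x)   # Var(u | x) = (0.5 * x)^2
y <- 5 + 2 * x + u

sd_low  <- sd(u[x < 25])    # spread at the narrow end of the fan
sd_high <- sd(u[x >= 25])   # much larger spread at the wide end
c(sd_low, sd_high)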
8.2.3 Side-by-Side Comparison
8.3 Why Does Heteroskedasticity Matter?
8.3.1 The Good News: Coefficients Are Still Unbiased
Here’s the crucial point: heteroskedasticity does NOT bias our coefficient estimates. The unbiasedness of OLS depends only on Gauss-Markov assumptions 1-4, not on homoskedasticity.
So \(E[\hat{\beta}] = \beta\) still holds. Our estimates are centered on the truth.
8.3.2 The Bad News: Standard Errors Are Wrong
However, the standard error formulas we’ve been using assume homoskedasticity. When this assumption is violated:
- The formula \(SE(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{SST_x}}\) is invalid
- Our computed standard errors will typically be too small
- This means t-statistics are too large
- And p-values are too small
The result? We reject the null hypothesis too often, even when it’s true.
8.3.3 Type I Error: A Quick Review
A Type I error (false positive) occurs when we reject a true null hypothesis.
When we set \(\alpha = 0.05\), we’re accepting a 5% probability of making this error. If our test is working correctly, we should reject true null hypotheses about 5% of the time over many repeated samples.
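A quick base-R check of this claim (my own sketch, separate from the chapter's simulations): run many t-tests on data where the null hypothesis is true and confirm the rejection rate lands near 5%.

```r
# Under a true null, p-values are uniform, so P(p < 0.05) = 0.05
set.seed(3)
n_sims <- 5000
pvals <- replicate(n_sims, t.test(rnorm(30))$p.value)  # H0: mean = 0 is true
reject_rate <- mean(pvals < 0.05)
reject_rate   # close to 0.05 when the test is working correctly
```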
With heteroskedasticity and standard OLS standard errors, the actual Type I error rate can be much higher than 5%—sometimes 10%, 15%, or even higher.
8.4 Simulation: Seeing the Problem
8.4.1 Setup: Simulating Hypothesis Tests
Let’s run a simulation to see exactly what goes wrong. We’ll generate data where the true effect is zero (\(\beta_1 = 0\)), then see how often we incorrectly reject this true null hypothesis.
First, let’s see what happens under homoskedasticity (where everything works correctly):
# Load the package used below (purrr provides map_dfr())
library(purrr)

# Simulation parameters
n_sims <- 2000     # Number of simulations
n_obs <- 250       # Observations per sample
beta0_true <- 5    # True intercept
beta1_true <- 0    # True slope (NULL HYPOTHESIS IS TRUE!)

set.seed(42)

# Function for ONE simulation trial under HOMOSKEDASTICITY
run_homo_trial <- function(i) {
  x <- runif(n_obs, 1, 50)
  error <- rnorm(n_obs, mean = 0, sd = 10)  # Constant variance
  y <- beta0_true + beta1_true * x + error
  model <- lm(y ~ x)
  # Get OLS p-value
  pval_ols <- summary(model)$coefficients["x", "Pr(>|t|)"]
  return(data.frame(pval_ols = pval_ols))
}

# Run simulations
homo_results <- map_dfr(1:n_sims, run_homo_trial)

# Calculate rejection rate (Type I error rate)
homo_reject_rate <- mean(homo_results$pval_ols < 0.05)
Under homoskedasticity, the Type I error rate is 5.7%, close to the intended 5% rate.
8.4.2 Now With Heteroskedasticity
Now let’s see what happens when we have heteroskedasticity but use standard OLS standard errors:
# coeftest() comes from lmtest; vcovHC() comes from sandwich
library(lmtest)
library(sandwich)

# Function for ONE simulation trial under HETEROSKEDASTICITY
run_hetero_trial <- function(i) {
  x <- runif(n_obs, 1, 50)
  # Variance INCREASES with x (heteroskedasticity!)
  error <- rnorm(n_obs, mean = 0, sd = sqrt(0.2 * x^3))
  y <- beta0_true + beta1_true * x + error
  model <- lm(y ~ x)
  # Get both OLS and robust p-values
  res_ols <- coeftest(model)
  res_robust <- coeftest(model, vcov. = vcovHC(model, type = "HC1"))
  return(data.frame(
    pval_ols = res_ols["x", "Pr(>|t|)"],
    pval_robust = res_robust["x", "Pr(>|t|)"]
  ))
}

# Run simulations
hetero_results <- map_dfr(1:n_sims, run_hetero_trial)

# Calculate rejection rates
ols_reject_rate <- mean(hetero_results$pval_ols < 0.05)
robust_reject_rate <- mean(hetero_results$pval_robust < 0.05)
The difference is stark! With standard OLS standard errors (left panel), we reject the true null hypothesis 10.6% of the time—more than twice our intended 5% rate. This is a serious problem: we’re finding “significant” effects that don’t exist.
With heteroskedasticity-robust standard errors (right panel), the rejection rate returns to approximately 5%.
8.5 Heteroskedasticity-Robust Standard Errors
8.5.1 The Solution
The fix is remarkably simple: use a different formula for the standard errors that accounts for heteroskedasticity. These are called heteroskedasticity-robust standard errors (also known as White standard errors or Huber-White standard errors).
The standard OLS variance formula is: \[\widehat{Var}(\hat{\beta}_1) = \frac{\hat{\sigma}^2}{SST_x} = \frac{\sum_{i=1}^{n}\hat{u}_i^2 / (n-2)}{SST_x}\]
The heteroskedasticity-robust formula is: \[\widehat{Var}(\hat{\beta}_1) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2\hat{u}_i^2}{SST_x^2}\]
The key difference is that robust standard errors weight each squared residual by how far its \(x_i\) is from the mean. This accounts for the fact that residuals may have different variances at different values of \(x\).
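Both formulas can be computed by hand. The following base-R sketch (my own, using simulated data rather than the chapter's) builds both estimates for a simple regression and checks that the classical one matches what `lm()` reports:

```r
# Simulated heteroskedastic data: error sd grows with x
set.seed(1)
n <- 500
x <- runif(n, 1, 50)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)

fit  <- lm(y ~ x)
u    <- residuals(fit)
sstx <- sum((x - mean(x))^2)   # SST_x

# Classical formula: sigma-hat^2 / SST_x
se_classic <- sqrt((sum(u^2) / (n - 2)) / sstx)

# Robust (White) formula: weight each squared residual by (x_i - x-bar)^2
se_robust <- sqrt(sum((x - mean(x))^2 * u^2) / sstx^2)

c(se_classic, se_robust)   # robust is larger here: variance rises with x
```

The classical estimate reproduces `summary(fit)`'s standard error for `x` exactly, while the robust one is noticeably larger for this data, as expected when the variance grows with \(x\).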
8.5.2 Visualizing the Sampling Distributions
The figure shows that:
- The actual sampling distribution (blue) has heavier tails than OLS standard errors predict
- Standard OLS SEs (red dashed) assume a narrower distribution than reality
- Robust SEs (green) correctly capture the true variability
8.5.3 Comparing Type I Error Rates
8.6 Implementing Robust Standard Errors in R
8.6.1 Using fixest
The easiest way to get robust standard errors is with the fixest package, which we’ll use throughout this course:
# fixest for estimation, modelsummary for the comparison table
library(fixest)
library(modelsummary)

# Generate some heteroskedastic data
set.seed(123)
example_data <- tibble(
  x = runif(200, 1, 50),
  y = 5 + 0.5 * x + rnorm(200, 0, sqrt(0.1 * x^2))
)

# Standard OLS (wrong SEs with heteroskedasticity)
model_ols <- feols(y ~ x, data = example_data)

# With heteroskedasticity-robust SEs
model_robust <- feols(y ~ x, vcov = "HC1", data = example_data)

# Compare the results
modelsummary(
  list("Standard SE" = model_ols, "Robust SE" = model_robust),
  stars = TRUE,
  gof_map = c("nobs", "r.squared")
)

|             | Standard SE | Robust SE |
|-------------|-------------|-----------|
| (Intercept) | 5.353***    | 5.353***  |
|             | (1.216)     | (0.906)   |
| x           | 0.482***    | 0.482***  |
|             | (0.042)     | (0.045)   |
| Num.Obs.    | 200         | 200       |
| R2          | 0.402       | 0.402     |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Notice that:
- The coefficients are identical (unbiasedness is not affected)
- The standard errors differ (robust SEs are typically larger)
- This affects t-statistics and p-values
8.6.2 Best Practice: Always Use Robust Standard Errors
Always report heteroskedasticity-robust standard errors.
Why? Because:
- If heteroskedasticity is present, robust SEs are correct and standard SEs are wrong
- If homoskedasticity holds, robust SEs are still valid (just slightly less efficient)
You can’t lose by using robust SEs, but you can lose badly by not using them.
8.7 Detecting Heteroskedasticity
8.7.1 Residual Plots
The best way to detect heteroskedasticity is to look at your residuals. Plot the residuals against the fitted values (or against each explanatory variable) and look for patterns:
# Fit model to heteroskedastic data
model <- lm(y ~ x, data = example_data)

# Add residuals and fitted values to data
example_data <- example_data |>
  mutate(
    fitted = fitted(model),
    resid = residuals(model)
  )
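As a minimal stand-in for the chapter's ggplot2 figures, here is a base-R version of the residual-vs-fitted plot (my own sketch; it rebuilds the heteroskedastic example data so it runs on its own):

```r
# Rebuild the heteroskedastic example data and fit the model
set.seed(123)
x <- runif(200, 1, 50)
y <- 5 + 0.5 * x + rnorm(200, 0, sqrt(0.1 * x^2))
model <- lm(y ~ x)

# Residuals vs. fitted values: look for a fan/cone in the spread
plot(fitted(model), residuals(model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)   # residuals should straddle zero
```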
8.7.2 What to Look For
Homoskedasticity: Residuals form a random cloud with constant spread
Heteroskedasticity: Look for:
- Fan/cone shapes (spread increases or decreases)
- Patterns in the spread related to fitted values
- Different variability for different groups
8.7.3 Formal Tests
While visual inspection is usually sufficient, you can also use formal tests like the Breusch-Pagan test:
# Breusch-Pagan test (bptest() is from the lmtest package)
bptest(model)

    studentized Breusch-Pagan test

data:  model
BP = 29.91, df = 1, p-value = 0.00000004526
A small p-value (< 0.05) suggests heteroskedasticity is present. However, don’t rely solely on formal tests—always look at your residual plots too.
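Under the hood, the studentized Breusch-Pagan statistic is \(n \times R^2\) from regressing the squared residuals on the explanatory variables. A base-R sketch of that idea (my own, rebuilding the example data so the snippet runs on its own):

```r
# Rebuild the heteroskedastic example data and fit the model
set.seed(123)
x <- runif(200, 1, 50)
y <- 5 + 0.5 * x + rnorm(200, 0, sqrt(0.1 * x^2))
model <- lm(y ~ x)

# Auxiliary regression: do the squared residuals vary with x?
aux     <- lm(residuals(model)^2 ~ x)
bp_stat <- length(x) * summary(aux)$r.squared           # n * R-squared
p_val   <- pchisq(bp_stat, df = 1, lower.tail = FALSE)  # chi-squared, df = 1
c(bp_stat, p_val)   # a small p-value points to heteroskedasticity
```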
8.8 Summary
Heteroskedasticity is a common issue in economic data where the variance of errors changes with the explanatory variables.
Key Points:
- Heteroskedasticity does NOT bias coefficients - \(E[\hat{\beta}] = \beta\) still holds
- Heteroskedasticity DOES invalidate standard errors - leading to incorrect inference
- The main symptom is inflated Type I error rates - we reject true nulls too often
- The solution is simple: use robust standard errors - specify `vcov = "HC1"` in `feols()`
- Detect heteroskedasticity with residual plots - look for fan/cone patterns
Best Practice: Always use heteroskedasticity-robust standard errors. They’re valid whether or not heteroskedasticity is present, so there’s no downside to using them.
8.9 Check Your Understanding
For each question below, select the best answer from the dropdown menu.
Heteroskedasticity means the variance of the error term is not constant—it changes depending on the value of x. For example, variance might increase with x, creating a “fan” pattern in the data.
No, heteroskedasticity does NOT cause bias in coefficient estimates. Unbiasedness depends only on Gauss-Markov assumptions 1-4 (linearity, random sampling, no perfect collinearity, zero conditional mean). The homoskedasticity assumption (GM5) affects efficiency and inference, not unbiasedness.
When heteroskedasticity is present but ignored, the standard error formula is wrong. Typically, standard errors are underestimated, making t-statistics too large and p-values too small. This leads to rejecting true null hypotheses too often (inflated Type I error rate).
The classic sign of heteroskedasticity is a “fan” or “cone” shape in the residual plot, where the spread of residuals increases (or decreases) as the fitted values change. Constant spread would indicate homoskedasticity.
Robust standard errors are valid whether or not heteroskedasticity exists. If there’s no heteroskedasticity, they’re slightly less efficient but still correct. If there IS heteroskedasticity, they’re correct while standard SEs are wrong. There’s no downside to using them.
When robust SEs are larger than standard OLS SEs, it typically indicates heteroskedasticity is present. The standard OLS formula is underestimating the true variability in β̂, while the robust formula correctly accounts for the non-constant variance. This is a common pattern and exactly why robust SEs are recommended.