What are the assumptions of ordinary least squares (OLS) in econometrics?

Ordinary Least Squares (OLS) is the simplest and most widely used estimation technique in econometrics. It provides the Best Linear Unbiased Estimators (BLUE) for the parameters in a linear regression model, provided that a specific set of assumptions, known as the Classical Linear Model (CLM) assumptions, is met.

When these assumptions hold, the OLS estimates are considered reliable and trustworthy for statistical inference. When they are violated, the results can be misleading, biased, or inefficient.

Here are the nine core assumptions of OLS, grouped by their impact on your model.

Group 1: Assumptions for Linearity and Data Quality

These assumptions ensure the model is correctly specified and the data is appropriate.

1. Linearity in Parameters

The relationship between the dependent variable and the independent variables must be linear in the parameters.

  • Formal Statement: y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + u
  • Implication: This does not mean the variables themselves must be linear; you can use transformations such as log(x) or x². It just means the coefficients (the β parameters) must enter the equation linearly, as in the sketch below.
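
For instance, here is a minimal sketch (Python with statsmodels; the data and numbers are simulated for illustration) of a model that is nonlinear in the variable x but still linear in the parameters, so OLS applies directly:

```python
# A minimal sketch, assuming simulated data: the model
# y = b0 + b1*log(x) + u is nonlinear in x but linear in the
# parameters b0 and b1, so it is estimable by OLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=200)
y = 2.0 + 3.0 * np.log(x) + rng.normal(scale=0.5, size=200)

X = sm.add_constant(np.log(x))  # regress y on a constant and log(x)
results = sm.OLS(y, X).fit()
print(results.params)  # estimates should land near the true (2.0, 3.0)
```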

2. Random Sampling

The data used in the model must be a random sample from the population.

  • Implication: This ensures that the data is representative of the population you are trying to make inferences about, which is necessary for the estimators to be unbiased.

3. No Perfect Collinearity

No independent variable may be an exact linear combination of the other independent variables.

  • Implication: If two variables are perfectly correlated (e.g., measuring height in inches and height in centimeters), the OLS model cannot uniquely determine the separate effect of each variable, leading to a computational breakdown. Multicollinearity (high, but not perfect, correlation) is a lesser, but still problematic, form of this violation.
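
To see this numerically, here is a minimal sketch (Python with statsmodels; all data and names are invented) using the height example above, where variance inflation factors (VIFs) reveal the near-breakdown:

```python
# A minimal sketch, assuming simulated data: height measured in both
# inches and centimeters produces (near-)perfect collinearity, which
# shows up as enormous variance inflation factors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
height_in = rng.normal(68, 3, size=100)
# Tiny measurement noise keeps the example numerically well-defined;
# with exactly proportional columns, OLS cannot separate the two at all.
height_cm = height_in * 2.54 + rng.normal(scale=0.01, size=100)
weight = rng.normal(160, 20, size=100)

X = sm.add_constant(np.column_stack([height_in, height_cm, weight]))
for i in range(1, X.shape[1]):  # skip the constant column
    print(variance_inflation_factor(X, i))  # huge VIFs for both heights
```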

4. Sufficient Variation in X

There must be some variation in the independent variables across your sample.

  • Implication: If x doesn’t vary (e.g., if age is a regressor but you only sampled people aged 30), you cannot measure how changes in x affect y.

Group 2: Assumptions Concerning the Error Term

These are the most critical assumptions for the statistical properties of the OLS estimators.

5. Zero Conditional Mean (Exogeneity)

The error term has an expected value of zero conditional on the explanatory variables.

  • Formal Statement: E[u | x₁, x₂, …, xₖ] = 0
  • Implication: This is the most crucial assumption, often referred to as exogeneity. It means the independent variables are uncorrelated with the unobserved factors captured by the error term. Violation of this assumption leads to endogeneity (e.g., omitted variable bias, measurement error, or reverse causality), making the OLS estimates biased and inconsistent; the simulation below illustrates the omitted-variable case.
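
Here is a minimal sketch (Python with statsmodels; the wage/education/ability setup and all numbers are invented) showing how omitting a variable that is correlated with a regressor biases the OLS slope:

```python
# A minimal sketch, assuming simulated data: "ability" affects wage
# and is correlated with education. Omitting it violates E[u|x] = 0,
# so the short regression's education coefficient is biased.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5_000
ability = rng.normal(size=n)
education = 12 + 2 * ability + rng.normal(size=n)  # correlated with ability
wage = 1.0 + 0.5 * education + 1.0 * ability + rng.normal(size=n)

# Short regression omits ability: the slope absorbs part of ability's
# effect and comes out well above the true 0.5.
short_reg = sm.OLS(wage, sm.add_constant(education)).fit()
# Long regression includes ability: the slope is close to 0.5.
long_reg = sm.OLS(wage, sm.add_constant(np.column_stack([education, ability]))).fit()
print(short_reg.params[1], long_reg.params[1])
```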

6. Homoskedasticity (Constant Variance)

The variance of the error term must be constant across all observations of the independent variables.

  • Formal Statement: Var(u | x₁, x₂, …, xₖ) = σ²
  • Implication: Violation of this assumption is called heteroskedasticity. While heteroskedasticity does not bias the coefficient estimates, it makes the standard errors incorrect. This means your t-statistics and confidence intervals are unreliable, leading to faulty hypothesis testing.
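
A common diagnostic is the Breusch-Pagan test; here is a minimal sketch (Python with statsmodels, simulated data) where the error variance is built to grow with x:

```python
# A minimal sketch, assuming simulated data: the error's spread grows
# with x, so the Breusch-Pagan test should reject homoskedasticity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = rng.uniform(0.1, 10, size=500)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=500)  # variance rises with x

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# het_breuschpagan returns (LM stat, LM p-value, F stat, F p-value);
# a small p-value is evidence against constant error variance.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print(lm_pvalue)
```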

7. No Autocorrelation (Serial Correlation)

The error terms for different observations must be uncorrelated with each other.

  • Formal Statement: Cov(uᵢ, uⱼ) = 0 for all i ≠ j
  • Implication: Violation of this assumption, often called serial correlation, is common in time series data (where a shock today affects tomorrow’s error). Like heteroskedasticity, it does not bias the coefficients, but it renders the standard errors incorrect and the OLS estimator inefficient, invalidating standard hypothesis tests.
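
A standard first check is the Durbin-Watson statistic; below is a minimal sketch (Python with statsmodels, simulated AR(1) errors) where positive serial correlation pulls the statistic well below 2:

```python
# A minimal sketch, assuming simulated data: errors follow an AR(1)
# process, a classic source of serial correlation in time series.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
n = 300
t = np.arange(n, dtype=float)
u = np.zeros(n)
for i in range(1, n):
    u[i] = 0.8 * u[i - 1] + rng.normal()  # today's shock carries over
y = 1.0 + 0.1 * t + u

results = sm.OLS(y, sm.add_constant(t)).fit()
# Near 2 suggests no first-order autocorrelation; values toward 0
# indicate positive serial correlation (expected here).
print(durbin_watson(results.resid))
```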

Group 3: Normality and Model Specification

The normality assumption is not necessary for the OLS coefficients to be BLUE, but it is required for exact hypothesis tests in small samples.

8. Normal Distribution of Errors

The error term must be normally distributed around its mean of zero.

  • Implication: If the sample size is large enough, the Central Limit Theorem (CLT) ensures that the sampling distribution of the OLS estimators is approximately normal, even if the errors themselves aren’t. Thus, for large samples, this assumption is often considered less critical for reliable inference.
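
For small samples where this matters, a common check is the Jarque-Bera test on the residuals; here is a minimal sketch (Python with statsmodels, simulated heavy-tailed errors):

```python
# A minimal sketch, assuming simulated data: the errors are drawn from
# a heavy-tailed t distribution, so Jarque-Bera should reject normality.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera

rng = np.random.default_rng(5)
x = rng.normal(size=400)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=400)  # heavy-tailed errors

results = sm.OLS(y, sm.add_constant(x)).fit()
# Returns (JB stat, p-value, skewness, kurtosis); a small p-value
# rejects normality of the residuals.
jb_stat, jb_pvalue, skew, kurt = jarque_bera(results.resid)
print(jb_pvalue)
```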

9. Correct Functional Form (Implied)

Although often stated separately, a common implied assumption is that the functional form of the model is correct, i.e., you haven’t omitted relevant variables or mis-specified the shape of the relationship. (Including irrelevant variables is less damaging: it costs efficiency rather than unbiasedness.)

  • Implication: Specifying the wrong functional form or omitting key variables can lead directly to the violation of the Zero Conditional Mean assumption (Assumption 5), resulting in biased estimates.

Summary of Consequences

When the assumptions above hold, the OLS estimators are BLUE, meaning they are:

  • Best (Efficient): They have the smallest variance among all linear unbiased estimators.
  • Linear: They are a linear function of the dependent variable.
  • Unbiased: On average, the estimated coefficients equal the true population parameters.

If the assumptions are violated, you must use alternative techniques:

  • For Endogeneity (violation of 5): Use Instrumental Variables (IV) or Two-Stage Least Squares (2SLS).
  • For Heteroskedasticity (violation of 6): Use Robust Standard Errors (e.g., White’s correction).
  • For Autocorrelation (violation of 7): Use Newey-West Standard Errors or estimate a Generalized Least Squares (GLS) model.
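
For the standard-error remedies in particular, here is a minimal sketch (Python with statsmodels; the data are simulated placeholders) of how robust and Newey-West covariance estimators are requested:

```python
# A minimal sketch, assuming simulated data: the same OLS fit with
# heteroskedasticity-robust and with Newey-West (HAC) standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3 * x + 0.1, size=200)
X = sm.add_constant(x)

# White-type heteroskedasticity-robust standard errors:
robust = sm.OLS(y, X).fit(cov_type="HC1")
# Newey-West standard errors, robust to heteroskedasticity and to
# autocorrelation up to the chosen lag length:
hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(robust.bse)
print(hac.bse)
```

Note that the coefficient estimates are identical across the two fits; only the standard errors (and hence the inference) change.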

By testing for these violations and employing the appropriate remedies, you ensure that your OLS-based analysis provides sound and reliable conclusions.