4  Week 3: Distribution Theory of OLS and Inference

This week we study the sampling distribution of the ordinary least squares estimator under the normal linear model. This allows us to quantify uncertainty, estimate the error variance, construct confidence intervals, perform hypothesis tests, and distinguish inference for the mean response from prediction of a future observation.

4.1 Learning Objectives

By the end of this week, students should be able to:

  • state the normal linear model;
  • derive the distribution of the OLS estimator;
  • obtain an unbiased estimator of \(\sigma^2\);
  • understand the role of chi-square, \(t\), and \(F\) distributions in linear regression;
  • construct confidence intervals for regression coefficients and mean responses;
  • perform hypothesis tests for individual coefficients and general linear hypotheses;
  • distinguish between confidence intervals for the mean response and prediction intervals for a new observation.

4.2 Reading

Recommended reading for this week:

  • Seber and Lee:
    • sections on the distribution theory of least squares estimators
    • estimation of the error variance
    • inference for regression coefficients
  • Montgomery, Peck, and Vining:
    • sections on confidence intervals, hypothesis testing, and prediction in linear regression

5 Review of the Linear Model

Recall the linear model

\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \]

where

  • \(\mathbf{Y}\) is an \(n \times 1\) response vector,
  • \(\mathbf{X}\) is an \(n \times p\) design matrix,
  • \(\boldsymbol{\beta}\) is a \(p \times 1\) unknown parameter vector,
  • \(\boldsymbol{\varepsilon}\) is an \(n \times 1\) error vector.

From Week 2, when \(\mathbf{X}\) has full column rank, the ordinary least squares estimator is

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{Y}. \]

We also know that

\[ \mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}, \qquad \mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top \mathbf{X})^{-1}, \]

provided that

\[ \mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}, \qquad \mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}_n. \]

To obtain exact finite-sample inference, we now strengthen the model assumptions.
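As a quick computational check of these formulas, the OLS estimator can be computed directly from the normal equations. The sketch below uses the small four-point dataset from Week 2, which is revisited in the worked example later in these notes.

```r
# OLS via the normal equations (X'X) beta = X'y.
# Data: the small four-point dataset from Week 2.
x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 5)
X <- cbind(1, x)                            # n x p design matrix with intercept

beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
drop(beta_hat)                              # both components equal 1.2
```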

6 The Normal Linear Model

We assume

\[ \mathbf{Y} \sim N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n). \]

Equivalently,

\[ \boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I}_n). \]

This is called the normal linear model.

Under this assumption, exact sampling distributions can be derived for the OLS estimator, residual sum of squares, and many test statistics.

6.1 Why normality matters

Without normality, OLS is still unbiased under the standard moment assumptions, but exact \(t\) and \(F\) inference generally no longer holds in finite samples.

Normality gives us:

  • exact distribution of \(\hat{\boldsymbol{\beta}}\);
  • exact chi-square distribution for the residual sum of squares;
  • exact \(t\) and \(F\) tests.

7 Distribution of the OLS Estimator

Since

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{Y}, \]

the estimator is a linear transformation of \(\mathbf{Y}\).

Because a linear transformation of a multivariate normal vector is again multivariate normal, we immediately obtain:

\[ \hat{\boldsymbol{\beta}} \sim N_p\!\left( \boldsymbol{\beta}, \sigma^2(\mathbf{X}^\top \mathbf{X})^{-1} \right). \]

Thus,

  • the mean of \(\hat{\boldsymbol{\beta}}\) is \(\boldsymbol{\beta}\);
  • the covariance matrix of \(\hat{\boldsymbol{\beta}}\) is \(\sigma^2(\mathbf{X}^\top \mathbf{X})^{-1}\).
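A small Monte Carlo experiment illustrates this sampling distribution. The design below reuses the four-point covariate from the worked example; the values \(\boldsymbol{\beta} = (1, 2)^\top\) and \(\sigma = 0.5\) are illustrative choices, not taken from the notes.

```r
# Monte Carlo check of beta_hat ~ N_p(beta, sigma^2 (X'X)^{-1}).
# beta = (1, 2)' and sigma = 0.5 are illustrative values.
set.seed(1)
x <- c(0, 1, 2, 3)
X <- cbind(1, x)
beta  <- c(1, 2)
sigma <- 0.5
XtX   <- t(X) %*% X

draws <- replicate(20000, {
  y <- X %*% beta + rnorm(nrow(X), sd = sigma)
  as.vector(solve(XtX, t(X) %*% y))
})

rowMeans(draws)        # close to beta = (1, 2)
sigma^2 * solve(XtX)   # theoretical covariance
cov(t(draws))          # empirical covariance, close to the line above
```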

7.1 Individual coefficients

For the \(j\)th coefficient,

\[ \hat{\beta}_j \sim N\!\left(\beta_j, \sigma^2 c_{jj}\right), \]

where \(c_{jj}\) is the \(j\)th diagonal element of \((\mathbf{X}^\top \mathbf{X})^{-1}\).

Hence,

\[ \frac{\hat{\beta}_j - \beta_j}{\sigma \sqrt{c_{jj}}} \sim N(0,1). \]

This is useful, but it still depends on the unknown \(\sigma\).

8 Residual Sum of Squares and Estimation of \(\sigma^2\)

8.1 Residual vector

The residual vector is

\[ \mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} = (\mathbf{I}_n - \mathbf{H})\mathbf{Y}, \]

where

\[ \mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \]

is the hat matrix.

8.2 Residual sum of squares

The residual sum of squares is

\[ \mathrm{SSE} = \mathbf{e}^\top \mathbf{e} = (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})^\top (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}). \]

Equivalently,

\[ \mathrm{SSE} = \mathbf{Y}^\top(\mathbf{I}_n - \mathbf{H})\mathbf{Y}. \]

8.3 Distribution of SSE

Under the normal linear model,

\[ \frac{\mathrm{SSE}}{\sigma^2} \sim \chi^2_{n-p}. \]

The degrees of freedom are \(n-p\) because the residual vector is constrained to lie in the \((n-p)\)-dimensional orthogonal complement of the column space of \(\mathbf{X}\): estimating \(p\) parameters removes \(p\) degrees of freedom.

8.4 Unbiased estimator of \(\sigma^2\)

Since the mean of a chi-square random variable with \(k\) degrees of freedom is \(k\), we have

\[ \mathbb{E}[\mathrm{SSE}] = (n-p)\sigma^2. \]

Therefore,

\[ \hat{\sigma}^2 = \frac{\mathrm{SSE}}{n-p} \]

is an unbiased estimator of \(\sigma^2\).

This quantity is also called the mean squared error:

\[ \mathrm{MSE} = \frac{\mathrm{SSE}}{n-p}. \]
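These quantities can be computed directly from the hat matrix. The sketch below uses the four-point dataset from the worked example later in these notes, and reproduces \(\mathrm{SSE} = 0.8\) and \(\hat{\sigma}^2 = 0.4\).

```r
# SSE and the unbiased variance estimate via the hat matrix.
# Data: the four-point worked example.
x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 5)
X <- cbind(1, x)
n <- nrow(X); p <- ncol(X)

H   <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
e   <- drop((diag(n) - H) %*% y)          # residuals (-0.2, 0.6, -0.6, 0.2)
SSE <- sum(e^2)                           # 0.8
sigma2_hat <- SSE / (n - p)               # MSE = 0.4
```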

9 Independence Between \(\hat{\boldsymbol{\beta}}\) and SSE

A crucial result under the normal linear model is that

\[ \hat{\boldsymbol{\beta}} \quad \text{and} \quad \mathrm{SSE} \]

are independent.

This is special and extremely useful. It is what allows us to replace the unknown \(\sigma\) with \(\hat{\sigma}\) and obtain exact \(t\) and \(F\) distributions.

Intuitively:

  • \(\hat{\boldsymbol{\beta}}\) depends only on the projection of \(\mathbf{Y}\) onto the column space \(\mathcal{C}(\mathbf{X})\);
  • \(\mathrm{SSE}\) depends only on the orthogonal residual component.

Because these parts are orthogonal and jointly normal, they are independent.
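This orthogonality can be verified numerically: the hat matrix \(\mathbf{H}\) and the residual projector \(\mathbf{I} - \mathbf{H}\) annihilate each other. The sketch below uses the design matrix from the worked example.

```r
# The fitted part H y and residual part (I - H) y are orthogonal because
# H (I - H) = 0; under normality this orthogonality gives independence.
x <- c(0, 1, 2, 3)
X <- cbind(1, x)
n <- nrow(X)
H <- X %*% solve(t(X) %*% X) %*% t(X)

max(abs(H %*% (diag(n) - H)))   # numerically zero
max(abs(H %*% H - H))           # H is idempotent: numerically zero
```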

10 Inference for a Single Coefficient

10.1 t statistic

Since

\[ \hat{\beta}_j \sim N(\beta_j, \sigma^2 c_{jj}), \]

and since

\[ \frac{(n-p)\hat{\sigma}^2}{\sigma^2} = \frac{\mathrm{SSE}}{\sigma^2} \sim \chi^2_{n-p}, \]

with independence between \(\hat{\beta}_j\) and \(\hat{\sigma}^2\), it follows that

\[ T = \frac{\hat{\beta}_j - \beta_j}{\hat{\sigma}\sqrt{c_{jj}}} \sim t_{n-p}. \]

10.2 Confidence interval

A \(100(1-\alpha)\%\) confidence interval for \(\beta_j\) is

\[ \hat{\beta}_j \pm t_{1-\alpha/2,\;n-p}\, \hat{\sigma}\sqrt{c_{jj}}. \]

10.3 Hypothesis test

To test

\[ H_0: \beta_j = \beta_{j,0} \qquad \text{versus} \qquad H_1: \beta_j \ne \beta_{j,0}, \]

we use

\[ T = \frac{\hat{\beta}_j - \beta_{j,0}}{\hat{\sigma}\sqrt{c_{jj}}}. \]

Under \(H_0\),

\[ T \sim t_{n-p}. \]

For a two-sided test, reject \(H_0\) if

\[ |T| > t_{1-\alpha/2,\;n-p}. \]
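As a minimal sketch, this test can be carried out by hand for the slope of the four-point example dataset and checked against the coefficient table from summary().

```r
# t test for H0: beta_1 = 0 by hand, for the four-point example.
x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 5)
X <- cbind(1, x)
fit <- lm(y ~ x)
n <- length(y); p <- ncol(X)

c_jj   <- solve(t(X) %*% X)[2, 2]                # c_{jj} for the slope, 4/20
se     <- sigma(fit) * sqrt(c_jj)                # estimated standard error
T_stat <- unname(coef(fit)[2]) / se              # about 4.243
p_val  <- 2 * (1 - pt(abs(T_stat), df = n - p))  # about 0.051
c(T_stat, p_val)
```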

11 Inference for Linear Combinations

Often we are interested not in a single coefficient, but in a linear combination

\[ a^\top \boldsymbol{\beta}, \]

where \(a\) is a fixed \(p \times 1\) vector.

For example:

  • a single coefficient;
  • the difference between two coefficients;
  • the mean response at a given covariate value.

Since

\[ a^\top \hat{\boldsymbol{\beta}} \sim N\!\left( a^\top \boldsymbol{\beta}, \sigma^2 a^\top (\mathbf{X}^\top \mathbf{X})^{-1} a \right), \]

we obtain

\[ \frac{ a^\top \hat{\boldsymbol{\beta}} - a^\top \boldsymbol{\beta} }{ \hat{\sigma}\sqrt{a^\top (\mathbf{X}^\top \mathbf{X})^{-1} a} } \sim t_{n-p}. \]

Thus, a confidence interval for \(a^\top \boldsymbol{\beta}\) is

\[ a^\top \hat{\boldsymbol{\beta}} \pm t_{1-\alpha/2,\;n-p} \hat{\sigma} \sqrt{a^\top (\mathbf{X}^\top \mathbf{X})^{-1} a}. \]
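A minimal sketch: with \(a = (0, 1)^\top\), which picks out the slope, this formula reproduces the corresponding row of confint() for the four-point example dataset.

```r
# CI for a' beta with a = (0, 1)'; compare with confint(fit).
x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 5)
X <- cbind(1, x)
fit <- lm(y ~ x)
n <- length(y); p <- ncol(X)

a     <- c(0, 1)
est   <- sum(a * coef(fit))
se    <- sigma(fit) * sqrt(drop(t(a) %*% solve(t(X) %*% X) %*% a))
tcrit <- qt(0.975, df = n - p)
c(est - tcrit * se, est + tcrit * se)   # matches the "x" row of confint(fit)
```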

12 General Linear Hypotheses and F Tests

12.1 Linear hypothesis

Suppose we want to test

\[ H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{d}, \]

where \(\mathbf{C}\) is an \(r \times p\) matrix of rank \(r\), and \(\mathbf{d}\) is an \(r \times 1\) vector.

This is called a general linear hypothesis.

Examples include:

  • testing one coefficient;
  • testing several coefficients simultaneously;
  • testing equality of coefficients.

12.2 F statistic

Under \(H_0\), the appropriate test statistic is

\[ F = \frac{ (\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{d})^\top \left[ \mathbf{C}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{C}^\top \right]^{-1} (\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{d})/r }{ \hat{\sigma}^2 }. \]

Under \(H_0\),

\[ F \sim F_{r,\;n-p}. \]

This gives the general framework for multi-parameter tests.

12.3 Relationship between t and F

When \(r=1\), the \(F\) statistic equals the square of the corresponding \(t\) statistic:

\[ F = T^2. \]
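This identity is easy to confirm numerically for simple regression on the four-point example dataset, where the overall \(F\) test and the slope \(t\) test coincide.

```r
# With r = 1 the F statistic equals the square of the t statistic.
x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 5)
fit <- lm(y ~ x)

t_val <- summary(fit)$coefficients["x", "t value"]
f_val <- unname(summary(fit)$fstatistic["value"])
all.equal(f_val, t_val^2)   # TRUE: both give F = 18
```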

13 Confidence Intervals for the Mean Response

Suppose we want to estimate the mean response at a covariate vector \(x_0\).

Let

\[ \mu(x_0) = x_0^\top \boldsymbol{\beta}. \]

The estimator is

\[ \hat{\mu}(x_0) = x_0^\top \hat{\boldsymbol{\beta}}. \]

Its variance is

\[ \mathrm{Var}(\hat{\mu}(x_0)) = \sigma^2 x_0^\top (\mathbf{X}^\top \mathbf{X})^{-1} x_0. \]

Therefore,

\[ \frac{ x_0^\top \hat{\boldsymbol{\beta}} - x_0^\top \boldsymbol{\beta} }{ \hat{\sigma}\sqrt{x_0^\top (\mathbf{X}^\top \mathbf{X})^{-1} x_0} } \sim t_{n-p}. \]

A \(100(1-\alpha)\%\) confidence interval for the mean response at \(x_0\) is

\[ x_0^\top \hat{\boldsymbol{\beta}} \pm t_{1-\alpha/2,\;n-p} \hat{\sigma} \sqrt{x_0^\top (\mathbf{X}^\top \mathbf{X})^{-1} x_0}. \]
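The sketch below computes this interval by hand at \(x_0 = (1, 2)^\top\) (i.e. \(x = 2\)) for the four-point example; it matches the confidence interval from predict() shown later in the R demonstration.

```r
# CI for the mean response at x0 corresponding to x = 2, by hand.
x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 5)
X <- cbind(1, x)
fit <- lm(y ~ x)
n <- length(y); p <- ncol(X)

x0    <- c(1, 2)                 # intercept entry, then x = 2
mu0   <- sum(x0 * coef(fit))     # fitted mean, 3.6
se0   <- sigma(fit) * sqrt(drop(t(x0) %*% solve(t(X) %*% X) %*% x0))
tcrit <- qt(0.975, df = n - p)
c(mu0 - tcrit * se0, mu0 + tcrit * se0)   # about (2.110, 5.090)
```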

14 Prediction Interval for a New Observation

Now suppose we want to predict a new future response at \(x_0\):

\[ Y_{\mathrm{new}} = x_0^\top \boldsymbol{\beta} + \varepsilon_{\mathrm{new}}, \]

where

\[ \varepsilon_{\mathrm{new}} \sim N(0,\sigma^2) \]

and is independent of the original data.

The prediction error is

\[ Y_{\mathrm{new}} - x_0^\top \hat{\boldsymbol{\beta}}. \]

Its variance is

\[ \mathrm{Var}(Y_{\mathrm{new}} - x_0^\top \hat{\boldsymbol{\beta}}) = \sigma^2 \left( 1 + x_0^\top (\mathbf{X}^\top \mathbf{X})^{-1} x_0 \right). \]

Hence, a \(100(1-\alpha)\%\) prediction interval for a new observation is

\[ x_0^\top \hat{\boldsymbol{\beta}} \pm t_{1-\alpha/2,\;n-p} \hat{\sigma} \sqrt{ 1 + x_0^\top (\mathbf{X}^\top \mathbf{X})^{-1} x_0 }. \]

14.1 Key difference

  • Confidence interval for the mean response: uncertainty about the regression mean;
  • Prediction interval: uncertainty about the regression mean plus random individual variation.

Therefore, prediction intervals are always wider.
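A sketch of the prediction interval at the same point \(x = 2\) for the four-point example: the only change from the mean-response interval is the extra 1 under the square root, which is exactly what makes the interval wider.

```r
# Prediction interval at x = 2; note the extra 1 under the square root.
x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 5)
X <- cbind(1, x)
fit <- lm(y ~ x)
n <- length(y); p <- ncol(X)

x0    <- c(1, 2)
mu0   <- sum(x0 * coef(fit))
sepr  <- sigma(fit) * sqrt(1 + drop(t(x0) %*% solve(t(X) %*% X) %*% x0))
tcrit <- qt(0.975, df = n - p)
c(mu0 - tcrit * sepr, mu0 + tcrit * sepr)   # about (0.497, 6.703)
```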

15 Worked Example by Hand

Consider again the dataset

\[ \begin{array}{c|cccc} x_i & 0 & 1 & 2 & 3 \\ \hline y_i & 1 & 3 & 3 & 5 \end{array} \]

with design matrix

\[ \mathbf{X} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \qquad \mathbf{Y} = \begin{bmatrix} 1 \\ 3 \\ 3 \\ 5 \end{bmatrix}. \]

From Week 2, we found

\[ \hat{\boldsymbol{\beta}} = \begin{bmatrix} 1.2 \\ 1.2 \end{bmatrix}. \]

The fitted values are

\[ \hat{\mathbf{Y}} = \begin{bmatrix} 1.2 \\ 2.4 \\ 3.6 \\ 4.8 \end{bmatrix}, \]

so the residuals are

\[ \mathbf{e} = \begin{bmatrix} -0.2 \\ 0.6 \\ -0.6 \\ 0.2 \end{bmatrix}. \]

Thus,

\[ \mathrm{SSE} = (-0.2)^2 + 0.6^2 + (-0.6)^2 + 0.2^2 = 0.8. \]

Since \(n=4\) and \(p=2\),

\[ \hat{\sigma}^2 = \frac{0.8}{4-2} = 0.4, \qquad \hat{\sigma} = \sqrt{0.4}. \]

Also,

\[ (\mathbf{X}^\top \mathbf{X})^{-1} = \frac{1}{20} \begin{bmatrix} 14 & -6 \\ -6 & 4 \end{bmatrix}. \]

Hence the estimated variance of \(\hat{\beta}_1\) is

\[ \widehat{\mathrm{Var}}(\hat{\beta}_1) = 0.4 \cdot \frac{4}{20} = 0.08, \]

so the standard error is

\[ \mathrm{SE}(\hat{\beta}_1) = \sqrt{0.08}. \]

To test

\[ H_0: \beta_1 = 0, \]

the \(t\) statistic is

\[ T = \frac{1.2}{\sqrt{0.08}}. \]
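The hand calculation can be finished numerically. Comparing \(T\) with the 5% critical value \(t_{0.975,\,2}\) anticipates the borderline \(p\)-value of about 0.051 seen in the R output in the next section.

```r
# Finishing the hand calculation numerically.
T_stat <- 1.2 / sqrt(0.08)    # about 4.243
t_crit <- qt(0.975, df = 2)   # about 4.303
T_stat > t_crit               # FALSE: H0 is (narrowly) not rejected at 5%
```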

16 R Demonstration

16.1 Fit the model

x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 5)

fit <- lm(y ~ x)
summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
   1    2    3    4 
-0.2  0.6 -0.6  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   1.2000     0.5292   2.268   0.1515  
x             1.2000     0.2828   4.243   0.0513 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6325 on 2 degrees of freedom
Multiple R-squared:    0.9, Adjusted R-squared:   0.85 
F-statistic:    18 on 1 and 2 DF,  p-value: 0.05132

coef(fit)
(Intercept)           x 
        1.2         1.2 

vcov(fit)
            (Intercept)     x
(Intercept)        0.28 -0.12
x                 -0.12  0.08

sqrt(diag(vcov(fit)))
(Intercept)           x 
  0.5291503   0.2828427 

deviance(fit)
[1] 0.8

sigma(fit)^2
[1] 0.4

sigma(fit)
[1] 0.6324555

confint(fit)
                  2.5 %   97.5 %
(Intercept) -1.07674982 3.476750
x           -0.01697397 2.416974

summary(fit)$coefficients
            Estimate Std. Error  t value  Pr(>|t|)
(Intercept)      1.2  0.5291503 2.267787 0.1514719
x                1.2  0.2828427 4.242641 0.0513167

newdat <- data.frame(x = 2)
predict(fit, newdata = newdat, interval = "confidence")
  fit      lwr      upr
1 3.6 2.109517 5.090483

predict(fit, newdata = newdat, interval = "prediction")
  fit      lwr      upr
1 3.6 0.497313 6.702687

xg <- seq(min(x), max(x), length.out = 100)
out_conf <- predict(fit, newdata = data.frame(x = xg), interval = "confidence")
out_pred <- predict(fit, newdata = data.frame(x = xg), interval = "prediction")

plot(x, y, pch = 19, xlab = "x", ylab = "y")
lines(xg, out_conf[, "fit"], lwd = 2)
lines(xg, out_conf[, "lwr"], lty = 2)
lines(xg, out_conf[, "upr"], lty = 2)
lines(xg, out_pred[, "lwr"], lty = 3)
lines(xg, out_pred[, "upr"], lty = 3)

17 Interpretation of Standard Output

In regression output from summary(lm(…)), the key columns are:

  • Estimate: the estimated coefficient;
  • Std. Error: the estimated standard deviation of the estimator;
  • t value: the test statistic for testing whether the coefficient equals zero;
  • Pr(>|t|): the corresponding \(p\)-value.

The output also reports:

  • residual standard error;
  • degrees of freedom;
  • \(R^2\) and adjusted \(R^2\);
  • an overall \(F\) test.

We will discuss the overall ANOVA-style decomposition more formally soon.

18 In-Class Discussion Questions

  1. Why does normality lead to exact finite-sample inference?
  2. Why do we divide SSE by \(n-p\) rather than \(n\)?
  3. Why are prediction intervals wider than confidence intervals for the mean response?
  4. Why is independence between \(\hat{\boldsymbol{\beta}}\) and SSE so important?

19 Practice Problems

Conceptual

  1. Explain the difference between the sampling distribution of \(\hat{\beta}_j\) and the distribution of \(Y_i\).
  2. Explain why \(\hat{\sigma}^2 = \mathrm{SSE}/(n-p)\) is unbiased.
  3. Explain why the \(t\) distribution appears instead of the normal distribution.

Computational

Suppose

\[ (\mathbf{X}^\top \mathbf{X})^{-1} = \begin{bmatrix} 0.5 & 0.1 \\ 0.1 & 0.2 \end{bmatrix}, \]

\(\hat{\boldsymbol{\beta}} = (2, -1)^\top\), and \(\hat{\sigma}^2 = 4\).

  1. Find the standard error of \(\hat{\beta}_2\).
  2. Construct a confidence interval for \(\beta_2\) using a generic critical value \(t^\star\).
  3. For \(x_0 = (1,3)^\top\), compute the estimated variance of the fitted mean.
  4. Write down the form of the prediction interval at \(x_0\).

Proof-based

Show that if

\[ \hat{\boldsymbol{\beta}} \sim N_p\!\left(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\right), \]

then for any fixed vector \(a\),

\[ a^\top \hat{\boldsymbol{\beta}} \sim N\!\left( a^\top \boldsymbol{\beta}, \sigma^2 a^\top(\mathbf{X}^\top\mathbf{X})^{-1}a \right). \]

20 Suggested Homework

Complete the following tasks:

  • derive the distribution of \(\hat{\boldsymbol{\beta}}\) under the normal linear model;
  • prove that \(\hat{\sigma}^2 = \mathrm{SSE}/(n-p)\) is unbiased;
  • derive the \(t\) statistic for one coefficient;
  • construct a confidence interval for the mean response at a chosen covariate value;
  • construct a prediction interval for a future observation at the same covariate value;
  • fit a regression model in R and interpret all coefficient-level inferential output.

21 Summary

This week we moved from estimation to inference under the normal linear model.

We showed that

\[ \hat{\boldsymbol{\beta}} \sim N_p\!\left( \boldsymbol{\beta}, \sigma^2(\mathbf{X}^\top \mathbf{X})^{-1} \right), \]

and that

\[ \frac{\mathrm{SSE}}{\sigma^2} \sim \chi^2_{n-p}, \qquad \hat{\sigma}^2 = \frac{\mathrm{SSE}}{n-p}. \]

These results, together with independence between \(\hat{\boldsymbol{\beta}}\) and SSE, lead to exact \(t\) and \(F\) inference.

Next week, we will develop the ANOVA decomposition in regression and study the overall significance test, nested models, and extra sum of squares.

Appendix: Useful Distribution Facts

If

\[ Z \sim N(0,1), \qquad U \sim \chi^2_\nu, \]

and \(Z\) and \(U\) are independent, then

\[ \frac{Z}{\sqrt{U/\nu}} \sim t_\nu. \]

If

\[ U_1 \sim \chi^2_{r}, \qquad U_2 \sim \chi^2_{\nu}, \]

and \(U_1\) and \(U_2\) are independent, then

\[ \frac{(U_1/r)}{(U_2/\nu)} \sim F_{r,\nu}. \]

These are the basic building blocks for regression inference.
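Both facts can be checked in R: first via an exact quantile identity linking \(t\) and \(F\) when \(r = 1\), then via a quick simulation of the ratio construction. The degrees of freedom \(\nu = 7\) is an arbitrary illustrative choice.

```r
# If T ~ t_nu then T^2 ~ F_{1, nu}, so the quantile functions must satisfy
# qf(p, 1, nu) = qt((1 + p)/2, nu)^2.
nu <- 7                                          # arbitrary illustrative choice
p  <- 0.95
all.equal(qf(p, 1, nu), qt((1 + p) / 2, nu)^2)   # TRUE

# Simulation of the t construction Z / sqrt(U / nu).
set.seed(1)
z <- rnorm(1e5)
u <- rchisq(1e5, df = nu)
t_sim <- z / sqrt(u / nu)
quantile(t_sim, 0.95)   # close to qt(0.95, nu)
```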