5  Week 4: ANOVA Decomposition, Overall F Test, and Nested Models

In this week, we study the ANOVA decomposition for linear regression, the overall significance test, and the comparison of nested models. These ideas connect the geometry of least squares with the inferential tools developed in previous weeks.

5.1 Learning Objectives

By the end of this week, students should be able to:

  • define the total, regression, and error sums of squares;
  • explain the ANOVA decomposition in regression with an intercept;
  • interpret the degrees of freedom associated with SST, SSR, and SSE;
  • perform the overall \(F\) test for regression;
  • compare nested linear models using extra sums of squares;
  • interpret ANOVA tables produced by statistical software.

5.2 Reading

Recommended reading for this week:

  • Seber and Lee:
    • sections on analysis of variance in regression
    • sums of squares and decomposition of variability
    • tests for nested models
  • Montgomery, Peck, and Vining:
    • sections on the ANOVA table
    • overall model significance
    • partial and sequential sums of squares

5.3 Review of the Linear Model

Recall the normal linear model

\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I}_n). \]

When \(\mathbf{X}\) has full column rank, the ordinary least squares estimator is

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{Y}, \]

and the fitted values are

\[ \hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}}. \]

The residual vector is

\[ \mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}}. \]

From Week 2, we know that \(\hat{\mathbf{Y}}\) and \(\mathbf{e}\) are orthogonal. From Week 3, we know that this leads to useful distributional results for inference.

This week, we organize these ideas into the ANOVA framework.
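The orthogonality of fitted values and residuals can be checked numerically. The following sketch uses the same small data set as the worked example later in these notes and computes the least squares fit directly from the normal equations:

```r
x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 5)
X <- cbind(1, x)                                    # design matrix with an intercept column
beta_hat <- solve(crossprod(X), crossprod(X, y))    # (X'X)^{-1} X'Y
yhat <- drop(X %*% beta_hat)                        # fitted values
e <- y - yhat                                       # residuals
sum(yhat * e)                                       # numerically zero: yhat and e are orthogonal
```

The inner product is zero up to floating-point rounding, which is the orthogonality fact from Week 2 in computational form.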

5.4 Total, Explained, and Unexplained Variation

A central question in regression is:

How much of the variation in the response can be explained by the model?

To answer this, we decompose the total variation in \(\mathbf{Y}\) into:

  • variation explained by regression;
  • variation left unexplained by the model.

When the model includes an intercept, this decomposition takes a particularly simple and important form.

5.5 Total Sum of Squares

Assume the model contains an intercept.

The total variation in the response is measured by the total sum of squares

\[ \mathrm{SST} = \sum_{i=1}^n (Y_i - \bar{Y})^2, \]

where

\[ \bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i. \]

In vector form, if \(\mathbf{1}\) denotes the vector of ones, then

\[ \mathrm{SST} = (\mathbf{Y} - \bar{Y}\mathbf{1})^\top(\mathbf{Y} - \bar{Y}\mathbf{1}). \]

This measures variation around the sample mean.
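As a quick sketch, both forms of SST can be computed on the toy data used in the worked example later in these notes:

```r
y <- c(1, 3, 3, 5)
ones <- rep(1, length(y))
ybar <- mean(y)
SST_vec <- drop(crossprod(y - ybar * ones))   # vector form: (Y - Ybar*1)'(Y - Ybar*1)
SST_sum <- sum((y - ybar)^2)                  # elementwise form
c(SST_vec, SST_sum)                           # both equal 8 for these data
```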

5.6 Error Sum of Squares

The error sum of squares is

\[ \mathrm{SSE} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \mathbf{e}^\top \mathbf{e}. \]

This is the variation not explained by the fitted regression model.

It measures how far the observed responses are from the fitted values.

5.7 Regression Sum of Squares

The regression sum of squares is

\[ \mathrm{SSR} = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2. \]

This measures the part of the total variation explained by the regression model.

In vector form,

\[ \mathrm{SSR} = (\hat{\mathbf{Y}} - \bar{Y}\mathbf{1})^\top(\hat{\mathbf{Y}} - \bar{Y}\mathbf{1}). \]

5.8 The ANOVA Decomposition

When the model includes an intercept, we have the decomposition

\[ \mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}. \]

This is one of the most important identities in regression.

It says that the total variation around the sample mean can be decomposed into:

  • variation explained by the regression model;
  • variation remaining in the residuals.

5.9 Why the Decomposition Holds

The key reason is orthogonality.

We can write

\[ \mathbf{Y} - \bar{Y}\mathbf{1} = (\hat{\mathbf{Y}} - \bar{Y}\mathbf{1}) + (\mathbf{Y} - \hat{\mathbf{Y}}). \]

That is,

\[ \mathbf{Y} - \bar{Y}\mathbf{1} = (\hat{\mathbf{Y}} - \bar{Y}\mathbf{1}) + \mathbf{e}. \]

Because the model contains an intercept, the vector \(\bar{Y}\mathbf{1}\) lies in the column space of \(\mathbf{X}\). Hence both \(\hat{\mathbf{Y}}\) and \(\bar{Y}\mathbf{1}\) lie in the model space, so their difference also lies in the model space.

Since the residual vector \(\mathbf{e}\) is orthogonal to the model space, we have

\[ (\hat{\mathbf{Y}} - \bar{Y}\mathbf{1})^\top \mathbf{e} = 0. \]

Therefore, by the Pythagorean theorem,

\[ \|\mathbf{Y} - \bar{Y}\mathbf{1}\|^2 = \|\hat{\mathbf{Y}} - \bar{Y}\mathbf{1}\|^2 + \|\mathbf{e}\|^2, \]

which is exactly

\[ \mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}. \]

5.10 Degrees of Freedom

The ANOVA decomposition is accompanied by a decomposition of degrees of freedom.

When the model includes an intercept and \(\mathbf{X}\) has rank \(p\), we have:

  • total degrees of freedom: \(n-1\);
  • regression degrees of freedom: \(p-1\);
  • error degrees of freedom: \(n-p\).

Thus,

\[ n-1 = (p-1) + (n-p). \]

Each degree of freedom pairs with the corresponding term of the sums of squares decomposition:


\[ \mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}. \]

5.11 Mean Squares

To compare sums of squares on a common scale, we divide by their associated degrees of freedom.

The mean square for regression is

\[ \mathrm{MSR} = \frac{\mathrm{SSR}}{p-1}. \]

The mean square error is

\[ \mathrm{MSE} = \frac{\mathrm{SSE}}{n-p}. \]

From Week 3, we know that MSE is an unbiased estimator of \(\sigma^2\).

5.12 The Overall F Test

A major inferential question is whether the regression model provides any explanatory power beyond the intercept-only model.

Suppose the model includes an intercept and \(p-1\) additional predictors, with \(\beta_1\) denoting the intercept coefficient. The null hypothesis is

\[ H_0: \beta_2 = \beta_3 = \cdots = \beta_p = 0. \]

Equivalently, under \(H_0\), the mean response does not depend on the predictors.

The alternative is that at least one non-intercept coefficient is nonzero.

5.13 Test Statistic

The overall \(F\) statistic is

\[ F = \frac{\mathrm{MSR}}{\mathrm{MSE}} = \frac{\mathrm{SSR}/(p-1)}{\mathrm{SSE}/(n-p)}. \]

Under the null hypothesis,

\[ F \sim F_{p-1,\;n-p}. \]

Large values of \(F\) provide evidence against \(H_0\).
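In R, the p-value for an observed overall \(F\) statistic comes from the \(F\) distribution function `pf()`. The sketch below uses the numbers from the worked example later in these notes (\(F = 18\) on 1 and 2 degrees of freedom):

```r
F_obs <- 18
p_value <- pf(F_obs, df1 = 1, df2 = 2, lower.tail = FALSE)
p_value   # approximately 0.0513
```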

5.14 Interpretation of the Overall F Test

The numerator measures explained variation per regression degree of freedom.

The denominator measures unexplained variation per residual degree of freedom.

So the \(F\) statistic compares:

  • how much signal the model explains;
  • how much noise remains in the residuals.

If the predictors have no effect, then both quantities should be of similar size, and the ratio should not be unusually large.

If the predictors explain substantial variation, then the numerator should be much larger than the denominator.

5.15 Relationship to the Intercept-Only Model

The overall \(F\) test compares two models:

  • the reduced model: intercept only;
  • the full model: intercept plus predictors.

Thus the ANOVA decomposition provides the basis for formal model comparison.

This leads naturally to the idea of nested models.

5.16 Nested Models

Two models are nested if the reduced model is obtained by imposing constraints on the full model.

For example:

  • reduced model: \[ Y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i; \]

  • full model: \[ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i. \]

The reduced model is nested within the full model because it is obtained by setting

\[ \beta_2 = 0. \]

5.17 Extra Sum of Squares Principle

Suppose:

  • the reduced model has error sum of squares \(\mathrm{SSE}_R\) and rank \(p_R\);
  • the full model has error sum of squares \(\mathrm{SSE}_F\) and rank \(p_F\).

Since the full model has more flexibility, it cannot fit worse, so

\[ \mathrm{SSE}_F \le \mathrm{SSE}_R. \]

The quantity

\[ \mathrm{SSE}_R - \mathrm{SSE}_F \]

measures the reduction in error due to adding the extra predictors.

This is called the extra sum of squares due to the added terms.

5.18 F Test for Nested Models

To test whether the extra predictors in the full model are needed, we use

\[ F = \frac{ (\mathrm{SSE}_R - \mathrm{SSE}_F)/(p_F - p_R) }{ \mathrm{SSE}_F/(n-p_F) }. \]

Under the null hypothesis that the additional parameters are unnecessary,

\[ F \sim F_{p_F-p_R,\;n-p_F}. \]

Large values indicate that the full model provides a significantly better fit.
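The statistic can be computed directly from the formula. The sketch below uses hypothetical sums of squares and ranks, purely for illustration:

```r
# Hypothetical values: reduced model with rank 2, full model with rank 4, n = 20.
SSE_R <- 100; p_R <- 2
SSE_F <- 80;  p_F <- 4
n <- 20
F_stat <- ((SSE_R - SSE_F) / (p_F - p_R)) / (SSE_F / (n - p_F))
F_stat                                              # (20/2) / (80/16) = 2
pf(F_stat, p_F - p_R, n - p_F, lower.tail = FALSE)  # p-value from F(2, 16)
```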

5.19 Connection with General Linear Hypotheses

This nested-model \(F\) test is equivalent to testing a general linear hypothesis of the form

\[ H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{d}. \]

So the ANOVA comparison of nested models is another way of expressing the general \(F\) test from Week 3.

In practice, this is one of the most common uses of regression ANOVA tables.

5.20 Sequential and Partial Sums of Squares

In multiple regression, sums of squares can be defined in different ways depending on what is being adjusted for.

Two common ideas are:

  • sequential sums of squares: terms are added in a specified order;
  • partial sums of squares: each term is tested after adjusting for the others.

This distinction becomes important when predictors are correlated.

In this course, the main conceptual priority is to understand the extra sum of squares principle. Details of Type I, Type II, and Type III sums of squares can be introduced later if needed.
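The distinction can be seen in R: `anova()` reports sequential (order-dependent) sums of squares, while `drop1()` tests each term after adjusting for all the others. The sketch below uses simulated, hypothetical data with deliberately correlated predictors:

```r
set.seed(1)
x1 <- rnorm(20)
x2 <- 0.5 * x1 + rnorm(20)        # x2 is correlated with x1
y <- 1 + x1 + rnorm(20)
fit <- lm(y ~ x1 + x2)

anova(fit)                         # sequential: x1 entered first, then x2
drop1(fit, test = "F")             # each term adjusted for the other
```

Refitting with the predictors in the opposite order, `lm(y ~ x2 + x1)`, changes the sequential sums of squares but not the `drop1()` results, which is exactly why the distinction matters for correlated predictors.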

5.21 Coefficient of Determination

The coefficient of determination is

\[ R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}. \]

It measures the proportion of total variation explained by the regression model.

For a least squares fit with an intercept, its value lies between 0 and 1.

5.22 Interpretation of R Squared

  • If \(R^2\) is close to 1, the model explains a large proportion of the variation in the response.
  • If \(R^2\) is close to 0, the model explains little of the variation.

However, \(R^2\) alone does not guarantee that the model is appropriate. A high \(R^2\) does not ensure that assumptions are satisfied, and a low \(R^2\) does not necessarily imply the model is useless.

5.23 Adjusted R Squared

Because \(R^2\) never decreases when additional predictors are added, it can overstate improvement.

A commonly used adjustment is

\[ R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{SSE}/(n-p)}{\mathrm{SST}/(n-1)}. \]

Adjusted \(R^2\) penalizes the inclusion of unnecessary predictors.
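A quick sketch of the formula, using the toy data from the worked example later in these notes (\(n = 4\), \(p = 2\)):

```r
x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 5)
fit <- lm(y ~ x)
n <- length(y); p <- 2

SSE <- sum(resid(fit)^2)
SST <- sum((y - mean(y))^2)
R2_adj <- 1 - (SSE / (n - p)) / (SST / (n - 1))
R2_adj                          # 1 - 0.4 / (8/3) = 0.85, matching summary(fit)
```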

5.24 Worked Example by Hand

Consider the data

\[ \begin{array}{c|cccc} x_i & 0 & 1 & 2 & 3 \\ \hline y_i & 1 & 3 & 3 & 5 \end{array} \]

We already found that the fitted regression line is

\[ \hat{Y} = 1.2 + 1.2x. \]

The observed response vector is

\[ \mathbf{Y} = \begin{bmatrix} 1 \\ 3 \\ 3 \\ 5 \end{bmatrix}, \]

and the fitted values are

\[ \hat{\mathbf{Y}} = \begin{bmatrix} 1.2 \\ 2.4 \\ 3.6 \\ 4.8 \end{bmatrix}. \]

The sample mean is

\[ \bar{Y} = 3. \]

5.24.1 Compute SST

\[ \mathrm{SST} = (1-3)^2 + (3-3)^2 + (3-3)^2 + (5-3)^2 = 4 + 0 + 0 + 4 = 8. \]

5.24.2 Compute SSE

The residuals are

\[ \mathbf{e} = \begin{bmatrix} -0.2 \\ 0.6 \\ -0.6 \\ 0.2 \end{bmatrix}, \]

so

\[ \mathrm{SSE} = (-0.2)^2 + 0.6^2 + (-0.6)^2 + 0.2^2 = 0.8. \]

5.24.3 Compute SSR

Using \(\mathrm{SSR} = \mathrm{SST} - \mathrm{SSE}\),

\[ \mathrm{SSR} = 8 - 0.8 = 7.2. \]

5.24.4 Check the decomposition

\[ 8 = 7.2 + 0.8. \]

5.24.5 Degrees of freedom

Here \(n=4\) and \(p=2\), so:

  • total df: \(4-1=3\);
  • regression df: \(2-1=1\);
  • error df: \(4-2=2\).

5.24.6 Mean squares

\[ \mathrm{MSR} = \frac{7.2}{1} = 7.2, \qquad \mathrm{MSE} = \frac{0.8}{2} = 0.4. \]

5.24.7 F statistic

\[ F = \frac{7.2}{0.4} = 18. \]

This is the overall significance test for the regression.
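As a quick check of the hand computation, the observed \(F = 18\) can be compared with the 5% critical value of the \(F_{1,2}\) distribution using `qf()`:

```r
F_obs <- 18
qf(0.95, df1 = 1, df2 = 2)   # critical value, approximately 18.51
```

Since 18 falls just below the critical value, \(H_0\) is not rejected at the 5% level, consistent with the p-value of about 0.051 in the software output below.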

5.25 ANOVA Table Structure

A standard regression ANOVA table has the following structure:

\[ \begin{array}{l|cccc}
\text{Source} & \text{Sum of Squares} & \text{Degrees of Freedom} & \text{Mean Square} & F \\ \hline
\text{Regression} & \mathrm{SSR} & p-1 & \mathrm{MSR} & \mathrm{MSR}/\mathrm{MSE} \\
\text{Error} & \mathrm{SSE} & n-p & \mathrm{MSE} & \\
\text{Total} & \mathrm{SST} & n-1 & &
\end{array} \]

Students should learn to move fluently between:

  • formulas;
  • geometric interpretation;
  • software output.

5.26 R Demonstration

5.27 Fit a simple regression model

x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 5)

fit <- lm(y ~ x)
summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
   1    2    3    4 
-0.2  0.6 -0.6  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   1.2000     0.5292   2.268   0.1515  
x             1.2000     0.2828   4.243   0.0513 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6325 on 2 degrees of freedom
Multiple R-squared:    0.9, Adjusted R-squared:   0.85 
F-statistic:    18 on 1 and 2 DF,  p-value: 0.05132

5.28 Obtain the ANOVA table

anova(fit)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value  Pr(>F)  
x          1    7.2     7.2      18 0.05132 .
Residuals  2    0.8     0.4                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.29 Verify sums of squares manually

ybar <- mean(y)
yhat <- fitted(fit)
e <- resid(fit)

SST <- sum((y - ybar)^2)
SSE <- sum(e^2)
SSR <- sum((yhat - ybar)^2)

c(SST = SST, SSR = SSR, SSE = SSE)
SST SSR SSE 
8.0 7.2 0.8 
SST - SSR - SSE
[1] 2.220446e-16

5.30 Compute R squared manually

R2 <- SSR / SST
R2
[1] 0.9
summary(fit)$r.squared
[1] 0.9
summary(fit)$adj.r.squared
[1] 0.85

5.31 Compare nested models

dat <- data.frame(
  y = c(4, 5, 7, 10, 8, 12, 13, 14),
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(2, 1, 3, 2, 5, 4, 6, 5)
)

fit_reduced <- lm(y ~ x1, data = dat)
fit_full <- lm(y ~ x1 + x2, data = dat)

anova(fit_reduced, fit_full)
Analysis of Variance Table

Model 1: y ~ x1
Model 2: y ~ x1 + x2
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1      6 6.8214                           
2      5 4.6613  1    2.1601 2.3171 0.1884

5.32 Inspect the two fitted models

summary(fit_reduced)

Call:
lm(formula = y ~ x1, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.85714 -0.30357  0.03571  0.33036  1.60714 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.5357     0.8308   3.052 0.022453 *  
x1            1.4643     0.1645   8.900 0.000112 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.066 on 6 degrees of freedom
Multiple R-squared:  0.9296,    Adjusted R-squared:  0.9178 
F-statistic: 79.21 on 1 and 6 DF,  p-value: 0.0001121
summary(fit_full)

Call:
lm(formula = y ~ x1 + x2, data = dat)

Residuals:
      1       2       3       4       5       6       7       8 
 0.4032 -1.0403  0.3306  0.8871 -1.1371  0.4194  0.7903 -0.6532 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   2.9677     0.8041   3.691  0.01413 * 
x1            1.8387     0.2876   6.394  0.00139 **
x2           -0.6048     0.3973  -1.522  0.18845   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9655 on 5 degrees of freedom
Multiple R-squared:  0.9519,    Adjusted R-squared:  0.9326 
F-statistic: 49.46 on 2 and 5 DF,  p-value: 0.0005079

5.33 Interpretation of Software Output

For a fitted model in R:

  • anova(fit) gives the ANOVA decomposition for a single model;
  • anova(fit_reduced, fit_full) compares nested models;
  • summary(fit) reports the overall \(F\) statistic, \(R^2\), and adjusted \(R^2\).

Students should understand that these outputs are not separate topics. They are all built from the same least squares geometry and distribution theory.

5.34 In-Class Discussion Questions

  1. Why does the ANOVA decomposition require an intercept for the usual SST = SSR + SSE identity?
  2. Why must \(\mathrm{SSE}_F \le \mathrm{SSE}_R\) for nested models?
  3. What does the overall \(F\) test tell us that individual \(t\) tests do not?
  4. Why can \(R^2\) be misleading if used alone?

5.35 Practice Problems

5.36 Conceptual

  1. Explain the meaning of SST, SSR, and SSE in words.
  2. Explain why the regression degrees of freedom are \(p-1\) when the model includes an intercept.
  3. Explain the difference between the overall \(F\) test and a test for a single coefficient.

5.37 Computational

Suppose a regression model with intercept has:

  • \(n=20\),
  • \(p=4\),
  • \(\mathrm{SST}=100\),
  • \(\mathrm{SSE}=40\).

Compute:

  1. \(\mathrm{SSR}\),
  2. the degrees of freedom for regression and error,
  3. \(\mathrm{MSR}\),
  4. \(\mathrm{MSE}\),
  5. the overall \(F\) statistic,
  6. \(R^2\).

5.38 Nested Model Problem

A reduced model has

\[ \mathrm{SSE}_R = 120 \]

with \(p_R = 3\), and a full model has

\[ \mathrm{SSE}_F = 90 \]

with \(p_F = 5\).

If \(n=30\), compute the nested-model \(F\) statistic.

5.39 Suggested Homework

Complete the following tasks:

  • prove the decomposition \(\mathrm{SST}=\mathrm{SSR}+\mathrm{SSE}\) when the model includes an intercept;
  • derive the overall \(F\) statistic from the ANOVA decomposition;
  • fit a regression model in R and reproduce the ANOVA table by hand;
  • compare two nested models using an extra sum of squares test;
  • interpret both \(R^2\) and adjusted \(R^2\) for a chosen dataset.

5.40 Summary

In this week, we developed the ANOVA framework for linear regression.

We defined:

\[ \mathrm{SST}, \qquad \mathrm{SSR}, \qquad \mathrm{SSE}, \]

and showed that, with an intercept,

\[ \mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}. \]

This decomposition led to:

  • the ANOVA table;
  • the overall \(F\) test for regression;
  • the comparison of nested models through extra sums of squares;
  • the interpretation of \(R^2\) and adjusted \(R^2\).

Next week, a natural continuation is to study multiple regression in greater depth, including interpretation of partial regression coefficients and multicollinearity, or to move into matrix-based general linear hypotheses and estimability, depending on the course emphasis.

5.41 Appendix: Compact Formula Summary

With an intercept in the model,

\[ \mathrm{SST} = \sum_{i=1}^n (Y_i-\bar{Y})^2, \]

\[ \mathrm{SSE} = \sum_{i=1}^n (Y_i-\hat{Y}_i)^2, \]

\[ \mathrm{SSR} = \sum_{i=1}^n (\hat{Y}_i-\bar{Y})^2, \]

and

\[ \mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}. \]

Also,

\[ \mathrm{MSR} = \frac{\mathrm{SSR}}{p-1}, \qquad \mathrm{MSE} = \frac{\mathrm{SSE}}{n-p}, \qquad F = \frac{\mathrm{MSR}}{\mathrm{MSE}}, \]

and

\[ R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1-\frac{\mathrm{SSE}}{\mathrm{SST}}. \]