18 Model Selection in SAS
Learning Objectives
By the end of this lecture, you should be able to:
- Understand the main goal of model selection
- Explain the bias-variance trade-off
- Understand why training error is not enough
- Use cross-validation to assess prediction performance
- Interpret AIC and BIC as model comparison tools
- Implement model selection in SAS using PROC GLMSELECT
18.1 Introduction
In many modern data problems, we have a large number of candidate predictors.
A natural practical question is:
We have many variables, but which ones should we use in the model?
More generally, model selection includes questions such as:
- which predictors to include
- whether to use a simpler or more complex model
- whether interaction terms are needed
- how to choose tuning parameters or model forms
The goal of model selection is not simply to fit the largest model possible. Instead, we want a model that is:
- reasonably simple and interpretable
- accurate for prediction on new data
18.2 Why Model Selection Matters
A model that is too simple may miss important patterns in the data.
A model that is too complex may fit the training data very well, but perform poorly on new data.
So model selection is really about finding a model that generalizes well.
A good model balances:
- fit
- simplicity
- prediction performance
18.3 Bias-Variance Trade-Off
A central idea in model selection is the bias-variance trade-off.
Suppose
\[ Y = f(X) + \epsilon, \qquad \epsilon \sim N(0, \sigma_\epsilon^2). \]
Then the expected prediction error at a point (x) can be decomposed as
\[ E\big[(Y - \hat f(x))^2\big] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}. \]
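Written out term by term (a standard derivation, using \(E[\epsilon] = 0\) and the independence of \(\epsilon\) and \(\hat f(x)\)), the decomposition is

```latex
\[
E\big[(Y - \hat f(x))^2\big]
= \underbrace{\big(f(x) - E[\hat f(x)]\big)^2}_{\text{Bias}^2}
+ \underbrace{E\big[(\hat f(x) - E[\hat f(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma_\epsilon^2}_{\text{Irreducible Error}}.
\]
```

The irreducible error \(\sigma_\epsilon^2\) does not depend on the model, so model selection can only trade bias against variance.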
18.3.1 Intuition
- A simple model often has high bias but low variance
- A complex model often has low bias but high variance
As model complexity increases:
- bias usually decreases
- variance usually increases
So the total prediction error is often minimized at some intermediate level of complexity.
- If a model is too simple, it may underfit the data
- If a model is too flexible, it may overfit the data
- The best model is often somewhere in between
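The trade-off can be seen in a small simulation. The sketch below (synthetic data; the quadratic truth, noise level, and the two toy models are illustrative assumptions, not anything from SAS) repeatedly fits a very simple model (predict the sample mean) and a very flexible one (predict the response of the nearest training point) and estimates bias and variance at one test point:

```python
import random
import statistics

# Illustrative setup: true curve f(x) = x^2 with Gaussian noise.
random.seed(1)

def f(x):                         # true regression function
    return x ** 2

x0, n_sims, n = 0.5, 2000, 20     # test point, replications, sample size
simple_preds, flexible_preds = [], []
for _ in range(n_sims):
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [f(x) + random.gauss(0, 0.3) for x in xs]
    # Simple model: ignore x entirely (high bias, low variance).
    simple_preds.append(statistics.mean(ys))
    # Flexible model: copy the nearest neighbor (low bias, high variance).
    nearest = min(range(n), key=lambda i: abs(xs[i] - x0))
    flexible_preds.append(ys[nearest])

for name, preds in [("simple", simple_preds), ("flexible", flexible_preds)]:
    bias2 = (statistics.mean(preds) - f(x0)) ** 2
    var = statistics.pvariance(preds)
    print(f"{name:8s} bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

The simple model shows the larger squared bias, the flexible model the larger variance, matching the bullets above.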
18.4 Why Training Error Is Not Enough
Suppose we fit a model using the training data.
Then the training MSE often underestimates the true prediction error, because the model has already seen those data.
So if our goal is prediction, we need a better way to estimate how well the model performs on new observations.
This motivates cross-validation.
18.5 Cross-Validation
The goal of cross-validation is to estimate how well a model predicts new data.
18.5.1 Basic idea
- Split the data into two parts:
- a pseudo-training set
- a pseudo-test set
- Fit the model on the pseudo-training set
- Evaluate prediction error on the pseudo-test set
This gives an estimate of test MSE.
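The split-and-evaluate idea can be sketched in a few lines. This is an illustrative example on synthetic data (the model y = 2x + noise and the 50/50 split are assumptions for the demo), using simple linear regression as the learner:

```python
import random

# Synthetic data from an assumed model y = 2x + noise.
random.seed(2)
xs = [random.uniform(0, 10) for _ in range(100)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

# Pseudo-training set and pseudo-test set.
train_x, train_y = xs[:50], ys[:50]
test_x,  test_y  = xs[50:], ys[50:]

def fit_slr(x, y):
    """Least-squares intercept and slope of y on x."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
         sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1

# Fit on the training half, evaluate only on the held-out half.
b0, b1 = fit_slr(train_x, train_y)
test_mse = sum((b - (b0 + b1 * a)) ** 2
               for a, b in zip(test_x, test_y)) / len(test_x)
print(f"estimated test MSE: {test_mse:.3f}")
```

Because the test half played no role in fitting, its MSE is an honest estimate of prediction error (here it should sit near the noise variance, 1).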
18.6 K-Fold Cross-Validation
In K-fold cross-validation:
- Randomly divide the data into (K) folds
- For each fold (k):
- use fold (k) as the test set
- use the remaining (K-1) folds as the training set
- Compute the prediction error for each fold
- Average the errors
The cross-validation estimate is
\[ \widehat{\text{MSE}}_{CV} = \frac{1}{K} \sum_{k=1}^K \widehat{\text{MSE}}_k. \]
If (K=n), this becomes leave-one-out cross-validation (LOOCV).
A single train/test split can be unstable.
K-fold cross-validation improves stability by averaging prediction error across multiple splits.
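The fold assignment and the averaging in the formula above can be sketched as follows (synthetic data from an assumed model y = 2x + noise; simple linear regression stands in for the model being assessed):

```python
import random

# Synthetic data for the demonstration.
random.seed(3)
n, K = 60, 5
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

def fit_slr(x, y):
    """Least-squares intercept and slope of y on x."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
         sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1

# Randomly divide the indices into K folds.
idx = list(range(n))
random.shuffle(idx)
folds = [idx[k::K] for k in range(K)]

fold_mses = []
for k in range(K):
    test = set(folds[k])                              # fold k is the test set
    tr_x = [xs[i] for i in range(n) if i not in test]  # other K-1 folds train
    tr_y = [ys[i] for i in range(n) if i not in test]
    b0, b1 = fit_slr(tr_x, tr_y)
    fold_mses.append(sum((ys[i] - (b0 + b1 * xs[i])) ** 2
                         for i in test) / len(test))

cv_mse = sum(fold_mses) / K        # average the K fold errors
print(f"{K}-fold CV estimate of test MSE: {cv_mse:.3f}")
```

Averaging over the K folds is what stabilizes the estimate relative to a single split.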
18.7 Information Criteria: AIC and BIC
Besides prediction-based approaches, model selection can also be based on information criteria.
Two common criteria are:
- Akaike’s Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
18.7.1 Main idea
Both AIC and BIC try to balance:
- goodness of fit
- model complexity
A model with more parameters may fit the data better, but it is also penalized for being more complex.
18.7.2 Interpretation
- Smaller AIC is better
- Smaller BIC is better
In general:
- AIC applies a penalty of 2 per parameter, so it tends to favor somewhat larger models
- BIC applies a penalty of ln(n) per parameter, so it penalizes complexity more strongly and tends to favor smaller models
If two models are fitted to the same data, the one with the smaller AIC or BIC is usually preferred.
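For Gaussian linear models, one common form of the criteria (up to an additive constant that is the same for all models on the same data) is AIC = n ln(RSS/n) + 2p and BIC = n ln(RSS/n) + p ln(n), where p is the number of estimated coefficients. The sketch below is illustrative: synthetic data where only x1 truly matters, with x2 an assumed irrelevant predictor.

```python
import math
import random

# Synthetic data: y depends on x1 only; x2 is pure noise.
random.seed(4)
n = 50
x1 = [random.uniform(0, 10) for _ in range(n)]
x2 = [random.uniform(0, 10) for _ in range(n)]
y  = [3 + 2 * a + random.gauss(0, 1) for a in x1]

def rss_slr(x, y):
    """Residual sum of squares of least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
         sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

def aic(rss, p):
    return n * math.log(rss / n) + 2 * p

def bic(rss, p):
    return n * math.log(rss / n) + p * math.log(n)

rss0 = sum((b - sum(y) / n) ** 2 for b in y)   # intercept-only model
for name, rss, p in [("intercept only", rss0, 1),
                     ("y ~ x1", rss_slr(x1, y), 2),
                     ("y ~ x2", rss_slr(x2, y), 2)]:
    print(f"{name:15s} AIC = {aic(rss, p):7.1f}, BIC = {bic(rss, p):7.1f}")
```

Both criteria single out y ~ x1: its extra parameter is rewarded because it reduces RSS by far more than the penalty, while x2 is not.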
18.8 PROC GLMSELECT
In SAS, one important procedure for model selection is PROC GLMSELECT.
It is designed for model selection in general linear models, especially when there are many candidate effects.
18.8.1 Why PROC GLMSELECT is useful
It provides flexibility for:
- large numbers of candidate predictors
- interaction terms
- classification effects
- training / validation / testing partitions
- cross-validation
- different selection criteria and stopping rules
18.8.2 Common selection methods
Some common methods include:
- FORWARD
- BACKWARD
- STEPWISE
- LASSO
18.9 Example: Model Selection with a Toy Dataset
To make the ideas concrete, consider a small dataset with a practical meaning.
Suppose we want to predict a student’s final exam score using:
- hours = weekly study hours
- attendance = class attendance percentage
- homework = homework average
- sleep = average hours of sleep per night
- parttime = part-time work hours per week
Our goal is to decide which predictors should be included in the model.
18.10 Step 1: Create the Dataset
DATA studentperf;
INPUT hours attendance homework sleep parttime final;
DATALINES;
5 92 88 7.5 5 84
8 95 91 7.0 2 90
3 85 80 6.5 12 72
10 98 94 7.2 0 95
6 90 86 8.0 8 83
4 88 78 6.8 15 74
9 96 93 7.1 4 92
7 91 89 7.4 6 87
2 80 75 6.0 18 68
11 99 96 7.3 0 97
1 78 70 5.8 20 63
6 89 84 7.6 10 81
8 94 90 7.0 3 89
5 87 82 6.9 14 77
9 97 95 7.8 1 94
3 83 76 6.4 16 70
7 92 88 7.1 7 86
4 86 79 6.7 13 75
10 98 97 7.5 0 96
2 81 73 6.2 17 67
;
RUN;
18.10.1 Variable meanings
- final = final exam score
- hours = study hours per week
- attendance = attendance percentage
- homework = homework average
- sleep = average sleep per night
- parttime = work hours per week
18.11 Why This Is a Model Selection Problem
We have several candidate predictors, but not all of them may be equally useful.
Questions include:
- Do we need all predictors?
- Which predictors improve prediction?
- Can a simpler model perform as well as a larger model?
18.12 Step 2: Fit a Full Model
PROC REG DATA=studentperf;
MODEL final = hours attendance homework sleep parttime;
RUN;
QUIT;
18.12.1 Discussion
This model includes all candidate predictors.
Questions to ask:
- Are all predictors significant?
- Are some predictors redundant?
- Can a smaller model perform similarly?
A full model is not always the best model.
Some predictors may add complexity without adding much predictive value.
18.13 Step 3: Use PROC GLMSELECT with Cross-Validation
PROC GLMSELECT DATA=studentperf PLOTS=ALL;
MODEL final = hours attendance homework sleep parttime
/ SELECTION=STEPWISE
CHOOSE=CV
CVMETHOD=RANDOM(5);
RUN;
18.13.1 What this does
- SELECTION=STEPWISE asks SAS to perform stepwise model selection
- CHOOSE=CV tells SAS to choose the model using cross-validation
- CVMETHOD=RANDOM(5) uses 5-fold cross-validation with random fold assignment
This helps us select a model that balances:
- prediction accuracy
- model simplicity
18.14 Step 4: Alternative Selection with AIC
You can also use AIC as the selection criterion.
PROC GLMSELECT DATA=studentperf PLOTS=ALL;
MODEL final = hours attendance homework sleep parttime
/ SELECTION=FORWARD
CHOOSE=AIC;
RUN;
18.14.1 Interpretation
This version uses:
- SELECTION=FORWARD performs forward selection
- CHOOSE=AIC picks the model with the smallest AIC among those visited
So this is still model selection, but based on a different criterion.
18.15 How to Read the ASE Plot
When you use PLOTS=ALL, one useful graph is the ASE plot.
The ASE plot helps compare:
- training error
- validation or cross-validation error
18.15.1 Interpretation
- Training error usually decreases as the model becomes more complex
- Validation or test error may decrease at first, then increase
- If validation error starts increasing, the model may be overfitting
- The best model is often near the minimum of the validation or test ASE curve
A smaller training error does not always mean a better model.
For prediction, validation or test error is more informative.
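The shape of these curves can be reproduced outside SAS. The stdlib-only sketch below uses synthetic data (the quadratic truth, noise level, and polynomial degrees are illustrative assumptions): training MSE keeps falling as the polynomial degree grows, while hold-out MSE typically bottoms out near the true complexity.

```python
import random

random.seed(5)

def f(x):                               # assumed true (quadratic) signal
    return 1 + x - 0.5 * x * x

xs = [random.uniform(-2, 2) for _ in range(30)]    # training data
ys = [f(x) + random.gauss(0, 0.5) for x in xs]
vx = [random.uniform(-2, 2) for _ in range(200)]   # hold-out data
vy = [f(x) + random.gauss(0, 0.5) for x in vx]

def polyfit(x, y, d):
    """Least-squares polynomial coefficients via the normal equations."""
    m = d + 1
    A = [[sum(xi ** (i + j) for xi in x) for j in range(m)] for i in range(m)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(m)]
    for col in range(m):                # Gaussian elimination with pivoting
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            t = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= t * A[col][c]
            b[r] -= t * b[col]
    coef = [0.0] * m                    # back substitution
    for i in range(m - 1, -1, -1):
        coef[i] = (b[i] - sum(A[i][j] * coef[j]
                              for j in range(i + 1, m))) / A[i][i]
    return coef

def mse(coef, x, y):
    return sum((yi - sum(c * xi ** i for i, c in enumerate(coef))) ** 2
               for xi, yi in zip(x, y)) / len(x)

train_err, valid_err = {}, {}
for d in (1, 2, 6):                     # underfit, about right, overfit
    coef = polyfit(xs, ys, d)
    train_err[d], valid_err[d] = mse(coef, xs, ys), mse(coef, vx, vy)
    print(f"degree {d}: train MSE = {train_err[d]:.3f}, "
          f"hold-out MSE = {valid_err[d]:.3f}")
```

Degree 1 underfits (both errors high); increasing the degree always lowers the training error, but only the hold-out error reveals when added flexibility stops helping.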
18.16 In-Class Questions
Why is training MSE usually too optimistic as an estimate of prediction error?
What is the main purpose of K-fold cross-validation?
If a more complex model has lower training error but higher test error, what problem is occurring?
What does a smaller AIC or BIC usually indicate?
Why is PROC GLMSELECT useful when there are many candidate predictors and interactions?
Which variables seem most useful for predicting final exam score in this example?
18.17 Teaching Interpretation of the Toy Example
This example has a natural real-world meaning:
- More study hours may improve exam performance
- Better attendance may help
- Strong homework performance may indicate understanding
- Too many part-time work hours may reduce study time
- Sleep may matter, but perhaps less strongly than the others
This makes it easier to think about:
- variable importance
- model simplicity
- prediction versus interpretation
18.18 What to Remember from This Lecture
The central question in model selection is:
Which model gives the best balance between fit, prediction accuracy, and simplicity?
There are two major perspectives:
- Prediction-based selection
- training/test split
- validation data
- cross-validation
- Criterion-based selection
- AIC
- BIC
In SAS, PROC GLMSELECT provides a practical way to carry out model selection with both classical and modern options.
- Model selection is about choosing a model that is useful, not just complicated
- Prediction error is often more important than training error
- Cross-validation helps estimate out-of-sample performance
- AIC and BIC balance fit and complexity
- PROC GLMSELECT is a powerful SAS tool for model selection
- Overfitting occurs when a model fits training data well but performs poorly on new data