21 Introduction to Machine Learning in SAS
Learning Objectives
By the end of this lecture, you should be able to:
- Explain what machine learning means in a predictive modelling context.
- Distinguish between supervised and unsupervised learning.
- Describe the basic ideas behind decision trees, random forests, and regularization.
- Compare classical statistical models with more flexible machine learning methods.
- Implement basic machine learning workflows in SAS for prediction and model comparison.
21.1 Introduction
In the previous lecture, we introduced predictive modelling and emphasized the importance of evaluating models using out-of-sample performance. That lecture served as a bridge from classical statistical modelling to modern prediction.
In this lecture, we take one more step and ask:
- What do people mean by machine learning (ML)?
- How is it related to regression and classification?
- What kinds of predictive methods are available in SAS?
- When might a more flexible model be useful?
The goal of this lecture is not to turn this course into a full machine learning course. Instead, the goal is to introduce several important ideas in a way that is connected to the statistical models you already know.
21.2 What Is ML?
There is no single perfect definition of machine learning, but in many practical settings it refers to:
using algorithms that learn patterns from data for prediction, classification, or discovery.
Compared with traditional statistical modelling, machine learning often places stronger emphasis on:
- prediction,
- flexibility,
- automatic pattern detection,
- performance on unseen data.
21.2.1 Important Note
Machine learning is not separate from statistics. Many machine learning methods are extensions of ideas you already know:
- regression,
- classification,
- optimization,
- variable selection,
- bias-variance trade-off.
In many real applications, the boundary between statistics and machine learning is not sharp.
21.3 Supervised vs Unsupervised Learning
A useful first distinction is between supervised learning and unsupervised learning.
Supervised Learning
In supervised learning, we observe:
- predictors \(X\),
- a response \(Y\).
The goal is to learn a rule that predicts \(Y\) from \(X\). Examples include:
- predicting house price from location and size,
- predicting whether a patient has a disease,
- predicting whether an email is spam.
Typical supervised learning methods include:
- linear regression,
- logistic regression,
- decision trees,
- random forests,
- support vector machines,
- neural networks.
Unsupervised Learning
In unsupervised learning, we observe predictors \(X\) but no response variable.
The goal is to discover structure in the data. Examples include:
- grouping customers into segments,
- detecting clusters,
- dimension reduction.
Typical unsupervised learning methods include:
- clustering,
- principal component analysis.
21.3.1 Focus of This Lecture
We focus mainly on supervised learning, because it connects directly to predictive modelling and to many SAS procedures.
21.4 Classical Models vs ML Models
Classical statistical models often emphasize:
- interpretability,
- effect estimation,
- statistical significance,
- model assumptions.
ML methods often emphasize:
- predictive accuracy,
- flexibility,
- automatic adaptation to nonlinear patterns,
- performance on new data.
Comparison
| Perspective | Classical Statistical Modelling | Machine Learning |
|---|---|---|
| Main goal | explanation and inference | prediction |
| Focus | parameter interpretation | predictive performance |
| Model form | often specified in advance | often more flexible |
| Evaluation | significance, AIC, BIC | test error, accuracy, AUC |
21.4.1 Important Message
This is a difference in emphasis, not a strict separation. For example:
- linear regression can be used as a predictive model,
- machine learning models can sometimes still be interpreted.
21.5 Why Flexible Models Can Help
A linear model assumes a relatively simple relationship between the response and predictors.
But in real data, relationships may:
- be nonlinear,
- involve interactions,
- vary across different parts of the predictor space,
- be too complex to describe with a simple formula.
More flexible models can adapt to such patterns.
However, flexibility comes with a cost:
- a flexible model may overfit,
- interpretation may become harder,
- tuning becomes more important.
This is why we must always evaluate models on new data.
21.6 Decision Trees
One of the most intuitive machine learning methods is the decision tree.
21.6.1 Basic Idea
A decision tree splits the predictor space into regions by asking a sequence of questions such as:
- Is age < 40?
- Is income > 60,000?
- Is blood pressure > 130?
For regression, each terminal node gives a predicted numeric value.
For classification, each terminal node gives a predicted class or class probability.
21.6.2 Why Students Usually Like Trees
Decision trees are appealing because they are:
- easy to visualize,
- easy to explain,
- capable of capturing nonlinear relationships,
- capable of handling interactions automatically.
A decision tree for loan default might work like this:
- Split first on income.
- Among lower-income individuals, split on credit score.
- Among higher-income individuals, split on debt level.
The final prediction depends on the path through the tree.
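To make the path idea concrete, the sketch below writes such a tree as nested DATA step logic. The dataset, variables, thresholds, and probabilities are purely hypothetical illustrations, not output from any fitted model:

```sas
/* Hypothetical sketch only: a small loan-default tree expressed as
   nested IF/ELSE logic. Names and cutoffs are illustrative. */
DATA loan_pred;
    SET loans;
    IF income < 50000 THEN DO;
        IF credit_score < 650 THEN p_default = 0.40; /* low income, low score  */
        ELSE p_default = 0.15;                       /* low income, high score */
    END;
    ELSE DO;
        IF debt_ratio > 0.35 THEN p_default = 0.20;  /* high income, high debt */
        ELSE p_default = 0.05;                       /* high income, low debt  */
    END;
RUN;
```

Each observation follows exactly one path through the conditions and receives the prediction stored at the terminal node it reaches.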
Strengths and Weaknesses of Trees
Strengths:
- easy to interpret,
- naturally capture interactions,
- do not require linearity,
- useful for both regression and classification.

Weaknesses:
- unstable: small changes in the data may lead to a very different tree,
- a single tree may not predict as accurately as ensemble methods.
This motivates methods such as bagging and random forests.
21.7 Random Forests and Ensemble Thinking
A random forest is an ensemble method built from many decision trees.
21.7.1 Main Idea
Instead of relying on one tree, we build many trees using random variation in the data and predictors, then combine their predictions.
This often improves predictive performance because:
- a single tree may be noisy or unstable,
- averaging many trees reduces variability,
- the combined model is often more robust.
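A rough calculation shows why averaging helps. If each tree's prediction has variance \(\sigma^2\) and the trees have pairwise correlation \(\rho\), the variance of the average of \(B\) trees is:
\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x)\right)
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2.
\]
As \(B\) grows, the second term vanishes but the first does not. This is why random forests also randomize the predictors considered at each split: less correlated trees mean a smaller \(\rho\) and a lower variance floor.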
21.7.2 Why Random Forests Work Well
Random forests can:
- model nonlinear relationships,
- handle many predictors,
- capture interactions,
- provide strong predictive accuracy in many settings.
21.7.3 Trade-off
Compared with a single tree, a random forest is usually:
- more accurate,
- less interpretable.
This is a common pattern in ML: better predictive performance often comes at the cost of interpretability.
21.8 Regularization and High-Dimensional Prediction
Another important machine learning idea is regularization.
21.8.1 Motivation
Suppose we have many predictors.
Problems may include:
- overfitting,
- unstable coefficient estimates,
- multicollinearity,
- poor test performance.
Regularization adds a penalty that discourages overly complex models.
21.8.2 Example: LASSO
LASSO regression estimates coefficients by balancing:
- goodness of fit,
- model complexity.
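Formally, for a tuning parameter \(\lambda \ge 0\), the LASSO estimate solves:
\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 \;+\; \lambda \sum_{j=1}^{p}\left|\beta_j\right|.
\]
The first term measures goodness of fit; the second penalizes the total size of the coefficients, with \(\lambda\) controlling the strength of the penalty.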
As a result, some coefficients may be shrunk toward zero, and some may become exactly zero.
This makes LASSO useful for:
- variable selection,
- prediction when there are many predictors,
- controlling overfitting.
21.8.3 Big Picture
Regularization helps us move from classical regression toward modern predictive modelling in a principled way.
21.9 A Simple Machine Learning Workflow in SAS
A practical ML workflow in SAS looks like this:
- split the data into training and test sets,
- fit multiple candidate models,
- score the models on the test set,
- compare predictive performance,
- choose a model based on out-of-sample results.
This is similar to the workflow from the previous lecture, but now we may compare more flexible models.
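One common way to create the split in SAS is PROC SURVEYSELECT. The sketch below assumes a dataset named mydata; the sampling rate and seed are illustrative:

```sas
/* Sketch of a 70/30 train-test split on a dataset named mydata.
   OUTALL keeps every row and adds a 0/1 Selected flag instead of
   outputting only the sampled rows. */
PROC SURVEYSELECT DATA=mydata OUT=split SAMPRATE=0.7
                  OUTALL SEED=12345;
RUN;

DATA train test;
    SET split;
    IF Selected THEN OUTPUT train;
    ELSE OUTPUT test;
RUN;
```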
21.10 SAS Example: Decision Tree
One useful SAS procedure for tree-based methods is PROC HPSPLIT.
21.10.1 Example: Classification Tree
```sas
PROC HPSPLIT DATA=train;
    CLASS response;
    MODEL response = X1 X2 X3 X4 X5;
    GROW GINI;
    PRUNE COSTCOMPLEXITY;
    CODE FILE='tree_score.sas';
RUN;
```
21.10.2 Explanation
- `GROW GINI` is a common splitting rule for classification.
- `PRUNE COSTCOMPLEXITY` reduces overfitting by simplifying the tree.
- `CODE FILE=` creates scoring code that can be applied to new data.
21.10.3 Scoring the Test Set
```sas
DATA tree_scored;
    SET test;
    %INCLUDE 'tree_score.sas';
RUN;
```
After scoring, we can compare predicted classes to observed classes.
21.11 SAS Example: Random Forest
A useful SAS procedure for random forests is PROC HPFOREST.
```sas
PROC HPFOREST DATA=train;
    TARGET response / LEVEL=nominal;
    INPUT X1 X2 X3 X4 X5 / LEVEL=interval;
    SCORE OUT=forest_train;
RUN;
```
Depending on your SAS environment, you may also use procedure-specific scoring output for validation or test datasets.
21.11.1 Interpretation
The random forest builds many trees and aggregates their predictions. In practice, this often gives better predictive accuracy than a single tree.
21.12 SAS Example: Regularization with GLMSELECT
SAS also provides tools for regression-based model selection and shrinkage.
A useful procedure is PROC GLMSELECT.
```sas
PROC GLMSELECT DATA=train;
    MODEL Y = X1 X2 X3 X4 X5 / SELECTION=LASSO;
    PARTITION FRACTION(TEST=0.3);
RUN;
```
21.12.1 Why This Is Useful
This procedure illustrates an important idea:
- classical regression and machine learning are connected,
- variable selection can be handled in a prediction-oriented framework,
- model complexity can be controlled automatically.
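As a sketch of a more explicitly prediction-oriented setup, validation error can choose the LASSO step and a SCORE statement can apply the chosen model to a separate test set. The CHOOSE=VALIDATE option and the dataset names below are illustrative assumptions:

```sas
/* Sketch: pick the LASSO step with the smallest validation error,
   then score a held-out test set. */
PROC GLMSELECT DATA=train;
    PARTITION FRACTION(VALIDATE=0.3);
    MODEL Y = X1 X2 X3 X4 X5 / SELECTION=LASSO(CHOOSE=VALIDATE);
    SCORE DATA=test OUT=lasso_scored PREDICTED;
RUN;
```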
21.13 Comparing Models
One of the most important lessons in ML is that we should compare models using the same evaluation framework.
For example, we might compare:
- logistic regression,
- a classification tree,
- a random forest.
21.13.1 Questions to Ask
- Which method has the highest test accuracy?
- Which method gives the best balance between performance and interpretability?
- Is the improvement in prediction large enough to justify a more complex model?
21.13.2 Example Discussion
A logistic regression model may be:
- easier to interpret,
- easier to communicate,
- better when the relationship is approximately linear.
A tree-based model may be:
- better at capturing nonlinear relationships,
- better at capturing interactions,
- harder to summarize with a few coefficients.
A random forest may be:
- even more accurate,
- but much less transparent.
21.14 Bias-Variance Trade-off
A central concept in machine learning is the bias-variance trade-off.
21.14.1 Informal Idea
- a very simple model may be too rigid and miss important patterns,
- a very flexible model may follow noise too closely.
So good prediction often requires balancing:
- bias: error from being too simple,
- variance: error from being too sensitive to the sample.
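If \(Y = f(x) + \varepsilon\) with noise variance \(\sigma^2\), this balance can be written explicitly for squared-error prediction at a fixed point \(x_0\):
\[
E\!\left[\left(Y - \hat{f}(x_0)\right)^2\right]
= \underbrace{\left(E[\hat{f}(x_0)] - f(x_0)\right)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}\!\left(\hat{f}(x_0)\right)}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}.
\]
Flexible models typically reduce the bias term while increasing the variance term; good prediction means minimizing the sum.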
21.14.2 Connection to Earlier Topics
This idea is closely related to:
- overfitting,
- model selection,
- regularization,
- test-set evaluation.
In other words, machine learning builds on ideas that are already present in statistics.
21.15 Why This Matters for SAS Users
Students sometimes think machine learning is something completely separate that requires a different software ecosystem.
That is not true.
SAS supports important ML workflows, including:
- data splitting,
- regression-based prediction,
- tree-based methods,
- forests and ensemble methods,
- regularization and model comparison.
This means the tools you have learned in SAS can be extended naturally into modern predictive analytics.
21.16 Practical Advice
When using machine learning methods, remember:
- Do not evaluate performance only on the training set.
- A more complex model is not automatically better.
- Interpretability still matters in many applications.
- Always compare models using a clear test-set or validation framework.
- Good predictive modelling requires both statistical thinking and computational implementation.
21.17 Looking Ahead
This lecture is only an introduction, but it shows how SAS can be used for modern predictive analysis.
After this course, students interested in advanced topics could continue with:
- cross-validation,
- boosting,
- support vector machines,
- neural networks,
- deep learning,
- high-dimensional data analysis.
The key point is that the ideas from this course already provide a strong foundation for these topics.
21.18 Chapter Summary
In this lecture, we introduced machine learning as an extension of predictive modelling.
21.18.1 Main Ideas
- Machine learning emphasizes prediction, flexibility, and out-of-sample performance.
- Supervised learning uses predictors and a response; unsupervised learning looks for structure without a response.
- Decision trees provide intuitive nonlinear prediction rules.
- Random forests improve prediction by aggregating many trees.
- Regularization helps control model complexity and improve prediction.
21.18.2 SAS Skills Introduced
- tree modelling with `PROC HPSPLIT`,
- random forests with `PROC HPFOREST`,
- regularized regression with `PROC GLMSELECT`,
- model comparison using predictive performance.
21.18.3 Final Message
Machine learning does not replace statistics.
It extends statistical modelling when prediction and flexibility become central goals.
21.19 Practice Problems
- Explain the difference between supervised and unsupervised learning.
- Why might a decision tree outperform a linear model in some datasets?
- What is the main advantage of a random forest over a single tree?
- Why does regularization help when many predictors are available?
- Compare logistic regression and a tree-based classifier in terms of interpretability and flexibility.
- In SAS, which procedures introduced in this lecture would you consider for:
- a classification tree,
- a random forest,
- LASSO regression?