10  Week 9 — Bayesian Model Averaging and Ensemble Learning

This week introduces Bayesian Model Averaging (BMA), a principled framework to combine inferences from multiple Bayesian models, and contrasts it with ensemble methods common in machine learning.
We discuss model uncertainty, predictive averaging, and practical implementations for linear regression and classification.


10.1 Learning Goals

By the end of this week, you should be able to:

  • Explain the motivation for Bayesian Model Averaging.
  • Derive model-averaged predictions using posterior model probabilities.
  • Compare BMA with frequentist model selection and ML ensembles.
  • Implement BMA for simple regression models in R.
  • Discuss advantages and limitations of Bayesian model combination.

10.2 Lecture 1 — Bayesian Model Averaging (BMA)

10.2.1 1.1 Model Uncertainty

Model selection often ignores uncertainty about which model is true.
BMA accounts for this by averaging over all candidate models weighted by their posterior probabilities.

For models \(M_1, \ldots, M_K\): \[ p(M_k \mid y) = \frac{p(y \mid M_k)\,p(M_k)}{\sum_{j=1}^K p(y \mid M_j)\,p(M_j)}. \]

Here:

  • \(p(y \mid M_k)\) = marginal likelihood of the data under model \(M_k\).
  • \(p(M_k)\) = prior model probability.
  • \(p(M_k \mid y)\) = posterior model probability.
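As a minimal numerical sketch (the log marginal likelihoods below are made-up values for illustration, with a uniform model prior), the posterior model probabilities are best computed on the log scale to avoid underflow:

# Hypothetical log marginal likelihoods for K = 3 candidate models
log_ml    <- c(M1 = -152.3, M2 = -149.8, M3 = -160.1)
log_prior <- log(rep(1/3, 3))               # uniform prior p(M_k) = 1/K

# Posterior model probabilities, stabilized by subtracting the max
log_post  <- log_ml + log_prior
log_post  <- log_post - max(log_post)
post_prob <- exp(log_post) / sum(exp(log_post))
round(post_prob, 3)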


10.2.2 1.2 Model-Averaged Posterior and Predictions

Posterior distribution for a parameter \(\theta\) that retains the same interpretation across models: \[ p(\theta \mid y) = \sum_{k=1}^K p(\theta \mid y, M_k)\,p(M_k \mid y). \]

Posterior predictive distribution: \[ p(\tilde{y} \mid y) = \sum_{k=1}^K p(\tilde{y} \mid y, M_k)\,p(M_k \mid y). \]

BMA integrates out model uncertainty rather than conditioning on a single “best” model.
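Operationally, the BMA predictive is a finite mixture: draw a model index with probability \(p(M_k \mid y)\), then draw \(\tilde{y}\) from that model's predictive. A minimal sketch with two models whose predictive distributions are, for illustration only, taken to be known Gaussians:

# Sketch: draws from a BMA posterior predictive mixture.
# The predictive means/sds below are made-up stand-ins for
# each model's actual posterior predictive at a new input.
post_prob <- c(0.2, 0.8)    # p(M_k | y), illustrative
pred_mean <- c(1.5, 2.1)    # predictive mean under each model
pred_sd   <- c(1.0, 0.9)    # predictive sd under each model

n_draws <- 5000
k <- sample(1:2, n_draws, replace = TRUE, prob = post_prob)  # model index
y_tilde <- rnorm(n_draws, mean = pred_mean[k], sd = pred_sd[k])
mean(y_tilde)  # approx. sum_k p(M_k | y) * E[y~ | y, M_k]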


10.2.3 1.3 Comparison with Model Selection

Approach           Key Idea                                                Limitation
Model selection    Choose one best model (e.g., by AIC, WAIC, LOO)         Ignores model uncertainty
Model averaging    Combine all models, weighted by posterior probability   Computationally heavier; sensitive to priors

10.2.4 1.4 Example — Two Competing Linear Models

set.seed(9)
n <- 100
x <- rnorm(n)
y <- 1 + 2*x + 0.5*x^2 + rnorm(n, sd=1)

m1 <- lm(y ~ x)            # linear model (misspecified)
m2 <- lm(y ~ x + I(x^2))   # quadratic model (matches the truth)

# Crude approximation: treat -AIC/2 as a log marginal likelihood.
# (-BIC/2 is the more common large-sample approximation; both are rough.)
log_marglik1 <- -AIC(m1)/2
log_marglik2 <- -AIC(m2)/2

# Normalize on the log scale (subtract the max) to avoid underflow
log_ml <- c(log_marglik1, log_marglik2)
p_unnorm <- exp(log_ml - max(log_ml))

w1 <- p_unnorm[1] / sum(p_unnorm)
w2 <- p_unnorm[2] / sum(p_unnorm)

pred1 <- predict(m1)
pred2 <- predict(m2)
bma_pred <- w1*pred1 + w2*pred2   # model-averaged fitted values

c(weights = c(M1 = w1, M2 = w2))
  weights.M1   weights.M2 
1.426496e-08 1.000000e+00 
plot(x, y, pch=19, col="#00000055", main="Bayesian Model Averaging (Linear vs Quadratic)",
     xlab="x", ylab="y")
xs <- seq(min(x), max(x), length.out=200)
lines(xs, predict(m1, newdata=data.frame(x=xs)), col="steelblue", lwd=2)
lines(xs, predict(m2, newdata=data.frame(x=xs)), col="firebrick", lwd=2)
lines(xs, w1*predict(m1, newdata=data.frame(x=xs)) +
          w2*predict(m2, newdata=data.frame(x=xs)),
      col="darkgreen", lwd=3, lty=2)
legend("topleft", legend=c("Model 1 (linear)","Model 2 (quadratic)","BMA prediction"),
       col=c("steelblue","firebrick","darkgreen"), lwd=c(2,2,3), lty=c(1,1,2), bty="n")

Figure: Model-averaged predictions vs. data.

Interpretation: The model-averaged prediction weights each curve by its posterior support; here essentially all weight falls on the quadratic model, so the BMA curve coincides with it.


10.2.5 1.5 Advantages of BMA

  • Incorporates model uncertainty directly.
  • Avoids overconfidence from single-model conditioning.
  • Often improves predictive performance, especially in small samples where model uncertainty is greatest.
  • Provides model weights interpretable as probabilities.

10.2.6 1.6 Limitations

  • Requires marginal likelihoods (often hard to compute).
  • Sensitive to model priors and parameter priors.
  • Computationally expensive for many models.

10.3 Lecture 2 — Bayesian Ensembles and Predictive Stacking

10.3.1 2.1 Beyond BMA: Ensemble Learning

Machine learning often uses ensembles (e.g., bagging, boosting, stacking) to improve prediction.
Bayesian analogues combine predictive distributions rather than point estimates.


10.3.2 2.2 Predictive Stacking

Rather than using posterior model probabilities, stacking optimizes weights to maximize predictive performance under cross-validation: \[ w^* = \arg\max_{w} \sum_{i=1}^n \log\left(\sum_k w_k\, p(y_i \mid y_{-i}, M_k)\right), \] subject to \(w_k \ge 0\) and \(\sum_k w_k = 1\).

This yields stacking weights that combine models for best out-of-sample prediction.
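A minimal sketch of this optimization, assuming an n x K matrix lpd of leave-one-out log predictive densities (filled with made-up values here); the weights are parameterized through a softmax so that unconstrained optim() can enforce the simplex constraint:

# Sketch: stacking weights from LOO log predictive densities.
# lpd[i, k] stands for log p(y_i | y_{-i}, M_k); made-up here.
set.seed(1)
lpd <- cbind(rnorm(50, mean = -1.5, sd = 0.3),
             rnorm(50, mean = -1.2, sd = 0.3))

neg_stack_obj <- function(a, lpd) {
  w <- exp(a) / sum(exp(a))              # softmax: w_k >= 0, sum_k w_k = 1
  -sum(log(as.vector(exp(lpd) %*% w)))   # negative log score (optim minimizes)
}
fit <- optim(rep(0, ncol(lpd)), neg_stack_obj, lpd = lpd)
w_star <- exp(fit$par) / sum(exp(fit$par))
round(w_star, 3)

In practice loo_model_weights() (used below) performs this optimization with a more careful solver; the sketch only makes the objective concrete.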


10.3.3 2.3 Example — Predictive Stacking with loo

library(brms)
library(loo)

set.seed(10)
dat <- data.frame(x = rnorm(200))
dat$y <- 1 + 2*dat$x + 0.5*dat$x^2 + rnorm(200)

m1 <- brm(y ~ x, data=dat, refresh=0)
m2 <- brm(y ~ x + I(x^2), data=dat, refresh=0)

loo1 <- loo(m1)
loo2 <- loo(m2)

# Stacking and pseudo-BMA weights from the LOO objects computed above
w_stack <- loo_model_weights(list(loo1, loo2), method="stacking")
w_pseudo <- loo_model_weights(list(loo1, loo2), method="pseudobma")

w_stack
w_pseudo

Interpretation:

  • Stacking weights directly optimize out-of-sample predictive log density.
  • Pseudo-BMA provides a simpler approximation based on each model's LOO (or WAIC) score.
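Once the weights are available, a stacked point prediction can be formed by weighting each model's posterior-mean prediction. A minimal sketch continuing the fits above (posterior_epred() returns posterior draws of the conditional mean in brms; stacking the full predictive distribution would instead mix posterior_predict() draws):

# Sketch: stacked posterior-mean predictions on a grid of new x values
w <- as.numeric(w_stack)               # stacking weights as plain numbers
newdat <- data.frame(x = seq(-3, 3, length.out = 100))
mu1 <- colMeans(posterior_epred(m1, newdata = newdat))  # E[y | x], model 1
mu2 <- colMeans(posterior_epred(m2, newdata = newdat))  # E[y | x], model 2
mu_stack <- w[1] * mu1 + w[2] * mu2    # weighted combination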


10.3.4 2.4 Comparison: BMA vs Stacking

Feature            Bayesian Model Averaging         Predictive Stacking
Weights            Posterior model probabilities    Optimized predictive weights
Goal               Represent model uncertainty      Maximize predictive performance
Computation        Needs marginal likelihoods       Uses cross-validation
Prior dependence   Sensitive                        Weak or none
Typical use        Theoretical coherence            Practical prediction

10.3.5 2.5 Ensemble Prediction Example

set.seed(11)
n <- 100
x <- rnorm(n)
y_true <- 2 + 3*x - 1.5*x^2
y <- y_true + rnorm(n, sd=2)

m1 <- lm(y ~ x)
m2 <- lm(y ~ poly(x, 2, raw=TRUE))

pred_grid <- seq(min(x), max(x), length.out=200)
p1 <- predict(m1, newdata=data.frame(x=pred_grid))
p2 <- predict(m2, newdata=data.frame(x=pred_grid))

# Ad hoc ensemble weights, fixed here for illustration
# (a data-driven alternative is sketched after this example)
w1 <- 0.3; w2 <- 0.7
p_ens <- w1*p1 + w2*p2

plot(x, y, pch=19, col="#00000055", main="Model Ensemble Prediction",
     xlab="x", ylab="y")
lines(pred_grid, p1, col="blue", lwd=2)
lines(pred_grid, p2, col="red", lwd=2)
lines(pred_grid, p_ens, col="darkgreen", lwd=3, lty=2)
legend("topleft", legend=c("Model 1","Model 2","Ensemble"),
       col=c("blue","red","darkgreen"), lwd=c(2,2,3), lty=c(1,1,2), bty="n")


10.3.6 2.6 Practical Guidance

  • Use BMA when posterior model probabilities are available (few models, interpretable priors).
  • Use stacking or ensemble averaging when prediction accuracy is the goal.
  • Avoid double counting data — always base weights on held-out or cross-validation predictive performance.

10.4 Homework 9

  1. Conceptual
    • Explain how BMA differs from model selection.
    • Why does stacking avoid prior sensitivity found in BMA?
  2. Computational
    • Simulate data where two Bayesian regression models compete.
    • Fit both models in R (e.g., using brms or lm).
    • Compute stacking and pseudo-BMA weights using loo_model_weights().
    • Compare model-averaged predictions to the true curve.
  3. Reflection
    • Discuss when BMA and stacking might give very different results.
    • How can model averaging improve scientific interpretability?

10.5 Key Takeaways

Concept                    Summary
Bayesian Model Averaging   Combines models weighted by posterior probabilities.
Predictive Stacking        Chooses weights that maximize predictive accuracy via cross-validation.
Model Uncertainty          Accounted for rather than ignored.
Practical Use              BMA for interpretability; stacking for prediction.
Modern Tools               loo_model_weights() in R provides both stacking and pseudo-BMA weights.

Next Week: Bayesian Nonparametrics — infinite-dimensional models such as Dirichlet processes and Gaussian processes for flexible Bayesian modeling.