10  Week 9 — Bayesian Model Averaging and Ensemble Learning

This week introduces Bayesian Model Averaging (BMA), a principled framework to combine inferences from multiple Bayesian models, and contrasts it with ensemble methods common in machine learning.
We discuss model uncertainty, predictive averaging, and practical implementations for linear regression and classification.


10.1 Learning Goals

By the end of this week, you should be able to:

  • Explain the motivation for Bayesian Model Averaging.
  • Derive model-averaged predictions using posterior model probabilities.
  • Compare BMA with frequentist model selection and ML ensembles.
  • Implement BMA for simple regression models in R.
  • Discuss advantages and limitations of Bayesian model combination.

10.2 Lecture 1 — Bayesian Model Averaging (BMA)

10.2.1 1.1 Model Uncertainty

Model selection often ignores uncertainty about which model is true.
BMA accounts for this by averaging over all candidate models weighted by their posterior probabilities.

For models \(M_1, \ldots, M_K\): \[ p(M_k \mid y) = \frac{p(y \mid M_k)\,p(M_k)}{\sum_{j=1}^K p(y \mid M_j)\,p(M_j)}. \]

Here:

  • \(p(y \mid M_k)\) = marginal likelihood of the data under model \(M_k\).
  • \(p(M_k)\) = prior model probability.
  • \(p(M_k \mid y)\) = posterior model probability.
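As a minimal numerical sketch (the log marginal likelihoods below are made-up values for illustration, with a uniform model prior), the posterior model probabilities are best computed on the log scale to avoid underflow:

# Hypothetical log marginal likelihoods for K = 3 candidate models
log_ml    <- c(M1 = -152.3, M2 = -149.8, M3 = -160.1)
log_prior <- log(rep(1/3, 3))               # uniform prior p(M_k) = 1/K

# Posterior model probabilities, stabilized by subtracting the max
log_post  <- log_ml + log_prior
log_post  <- log_post - max(log_post)
post_prob <- exp(log_post) / sum(exp(log_post))
round(post_prob, 3)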


10.2.2 1.2 Model-Averaged Posterior and Predictions

Posterior distribution for a parameter \(\theta\) that retains the same interpretation across models: \[ p(\theta \mid y) = \sum_{k=1}^K p(\theta \mid y, M_k)\,p(M_k \mid y). \]

Posterior predictive distribution: \[ p(\tilde{y} \mid y) = \sum_{k=1}^K p(\tilde{y} \mid y, M_k)\,p(M_k \mid y). \]

BMA integrates out model uncertainty rather than conditioning on a single “best” model.
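Operationally, the BMA predictive is a finite mixture: draw a model index with probability \(p(M_k \mid y)\), then draw \(\tilde{y}\) from that model's predictive. A minimal sketch with two models whose predictive distributions are, for illustration only, taken to be known Gaussians:

# Sketch: draws from a BMA posterior predictive mixture.
# The predictive means/sds below are made-up stand-ins for
# each model's actual posterior predictive at a new input.
post_prob <- c(0.2, 0.8)    # p(M_k | y), illustrative
pred_mean <- c(1.5, 2.1)    # predictive mean under each model
pred_sd   <- c(1.0, 0.9)    # predictive sd under each model

n_draws <- 5000
k <- sample(1:2, n_draws, replace = TRUE, prob = post_prob)  # model index
y_tilde <- rnorm(n_draws, mean = pred_mean[k], sd = pred_sd[k])
mean(y_tilde)  # approx. sum_k p(M_k | y) * E[y~ | y, M_k]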


10.2.3 1.3 Comparison with Model Selection

Approach           Key Idea                                                Limitation
Model selection    Choose one best model (e.g., by AIC, WAIC, LOO)         Ignores model uncertainty
Model averaging    Combine all models, weighted by posterior probability   Computationally heavier; sensitive to priors

10.2.4 1.4 Example — Two Competing Linear Models

set.seed(9)
n <- 100
x <- rnorm(n)
y <- 1 + 2*x + 0.5*x^2 + rnorm(n, sd=1)

m1 <- lm(y ~ x)            # linear model (misspecified)
m2 <- lm(y ~ x + I(x^2))   # quadratic model (matches the truth)

# Crude approximation: treat -AIC/2 as a log marginal likelihood.
# (-BIC/2 is the more common large-sample approximation; both are rough.)
log_marglik1 <- -AIC(m1)/2
log_marglik2 <- -AIC(m2)/2

# Normalize on the log scale (subtract the max) to avoid underflow
log_ml <- c(log_marglik1, log_marglik2)
p_unnorm <- exp(log_ml - max(log_ml))

w1 <- p_unnorm[1] / sum(p_unnorm)
w2 <- p_unnorm[2] / sum(p_unnorm)

pred1 <- predict(m1)
pred2 <- predict(m2)
bma_pred <- w1*pred1 + w2*pred2   # model-averaged fitted values

c(weights = c(M1 = w1, M2 = w2))
  weights.M1   weights.M2 
1.426496e-08 1.000000e+00 
plot(x, y, pch=19, col="#00000055", main="Bayesian Model Averaging (Linear vs Quadratic)",
     xlab="x", ylab="y")
xs <- seq(min(x), max(x), length.out=200)
lines(xs, predict(m1, newdata=data.frame(x=xs)), col="steelblue", lwd=2)
lines(xs, predict(m2, newdata=data.frame(x=xs)), col="firebrick", lwd=2)
lines(xs, w1*predict(m1, newdata=data.frame(x=xs)) +
          w2*predict(m2, newdata=data.frame(x=xs)),
      col="darkgreen", lwd=3, lty=2)
legend("topleft", legend=c("Model 1 (linear)","Model 2 (quadratic)","BMA prediction"),
       col=c("steelblue","firebrick","darkgreen"), lwd=c(2,2,3), lty=c(1,1,2), bty="n")

Figure: Model-averaged predictions vs. data.

Interpretation: The model-averaged prediction weights each curve by its posterior support; here essentially all weight falls on the quadratic model, so the BMA curve coincides with it.


10.2.5 1.5 Advantages of BMA

  • Incorporates model uncertainty directly.
  • Avoids overconfidence from single-model conditioning.
  • Often improves predictive performance, especially in small samples where model uncertainty is greatest.
  • Provides model weights interpretable as probabilities.

10.2.6 1.6 Limitations

  • Requires marginal likelihoods (often hard to compute).
  • Sensitive to model priors and parameter priors.
  • Computationally expensive for many models.

10.3 Lecture 2 — Bayesian Ensembles and Predictive Stacking

10.3.1 2.1 Beyond BMA: Ensemble Learning

Machine learning often uses ensembles (e.g., bagging, boosting, stacking) to improve prediction.
Bayesian analogues combine predictive distributions rather than point estimates.


10.3.2 2.2 Predictive Stacking

Rather than using posterior model probabilities, stacking optimizes weights to maximize predictive performance under cross-validation: \[ w^* = \arg\max_{w} \sum_{i=1}^n \log\left(\sum_k w_k\, p(y_i \mid y_{-i}, M_k)\right), \] subject to \(w_k \ge 0\) and \(\sum_k w_k = 1\).

This yields stacking weights that combine models for best out-of-sample prediction.
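A minimal sketch of this optimization, assuming an n x K matrix lpd of leave-one-out log predictive densities (filled with made-up values here); the weights are parameterized through a softmax so that unconstrained optim() can enforce the simplex constraint:

# Sketch: stacking weights from LOO log predictive densities.
# lpd[i, k] stands for log p(y_i | y_{-i}, M_k); made-up here.
set.seed(1)
lpd <- cbind(rnorm(50, mean = -1.5, sd = 0.3),
             rnorm(50, mean = -1.2, sd = 0.3))

neg_stack_obj <- function(a, lpd) {
  w <- exp(a) / sum(exp(a))              # softmax: w_k >= 0, sum_k w_k = 1
  -sum(log(as.vector(exp(lpd) %*% w)))   # negative log score (optim minimizes)
}
fit <- optim(rep(0, ncol(lpd)), neg_stack_obj, lpd = lpd)
w_star <- exp(fit$par) / sum(exp(fit$par))
round(w_star, 3)

In practice loo_model_weights() (used below) performs this optimization with a more careful solver; the sketch only makes the objective concrete.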


10.3.3 2.3 Example — Predictive Stacking with loo

library(brms)
library(loo)

set.seed(10)
dat <- data.frame(x = rnorm(200))
dat$y <- 1 + 2*dat$x + 0.5*dat$x^2 + rnorm(200)

m1 <- brm(y ~ x, data=dat, refresh=0)
m2 <- brm(y ~ x + I(x^2), data=dat, refresh=0)

loo1 <- loo(m1)
loo2 <- loo(m2)

# Stacking and pseudo-BMA weights from the LOO objects computed above
w_stack <- loo_model_weights(list(loo1, loo2), method="stacking")
w_pseudo <- loo_model_weights(list(loo1, loo2), method="pseudobma")

w_stack
w_pseudo

Interpretation:

  • Stacking weights directly optimize out-of-sample predictive log density.
  • Pseudo-BMA provides a simpler approximation based on each model's LOO (or WAIC) score.
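Once the weights are available, a stacked point prediction can be formed by weighting each model's posterior-mean prediction. A minimal sketch continuing the fits above (posterior_epred() returns posterior draws of the conditional mean in brms; stacking the full predictive distribution would instead mix posterior_predict() draws):

# Sketch: stacked posterior-mean predictions on a grid of new x values
w <- as.numeric(w_stack)               # stacking weights as plain numbers
newdat <- data.frame(x = seq(-3, 3, length.out = 100))
mu1 <- colMeans(posterior_epred(m1, newdata = newdat))  # E[y | x], model 1
mu2 <- colMeans(posterior_epred(m2, newdata = newdat))  # E[y | x], model 2
mu_stack <- w[1] * mu1 + w[2] * mu2    # weighted combination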


10.3.4 2.4 Comparison: BMA vs Stacking

Feature            Bayesian Model Averaging         Predictive Stacking
Weights            Posterior model probabilities    Optimized predictive weights
Goal               Represent model uncertainty      Maximize predictive performance
Computation        Needs marginal likelihoods       Uses cross-validation
Prior dependence   Sensitive                        Weak or none
Typical use        Theoretical coherence            Practical prediction

10.3.5 2.5 Ensemble Prediction Example

set.seed(11)
n <- 100
x <- rnorm(n)
y_true <- 2 + 3*x - 1.5*x^2
y <- y_true + rnorm(n, sd=2)

m1 <- lm(y ~ x)
m2 <- lm(y ~ poly(x, 2, raw=TRUE))

pred_grid <- seq(min(x), max(x), length.out=200)
p1 <- predict(m1, newdata=data.frame(x=pred_grid))
p2 <- predict(m2, newdata=data.frame(x=pred_grid))

# Ad hoc ensemble weights, fixed here for illustration
# (a data-driven alternative is sketched after this example)
w1 <- 0.3; w2 <- 0.7
p_ens <- w1*p1 + w2*p2

plot(x, y, pch=19, col="#00000055", main="Model Ensemble Prediction",
     xlab="x", ylab="y")
lines(pred_grid, p1, col="blue", lwd=2)
lines(pred_grid, p2, col="red", lwd=2)
lines(pred_grid, p_ens, col="darkgreen", lwd=3, lty=2)
legend("topleft", legend=c("Model 1","Model 2","Ensemble"),
       col=c("blue","red","darkgreen"), lwd=c(2,2,3), lty=c(1,1,2), bty="n")


10.3.6 2.6 Practical Guidance

  • Use BMA when posterior model probabilities are available (few models, interpretable priors).
  • Use stacking or ensemble averaging when prediction accuracy is the goal.
  • Avoid double counting data — always base weights on held-out or cross-validation predictive performance.

10.4 Homework 9

  1. Conceptual
    • Explain how BMA differs from model selection.
    • Why does stacking avoid prior sensitivity found in BMA?
  2. Computational
    • Simulate data where two Bayesian regression models compete.
    • Fit both models in R (e.g., using brms or lm).
    • Compute stacking and pseudo-BMA weights using loo_model_weights().
    • Compare model-averaged predictions to the true curve.
  3. Reflection
    • Discuss when BMA and stacking might give very different results.
    • How can model averaging improve scientific interpretability?

10.5 Key Takeaways

Concept                    Summary
Bayesian Model Averaging   Combines models weighted by posterior probabilities.
Predictive Stacking        Chooses weights that maximize predictive accuracy via cross-validation.
Model Uncertainty          Accounted for rather than ignored.
Practical Use              BMA for interpretability; stacking for prediction.
Modern Tools               loo_model_weights() in R provides both stacking and pseudo-BMA weights.

Next Week: Bayesian Nonparametrics — infinite-dimensional models such as Dirichlet processes and Gaussian processes for flexible Bayesian modeling.