10 Week 9 — Bayesian Model Averaging and Ensemble Learning
This week introduces Bayesian Model Averaging (BMA), a principled framework to combine inferences from multiple Bayesian models, and contrasts it with ensemble methods common in machine learning.
We discuss model uncertainty, predictive averaging, and practical implementations for linear regression and classification.
10.1 Learning Goals
By the end of this week, you should be able to:
Explain the motivation for Bayesian Model Averaging.
Derive model-averaged predictions using posterior model probabilities.
Compare BMA with frequentist model selection and ML ensembles.
Implement BMA for simple regression models in R.
Discuss advantages and limitations of Bayesian model combination.
10.2 Lecture 1 — Bayesian Model Averaging (BMA)
10.2.1 1.1 Model Uncertainty
Model selection often ignores uncertainty about which model is true.
BMA accounts for this by averaging over all candidate models weighted by their posterior probabilities.
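Concretely, for a quantity of interest \(\Delta\) (for example a future observation \(\tilde{y}\)) and candidate models \(M_1, \dots, M_K\), the model-averaged posterior is \[
p(\Delta \mid y) = \sum_{k=1}^{K} p(\Delta \mid y, M_k)\, p(M_k \mid y),
\qquad
p(M_k \mid y) = \frac{p(y \mid M_k)\, p(M_k)}{\sum_{j=1}^{K} p(y \mid M_j)\, p(M_j)},
\] where \(p(y \mid M_k)\) is the marginal likelihood of model \(M_k\).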
Interpretation: The model-averaged prediction blends the strengths of the candidate models, weighted by their posterior support.
10.2.5 1.5 Advantages of BMA
Incorporates model uncertainty directly.
Avoids overconfidence from single-model conditioning.
Improves predictive performance, especially in small samples.
Provides model weights interpretable as probabilities.
10.2.6 1.6 Limitations
Requires marginal likelihoods (often hard to compute).
Sensitive to model priors and parameter priors.
Computationally expensive for many models.
10.3 Lecture 2 — Bayesian Ensembles and Predictive Stacking
10.3.1 2.1 Beyond BMA: Ensemble Learning
Machine learning often uses ensembles (e.g., bagging, boosting, stacking) to improve prediction.
Bayesian analogues combine predictive distributions rather than point estimates.
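To make the distinction concrete: a point-estimate ensemble averages predictions, \(\hat{y} = \sum_k w_k \hat{y}_k\), whereas a Bayesian ensemble averages whole predictive distributions, \[
p(\tilde{y} \mid y) = \sum_k w_k\, p(\tilde{y} \mid y, M_k),
\] so the combined forecast retains each model's predictive uncertainty, not just its mean.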
10.3.2 2.2 Predictive Stacking
Rather than using posterior model probabilities, stacking optimizes weights to maximize predictive performance under cross-validation: \[
w^{*} = \arg\max_{w} \sum_{i=1}^{n} \log\left(\sum_{k} w_k\, p(y_i \mid y_{-i}, M_k)\right),
\] subject to \(w_k \ge 0\) and \(\sum_k w_k = 1\).
This yields stacking weights that combine models for best out-of-sample prediction.
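The optimization can be written out directly. The sketch below uses hypothetical names (in practice loo_model_weights() in the next example does this for you): it takes an n × K matrix lpd_point of leave-one-out log predictive densities and maximizes the stacked log score over the simplex via a softmax parameterization.
# Minimal sketch, assuming `lpd_point` is an n x K matrix of
# leave-one-out log predictive densities log p(y_i | y_-i, M_k)
stacking_weights <- function(lpd_point) {
  K <- ncol(lpd_point)
  # Softmax parameterization enforces w_k >= 0 and sum_k w_k = 1
  obj <- function(theta) {
    w <- exp(c(theta, 0)); w <- w / sum(w)
    -sum(log(exp(lpd_point) %*% w))  # negative stacked log score
  }
  theta_hat <- optim(rep(0, K - 1), obj, method = "BFGS")$par
  w <- exp(c(theta_hat, 0))
  w / sum(w)
}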
10.3.3 2.3 Example — Predictive Stacking with loo
library(brms)
library(loo)

set.seed(10)

# Simulate data with a quadratic signal
dat <- data.frame(x = rnorm(200))
dat$y <- 1 + 2 * dat$x + 0.5 * dat$x^2 + rnorm(200)

# Candidate models: linear vs. quadratic
m1 <- brm(y ~ x, data = dat, refresh = 0)
m2 <- brm(y ~ x + I(x^2), data = dat, refresh = 0)

# Leave-one-out predictive densities for each model
loo1 <- loo(m1)
loo2 <- loo(m2)

# Model weights based on LOO predictive densities
w_stack  <- loo_model_weights(list(loo1, loo2), method = "stacking")
w_pseudo <- loo_model_weights(list(loo1, loo2), method = "pseudobma")
w_stack
w_pseudo
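The stacking weights can then be used to form model-averaged predictions. A minimal sketch, assuming a hypothetical prediction grid newdat: posterior predictive draws are sampled from each model in proportion to its weight, giving a mixture of predictive distributions rather than a single-model forecast. (brms also offers the pp_average() helper for this kind of combination.)
# Minimal sketch: mixture predictions using the stacking weights
# (`newdat` is a hypothetical prediction grid)
newdat <- data.frame(x = seq(-2, 2, length.out = 50))
pp1 <- posterior_predict(m1, newdata = newdat)
pp2 <- posterior_predict(m2, newdata = newdat)
S <- min(nrow(pp1), nrow(pp2))
pick <- sample(1:2, S, replace = TRUE, prob = as.numeric(w_stack))
mix <- rbind(pp1[seq_len(S), , drop = FALSE][pick == 1, , drop = FALSE],
             pp2[seq_len(S), , drop = FALSE][pick == 2, , drop = FALSE])
colMeans(mix)  # model-averaged point predictions; `mix` carries the mixture uncertainty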