9  Week 8 — Advanced Bayesian Computation

This week explores two major developments that enable scalable Bayesian inference for complex or high-dimensional models:
Hamiltonian Monte Carlo (HMC) and Variational Inference (VI).
We study their principles, intuition, and practical use in software such as Stan and brms.


9.1 Learning Goals

By the end of this week, you should be able to:

  • Explain the motivation for advanced sampling and approximation methods.
  • Describe the mechanics and intuition of Hamiltonian Monte Carlo.
  • Understand the trade-offs between exact (MCMC) and approximate (VI) inference.
  • Run basic HMC and VI fits using modern R interfaces.
  • Interpret diagnostics for both approaches.

9.2 Lecture 1 — Hamiltonian Monte Carlo (HMC)

9.2.1 1.1 Motivation

Traditional MCMC (e.g., Metropolis–Hastings, Gibbs) can mix slowly in high dimensions.
Hamiltonian Monte Carlo accelerates exploration by using gradient information from the log posterior to simulate physical motion through parameter space.


9.2.2 1.2 Hamiltonian Dynamics

We introduce an auxiliary “momentum” variable \(p\) and define the Hamiltonian: \[ H(\theta,p) = U(\theta) + K(p), \] where
- \(U(\theta) = -\log p(\theta\mid y)\) (potential energy = negative log posterior),
- \(K(p) = \tfrac{1}{2} p^\top M^{-1}p\) (kinetic energy, with mass matrix \(M\)).

The system evolves via Hamilton’s equations: \[ \frac{d\theta}{dt} = \frac{\partial H}{\partial p}, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial \theta}. \]
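
To make these quantities concrete, here is a minimal sketch in R for a one-dimensional standard-normal posterior with unit mass matrix; the target and the helper names (U, grad_U, K, H) are chosen purely for illustration and are reused in the leapfrog sketch below.

# Minimal sketch: energies for a 1-D standard-normal posterior, mass matrix M = 1
U      <- function(theta) 0.5 * theta^2        # potential energy = -log posterior (up to a constant)
grad_U <- function(theta) theta                # gradient of U
K      <- function(p) 0.5 * p^2                # kinetic energy with unit mass
H      <- function(theta, p) U(theta) + K(p)   # total energy
H(theta = 1, p = 0.5)                          # Hamiltonian at one point in phase space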


9.2.3 1.3 Leapfrog Integration

To approximate continuous motion, HMC uses a leapfrog integrator with step size \(\epsilon\) and \(L\) steps:

  1. \(p_{t+\epsilon/2} = p_t - \frac{\epsilon}{2}\nabla_\theta U(\theta_t)\)
  2. \(\theta_{t+\epsilon} = \theta_t + \epsilon M^{-1}p_{t+\epsilon/2}\)
  3. \(p_{t+\epsilon} = p_{t+\epsilon/2} - \frac{\epsilon}{2}\nabla_\theta U(\theta_{t+\epsilon})\)

After simulating this path, we apply a Metropolis acceptance step using the change in \(H\).
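
Putting the pieces together, the sketch below implements one leapfrog trajectory plus the Metropolis accept/reject step for the toy target above; the step size, number of steps, and function names are arbitrary illustration choices, not Stan's implementation.

# Sketch of a single HMC transition, reusing U(), grad_U() and H() from the sketch above
leapfrog <- function(theta, p, eps, L) {
  p <- p - eps / 2 * grad_U(theta)             # initial half step for momentum
  for (l in seq_len(L)) {
    theta <- theta + eps * p                   # full step for position (M = 1)
    if (l < L) p <- p - eps * grad_U(theta)    # full step for momentum, except at the end
  }
  p <- p - eps / 2 * grad_U(theta)             # final half step for momentum
  list(theta = theta, p = -p)                  # negate momentum so the proposal is symmetric
}

hmc_step <- function(theta, eps = 0.1, L = 20) {
  p <- rnorm(length(theta))                    # fresh momentum draw
  prop <- leapfrog(theta, p, eps, L)
  if (log(runif(1)) < H(theta, p) - H(prop$theta, prop$p)) prop$theta else theta
}

set.seed(1)
draws <- numeric(1000); draws[1] <- 0
for (i in 2:1000) draws[i] <- hmc_step(draws[i - 1])
c(mean = mean(draws), sd = sd(draws))          # should be roughly 0 and 1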


9.2.4 1.4 Intuition

  • The gradient \(\nabla_\theta U(\theta)\) guides proposals along high-density regions, avoiding random walk behavior.
  • Proper tuning of step size \(\epsilon\) and number of steps \(L\) yields efficient exploration.
  • Modern implementations (e.g., Stan) adapt these automatically via the No-U-Turn Sampler (NUTS).

9.2.5 1.5 Example — Logistic Regression with HMC (Stan)

library(brms)
set.seed(8)

# Simulated logistic data
N <- 200
x <- rnorm(N)
y <- rbinom(N, 1, plogis(-1 + 2*x))
dat <- data.frame(x, y)

# Fit via Hamiltonian Monte Carlo (NUTS)
fit_hmc <- brm(y ~ x, data=dat, family=bernoulli(), chains=2, iter=2000, refresh=0)
summary(fit_hmc)
plot(fit_hmc)

Stan’s NUTS algorithm performs automatic adaptation of step size and trajectory length.
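
If you want to see what the adaptation settled on, one option (a sketch assuming the default rstan backend) is to pull the NUTS sampler quantities via nuts_params():

# Post-warmup NUTS quantities: one row per chain, iteration and parameter
np <- nuts_params(fit_hmc)
unique(np[np$Parameter == "stepsize__", c("Chain", "Value")])  # adapted step size per chain
max(np$Value[np$Parameter == "treedepth__"])                   # deepest trajectory used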


9.2.6 1.6 Diagnosing HMC Performance

Key diagnostics:

- Divergent transitions → numerical instability (reduce the step size or re-scale parameters).
- Energy Bayesian Fraction of Missing Information (E-BFMI) → low values (< 0.3) indicate poor exploration.
- \(\widehat{R}\) and effective sample size → check convergence and mixing.

library(bayesplot)
# mcmc_nuts_divergence() expects the NUTS parameters and log-posterior, not the fit object itself
mcmc_nuts_divergence(nuts_params(fit_hmc), log_posterior(fit_hmc))
mcmc_trace(as.array(fit_hmc), pars = c("b_Intercept", "b_x"))
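
The energy and convergence checks mentioned above can be run in the same way; a brief sketch for the same fit:

mcmc_nuts_energy(nuts_params(fit_hmc))   # E-BFMI: the two energy histograms should roughly overlap
summary(rhat(fit_hmc))                   # values near 1 indicate convergence
summary(neff_ratio(fit_hmc))             # very small ratios flag poor mixing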

9.2.7 1.7 Advantages of HMC

Feature                               Benefit
Gradient-based proposals              Rapid movement through high-density regions
Higher acceptance rates               Fewer rejections than random-walk MH
Fewer tuning parameters               Automatic adaptation (NUTS)
Robust for high-dimensional models    Used in most modern Bayesian software

9.3 Lecture 2 — Variational Inference (VI)

9.3.1 2.1 Motivation

When exact sampling is too costly (e.g., massive datasets, deep models),
Variational Inference (VI) approximates the posterior by a simpler distribution \(q_\lambda(\theta)\) within a parameterized family.


9.3.2 2.2 Objective Function

We minimize the Kullback–Leibler (KL) divergence: \[ \text{KL}(q_\lambda(\theta) \,\|\, p(\theta\mid y)) = \int q_\lambda(\theta)\log\frac{q_\lambda(\theta)}{p(\theta\mid y)}\,d\theta. \]

Equivalently, we maximize the Evidence Lower Bound (ELBO): \[ \text{ELBO}(\lambda) = E_{q_\lambda}[\log p(y,\theta)] - E_{q_\lambda}[\log q_\lambda(\theta)]. \] The higher the ELBO, the closer \(q_\lambda(\theta)\) is to the true posterior.
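
Why maximizing the ELBO minimizes the KL divergence follows from a decomposition of the log evidence: \[ \log p(y) = \text{ELBO}(\lambda) + \text{KL}(q_\lambda(\theta) \,\|\, p(\theta\mid y)). \] Because \(\log p(y)\) does not depend on \(\lambda\), increasing the ELBO must decrease the KL divergence by the same amount.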


9.3.3 2.3 Mean-Field Approximation

A common simplification assumes factorization: \[ q_\lambda(\theta) = \prod_j q_{\lambda_j}(\theta_j), \] which allows coordinate-wise optimization of each factor.
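
Concretely, coordinate-ascent VI (CAVI) updates one factor at a time while holding the others fixed: \[ q_{\lambda_j}^{*}(\theta_j) \propto \exp\!\left( E_{q_{-j}}\!\left[\log p(y,\theta)\right] \right), \] where the expectation is taken over all factors except the \(j\)-th; cycling through the factors increases the ELBO at every step.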


9.3.4 2.4 Example — Variational Bayes for a Normal Mean

Assume \(y_i\sim N(\theta,1)\) with prior \(\theta\sim N(0,1)\).

Analytically, the posterior is \(N\left(\frac{n\bar{y}}{n+1}, \frac{1}{n+1}\right)\).
We approximate it variationally by another normal \(q(\theta)=N(m,s^2)\),
and find \(m,s^2\) maximizing ELBO.

set.seed(9)
y <- rnorm(50, mean=1)
n <- length(y)
log_joint <- function(theta) sum(dnorm(y, theta, 1, log=TRUE)) + dnorm(theta, 0, 1, log=TRUE)

# Because the exact posterior is itself normal, the KL-optimal normal q(theta) recovers it exactly:
m_vi <- n*mean(y)/(n+1)
s2_vi <- 1/(n+1)
c(mean=m_vi, sd=sqrt(s2_vi))
     mean        sd 
0.8884051 0.1400280 
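
To see the ELBO machinery explicitly, the sketch below maximizes a Monte Carlo estimate of the ELBO, reusing y and log_joint() from the code above; the reparameterization trick, the fixed draws eps, and the name elbo() are illustration-only choices, and the optimum should land near m_vi and sqrt(s2_vi).

# Numerical check: maximize a Monte Carlo ELBO over q(theta) = Normal(m, s)
set.seed(11)
eps <- rnorm(2000)                               # fixed standard-normal draws (common random numbers)
elbo <- function(par) {
  m <- par[1]; s <- exp(par[2])                  # optimize log(s) so that s stays positive
  theta <- m + s * eps                           # reparameterized draws from q
  mean(sapply(theta, log_joint) - dnorm(theta, m, s, log = TRUE))
}
opt <- optim(c(0, 0), function(par) -elbo(par), method = "BFGS")
c(m_hat = opt$par[1], s_hat = exp(opt$par[2]))   # compare with m_vi and sqrt(s2_vi) above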

9.3.5 2.5 Automatic VI with brms

library(brms)
set.seed(10)
N <- 1000
x <- rnorm(N)
y <- 2 + 1.5*x + rnorm(N)
dat <- data.frame(x,y)

fit_vi <- brm(y ~ x, data=dat, family=gaussian(),
              algorithm="meanfield", iter=5000, refresh=0)
summary(fit_vi)

VI provides a fast deterministic approximation, trading off accuracy for scalability.


9.3.6 2.6 Comparison: HMC vs VI

Feature       HMC (NUTS)                             Variational Inference
Type          Sampling (asymptotically exact)        Optimization (approximate)
Accuracy      Very high                              Depends on variational family
Speed         Slower                                 Very fast
Diagnostics   Convergence via \(\widehat{R}\), ESS   ELBO convergence
Use case      Complex models or small data           Massive or real-time problems

9.3.7 2.7 Visual Comparison (Conceptual)

theta <- seq(-3, 3, length.out = 400)
posterior <- dnorm(theta, 0, 1)        # true posterior
vi_approx <- dnorm(theta, 0, 0.6)      # narrower approximation: mean-field VI tends to underestimate variance
plot(theta, posterior, type = "l", lwd = 2, col = "black", ylim = c(0, 0.7),
     main = "Posterior (HMC) vs Variational Approximation",
     ylab = "Density", xlab = expression(theta))
lines(theta, vi_approx, col = "orange", lwd = 2, lty = 2)
legend("topright", legend = c("Exact posterior (HMC)", "VI approximation"),
       col = c("black", "orange"), lwd = 2, lty = c(1, 2), bty = "n")


9.3.8 2.8 Practical Advice

  • Use HMC (NUTS) as the default for accuracy and diagnostics.
  • Use VI for large-scale models, initialization, or quick exploration.
  • Compare results: if VI and HMC differ substantially, favor HMC (a quick comparison is sketched below).
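
One way to carry out that comparison, sketched under the assumption that fit_vi and dat from Section 2.5 are still in memory (the refit name fit_hmc2 is ours):

# Refit the Section 2.5 model with NUTS and compare fixed-effect estimates
fit_hmc2 <- brm(y ~ x, data = dat, family = gaussian(),
                chains = 2, iter = 2000, refresh = 0)
round(cbind(VI  = fixef(fit_vi)[, "Estimate"],
            HMC = fixef(fit_hmc2)[, "Estimate"]), 3)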

9.4 Homework 8

  1. Conceptual
    • Explain the difference between sampling-based and optimization-based inference.
    • What role does the ELBO play in VI?
  2. Computational
    • Fit a simple linear regression using both HMC and VI in brms.
    • Compare posterior means, standard deviations, and computation time.
  3. Reflection
    • In what types of real-world problems might VI be preferred over HMC?
    • How would you check whether your VI approximation is adequate?

9.5 Key Takeaways

Concept                      Summary
Hamiltonian Monte Carlo      Uses gradients to propose efficient moves through parameter space.
No-U-Turn Sampler (NUTS)     Adapts step size and trajectory automatically.
Variational Inference        Optimizes a tractable approximation to the posterior.
ELBO                         Objective function for VI; measures closeness to the true posterior.
Trade-off                    HMC = accuracy, VI = speed; choice depends on model and data size.

Next Week: Bayesian Model Averaging and Ensemble Learning — combining multiple Bayesian models for improved predictive performance.