11  Week 10 — Bayesian Nonparametrics

This week introduces Bayesian Nonparametric (BNP) models, which place priors on infinite-dimensional objects such as densities and functions, letting model complexity grow with the data rather than being fixed in advance.
We focus on two cornerstone approaches: Dirichlet Processes for mixture modeling and clustering, and Gaussian Processes for regression over unknown functions.


11.1 Learning Goals

By the end of this week, you should be able to:

  • Explain the motivation for nonparametric Bayesian models.
  • Describe the Dirichlet Process and its stick-breaking and Chinese Restaurant representations.
  • Explain how DP mixture models perform clustering automatically.
  • Understand Gaussian Processes for function estimation.
  • Implement simple DP and GP examples in R using available packages.

11.2 Lecture 1 — Dirichlet Process Models

11.2.1 1.1 Motivation

Classical parametric models fix the number of parameters (e.g., number of clusters).
Dirichlet Processes (DPs) let the data determine model complexity by placing a prior over mixing distributions with an unbounded number of components.


11.2.2 1.2 The Dirichlet Process

A Dirichlet Process \(\text{DP}(\alpha, G_0)\) is a distribution over distributions such that,
for any finite measurable partition \(A_1, \ldots, A_k\) of the sample space, \[ (G(A_1),\ldots,G(A_k)) \sim \text{Dirichlet}(\alpha G_0(A_1), \ldots, \alpha G_0(A_k)). \]

  • \(G_0\): base (prior mean) distribution.
  • \(\alpha\): concentration parameter controlling clustering strength.

As \(\alpha \to 0\): few clusters (more sharing).
As \(\alpha \to \infty\): many clusters (approaches \(G_0\)).
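
To build intuition for \(\alpha\), the following minimal sketch (base R only; the three-set partition with equal mass \(1/3\) under \(G_0\) is an illustrative choice) draws the finite-dimensional vectors \((G(A_1), G(A_2), G(A_3))\) for several values of \(\alpha\): small \(\alpha\) gives highly variable draws, while large \(\alpha\) gives draws concentrated around the \(G_0\) probabilities.

set.seed(1)

# Draw one Dirichlet vector via normalized Gamma variables (base R only)
rdirichlet1 <- function(a) {
  g <- rgamma(length(a), shape = a)
  g / sum(g)
}

# Partition into three sets with equal mass 1/3 under G0 (illustrative choice)
G0_mass <- rep(1/3, 3)

for (alpha in c(0.5, 5, 50)) {
  draws <- t(replicate(1000, rdirichlet1(alpha * G0_mass)))
  cat("alpha =", alpha,
      " mean =", round(colMeans(draws), 2),
      " sd =", round(apply(draws, 2, sd), 2), "\n")
}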


11.2.3 1.3 Stick-Breaking Representation

A constructive definition (Sethuraman, 1994): \[ G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}, \quad \theta_k \sim G_0, \quad \pi_k = v_k \prod_{l<k}(1-v_l), \quad v_k \sim \text{Beta}(1,\alpha). \]

The weights \(\pi_k\) form an infinite sequence summing to 1 — conceptually “breaking a stick” into random lengths.
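
A truncated version of this construction is easy to simulate. The sketch below is illustrative only: the truncation level \(K = 50\), the value \(\alpha = 2\), and the base measure \(G_0 = \mathcal{N}(0,1)\) are assumptions chosen for the example.

set.seed(2)

alpha <- 2           # concentration parameter (illustrative value)
K     <- 50          # truncation level for the infinite sum

v     <- rbeta(K, 1, alpha)                # stick-breaking proportions v_k ~ Beta(1, alpha)
pi_k  <- v * cumprod(c(1, 1 - v[-K]))      # pi_k = v_k * prod_{l<k} (1 - v_l)
theta <- rnorm(K)                          # atoms drawn from G0 = N(0, 1)

sum(pi_k)   # close to 1 when K is large relative to alpha

# Visualize the (approximately) discrete random measure G
plot(theta, pi_k, type = "h", lwd = 2,
     xlab = expression(theta[k]), ylab = expression(pi[k]),
     main = "Truncated stick-breaking draw from a DP")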


11.2.4 1.4 DP Mixture Model

For data \(y_1,\ldots,y_n\): \[ y_i \mid \theta_i \sim F(\theta_i), \qquad \theta_i \mid G \sim G, \qquad G \sim \text{DP}(\alpha, G_0). \]

Marginally, this induces clustering: a draw \(G\) from a DP is almost surely discrete, so several \(\theta_i\) coincide and the corresponding \(y_i\) share the same mixture component.
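
Forward simulation makes this sharing explicit. The following sketch reuses a truncated stick-breaking draw (the truncation level, \(\alpha\), \(G_0\), and the kernel standard deviation are illustrative assumptions) to generate data from a DP mixture of normals and count how many atoms are actually used.

set.seed(3)

alpha <- 1; K <- 50; n <- 200
sigma <- 0.3                          # kernel standard deviation, fixed for illustration

v     <- rbeta(K, 1, alpha)
pi_k  <- v * cumprod(c(1, 1 - v[-K]))
theta <- rnorm(K, mean = 0, sd = 3)   # atoms from G0 = N(0, 3^2)

z <- sample(K, n, replace = TRUE, prob = pi_k)   # component labels
y <- rnorm(n, mean = theta[z], sd = sigma)       # y_i | theta_{z_i} ~ N(theta_{z_i}, sigma^2)

length(unique(z))   # number of distinct atoms actually used (clusters)
hist(y, breaks = 40, main = "Data from a truncated DP mixture of normals")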


11.2.5 1.5 Chinese Restaurant Process (CRP)

Equivalent generative process:

  1. Customer 1 starts a new table.
  2. Customer \(i\) joins an existing table \(k\) with probability
    \(n_k / (\alpha + i - 1)\), where \(n_k\) is the number of customers already seated at table \(k\),
    or starts a new table with probability \(\alpha / (\alpha + i - 1)\).

This describes how clusters grow adaptively as data arrive.
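
The seating rule translates directly into a short simulation. The sketch below (illustrative values of \(n\) and \(\alpha\)) generates table assignments and reports the number of occupied tables, which grows slowly, roughly like \(\alpha \log n\).

set.seed(4)

crp <- function(n, alpha) {
  tables <- integer(0)                 # tables[k] = number of customers at table k
  z <- integer(n)                      # z[i] = table chosen by customer i
  for (i in seq_len(n)) {
    probs <- c(tables, alpha) / (alpha + i - 1)   # existing tables, then a new one
    z[i]  <- sample(length(probs), 1, prob = probs)
    if (z[i] > length(tables)) tables <- c(tables, 0)
    tables[z[i]] <- tables[z[i]] + 1
  }
  z
}

z <- crp(500, alpha = 2)
length(unique(z))          # number of occupied tables
table(z)[1:5]              # sizes of the first few tables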


11.2.6 1.6 Example — Simulated DP Mixture of Normals

set.seed(10)
library(dirichletprocess)

# Generate mixture data
y <- c(rnorm(50, -3, 0.5), rnorm(50, 3, 0.5))

# Fit a Dirichlet Process mixture of Gaussians (2000 MCMC iterations)
dp <- DirichletProcessGaussian(y)
dp <- Fit(dp, its = 2000)

# Plot the posterior density estimate over the data
plot(dp) + ggplot2::ggtitle("Dirichlet Process Gaussian Mixture")

Interpretation: the DP mixture automatically discovers clusters without specifying their number in advance.


11.2.7 1.7 Practical Notes

  • The posterior number of clusters depends on \(\alpha\) and data separation.
  • Inference is typically performed via Gibbs sampling or truncated variational approximations.
  • Extensions: Hierarchical DP, DP regression, and DP topic models.

11.3 Lecture 2 — Gaussian Processes for Regression

11.3.1 2.1 Motivation

A Gaussian Process (GP) defines a prior directly over functions, enabling flexible nonlinear regression without specifying a parametric form.


11.3.2 2.2 Definition

A GP is a collection of random variables \(f(x)\) such that every finite subset has a joint multivariate normal distribution: \[ f(x) \sim \text{GP}(m(x), k(x,x')), \] where

  • \(m(x) = E[f(x)]\): mean function,
  • \(k(x,x') = \text{Cov}(f(x), f(x'))\): covariance (kernel) function.
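
Because every finite set of function values is jointly Gaussian, prior draws can be visualized by building a kernel matrix on a grid and sampling from the corresponding multivariate normal. The sketch below assumes a squared exponential kernel with illustrative hyperparameters and adds a small jitter for numerical stability.

set.seed(5)

# Squared exponential kernel; tau and ell are illustrative hyperparameters
se_kernel <- function(x1, x2, tau = 1, ell = 0.5) {
  tau^2 * exp(-outer(x1, x2, "-")^2 / (2 * ell^2))
}

xg <- seq(-3, 3, length.out = 200)
K  <- se_kernel(xg, xg) + diag(1e-8, length(xg))  # jitter keeps K positive definite

# Draw three prior functions: f = L %*% z with K = L L', z ~ N(0, I)
L <- t(chol(K))
f <- L %*% matrix(rnorm(length(xg) * 3), ncol = 3)

matplot(xg, f, type = "l", lty = 1, lwd = 2,
        xlab = "x", ylab = "f(x)", main = "Draws from a GP prior (SE kernel)")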


11.3.3 2.3 GP Regression Model

For data \((x_i, y_i)\): \[ y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2). \] The posterior of \(f\) at test inputs \(X_*\) given \(y\) is again Gaussian: \[ f_* \mid y, X, X_* \sim \mathcal{N}(\bar{f}_*, \text{Cov}(f_*)), \] with \[ \bar{f}_* = K(X_*, X)\,[K(X, X) + \sigma^2 I]^{-1} y, \qquad \text{Cov}(f_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma^2 I]^{-1} K(X, X_*), \] where \(K(\cdot,\cdot)\) denotes the matrix of kernel evaluations.
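
These formulas translate into a few lines of linear algebra. The sketch below computes the posterior mean and pointwise standard deviation by hand, assuming a squared exponential kernel, illustrative hyperparameters, and a known noise variance (no hyperparameter learning).

set.seed(6)

se_kernel <- function(x1, x2, tau = 1, ell = 1) {
  tau^2 * exp(-outer(x1, x2, "-")^2 / (2 * ell^2))
}

# Training data and test grid
x  <- seq(-3, 3, length.out = 30)
y  <- sin(x) + rnorm(30, sd = 0.2)
xs <- seq(-3, 3, length.out = 200)
sigma2 <- 0.2^2                                   # noise variance, assumed known here

Kxx <- se_kernel(x, x) + sigma2 * diag(length(x))
Ksx <- se_kernel(xs, x)
Kss <- se_kernel(xs, xs)

Kinv_y <- solve(Kxx, y)
f_mean <- Ksx %*% Kinv_y                          # posterior mean
f_cov  <- Kss - Ksx %*% solve(Kxx, t(Ksx))        # posterior covariance
f_sd   <- sqrt(pmax(diag(f_cov), 0))              # pointwise standard deviation

plot(x, y, pch = 19, xlab = "x", ylab = "y", main = "GP posterior by hand")
lines(xs, f_mean, lwd = 2, col = "darkorange")
lines(xs, f_mean + 2 * f_sd, lty = 2, col = "gray40")
lines(xs, f_mean - 2 * f_sd, lty = 2, col = "gray40")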


11.3.4 2.4 Common Kernels

Commonly used kernels include:

  • Squared Exponential: \(k(x,x') = \tau^2 \exp\!\left(-\frac{(x-x')^2}{2\ell^2}\right)\). Smooth, infinitely differentiable sample paths.
  • Matérn: \(k(x,x') = \tau^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,|x-x'|}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,|x-x'|}{\ell}\right)\). Smoothness controlled by \(\nu\).
  • Periodic: \(k(x,x') = \tau^2 \exp\!\left(-\frac{2\sin^2\!\left(\pi |x-x'| / p\right)}{\ell^2}\right)\). Captures repeating structure with period \(p\).
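
For reference, these kernels can be written as short R functions; the hyperparameter defaults below are illustrative, and the Matérn uses the modified Bessel function besselK from base R.

# Squared exponential kernel
k_se <- function(x, xp, tau = 1, ell = 1) {
  tau^2 * exp(-(x - xp)^2 / (2 * ell^2))
}

# Matern kernel; k(0) is set to tau^2 by convention
k_matern <- function(x, xp, tau = 1, ell = 1, nu = 1.5) {
  r <- abs(x - xp)
  s <- sqrt(2 * nu) * pmax(r, 1e-12) / ell   # avoid evaluating besselK at exactly 0
  k <- tau^2 * 2^(1 - nu) / gamma(nu) * s^nu * besselK(s, nu)
  ifelse(r == 0, tau^2, k)
}

# Periodic kernel with period p
k_periodic <- function(x, xp, tau = 1, ell = 1, p = 1) {
  tau^2 * exp(-2 * sin(pi * abs(x - xp) / p)^2 / ell^2)
}

k_se(0, 0.5); k_matern(0, 0.5); k_periodic(0, 0.5)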

11.3.5 2.5 Example — Gaussian Process Regression in R

library(GPfit)
set.seed(11)

x <- seq(-3, 3, length.out = 50)
y <- sin(x) + rnorm(50, sd = 0.2)

# GPfit expects inputs in [0, 1], so rescale x before fitting;
# X must also be supplied as a column matrix
x01 <- (x - min(x)) / (max(x) - min(x))
gp_model <- GP_fit(X = matrix(x01, ncol = 1), Y = y)
plot(gp_model)   # built-in diagnostic plot of the fit (on the rescaled inputs)

pred_grid <- seq(-3, 3, length.out = 200)
pred_grid01 <- (pred_grid - min(x)) / (max(x) - min(x))
pred <- predict(gp_model, xnew = matrix(pred_grid01, ncol = 1))

yhat <- as.numeric(pred$Y_hat)
se   <- sqrt(as.numeric(pred$MSE))  # MSE is a variance vector; no diag()

plot(x, y, pch = 19, col = "#00000055",
     main = "Gaussian Process Regression", xlab = "x", ylab = "y")
lines(pred_grid, yhat, lwd = 2, col = "darkorange")
lines(pred_grid, yhat + 2*se, lty = 2, col = "gray40")
lines(pred_grid, yhat - 2*se, lty = 2, col = "gray40")

Figure: Gaussian Process Regression Fit

Interpretation: the GP posterior mean tracks the underlying sine function smoothly, with uncertainty quantified by the dashed \(\pm 2\) standard-error bands.


11.3.6 2.6 GP vs DP: Comparison

  • Domain: distributions and clusters (DP) vs. functions (GP).
  • Output: discrete clustering (DP) vs. continuous regression (GP).
  • Flexibility: an unknown number of mixture components (DP) vs. an infinite-dimensional function space (GP).
  • Typical use: density estimation and mixture modeling (DP) vs. nonlinear regression and spatial data (GP).

11.3.7 2.7 Practical Considerations

  • GP computational cost is \(O(n^3)\); use sparse or inducing-point approximations for large \(n\).
  • Choice of kernel determines function smoothness and inductive bias.
  • In practice, hyperparameters (e.g., \(\ell\), \(\tau\), and the noise \(\sigma\)) are learned by maximizing the marginal likelihood, as sketched below.
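
As a sketch of that last point, the code below maximizes the log marginal likelihood of a squared exponential GP over \(\log\tau\), \(\log\ell\), and \(\log\sigma\) using base R optim; the data and starting values are illustrative.

set.seed(7)

x <- seq(-3, 3, length.out = 40)
y <- sin(x) + rnorm(40, sd = 0.2)

# Negative log marginal likelihood of a GP with SE kernel (parameters on log scale)
neg_log_marlik <- function(par) {
  tau <- exp(par[1]); ell <- exp(par[2]); sigma <- exp(par[3])
  K <- tau^2 * exp(-outer(x, x, "-")^2 / (2 * ell^2)) + sigma^2 * diag(length(x))
  L <- t(chol(K))
  a <- solve(t(L), solve(L, y))                 # K^{-1} y via the Cholesky factor
  0.5 * sum(y * a) + sum(log(diag(L))) + 0.5 * length(y) * log(2 * pi)
}

fit <- optim(c(0, 0, -1), neg_log_marlik, method = "BFGS")
exp(fit$par)   # estimated tau, ell, sigma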

11.4 Homework 10

  1. Conceptual
    • Compare the roles of \(\alpha\) in DP and the kernel parameters in GP.
    • Explain the difference between parametric and nonparametric Bayesian models.
  2. Computational
    • Simulate data from two Gaussian clusters and fit a DP mixture model using dirichletprocess.
    • Fit a GP regression to noisy sinusoidal data using GPfit.
    • Plot both model fits and discuss flexibility.
  3. Reflection
    • When might nonparametric models be overkill?
    • How could hierarchical extensions of DP or GP handle grouped data?

11.5 Key Takeaways

  • Dirichlet Process: a prior over distributions that enables infinite mixture models.
  • Stick-Breaking Construction: represents the DP as an infinite sum of weighted discrete components.
  • Chinese Restaurant Process: an intuitive clustering interpretation of the DP.
  • Gaussian Process: a prior over functions for regression and smoothing.
  • Common feature: in both, model complexity grows with the data; there is no fixed parameter dimension.

Next Week: Bayesian Time Series and State-Space Models — dynamic modeling and sequential inference using Bayesian methods.