This week introduces Bayesian Nonparametric (BNP) models, which place priors on infinite-dimensional objects (functions, densities, or model structure) so that model complexity is not fixed in advance.
We focus on two cornerstone approaches: Dirichlet Processes for mixture modeling and clustering, and Gaussian Processes as flexible priors over functions for regression.
11.1 Learning Goals
By the end of this week, you should be able to:
Explain the motivation for nonparametric Bayesian models.
Describe the Dirichlet Process and its stick-breaking and Chinese Restaurant representations.
Explain how DP mixture models perform clustering automatically.
Understand Gaussian Processes for function estimation.
Implement simple DP and GP examples in R using available packages.
11.2 Lecture 1 — Dirichlet Process Models
11.2.1 1.1 Motivation
Classical parametric models fix the number of parameters (e.g., the number of clusters) in advance. Dirichlet Processes (DPs) let the data determine model complexity by placing a prior over mixtures with an unbounded number of components.
11.2.2 1.2 The Dirichlet Process
A Dirichlet Process \(\text{DP}(\alpha, G_0)\), with concentration parameter \(\alpha > 0\) and base distribution \(G_0\), is a distribution over distributions such that,
for any partition \(A_1, \ldots, A_k\) of the space, \[
(G(A_1),\ldots,G(A_k)) \sim \text{Dirichlet}(\alpha G_0(A_1), \ldots, \alpha G_0(A_k)).
\]
11.2.3 1.3 Stick-Breaking Construction
A draw \(G \sim \text{DP}(\alpha, G_0)\) can be represented explicitly as \[
G = \sum_{k=1}^{\infty} \pi_k \,\delta_{\theta_k}, \qquad
\theta_k \sim G_0, \qquad
\pi_k = \beta_k \prod_{j<k} (1 - \beta_j), \qquad
\beta_k \sim \text{Beta}(1, \alpha).
\]
The weights \(\pi_k\) form an infinite sequence summing to 1 — conceptually “breaking a stick” into random lengths.
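A minimal R sketch of this construction, truncated at a finite number of components \(K\) (the truncation level, base measure \(G_0 = N(0,1)\), and \(\alpha = 2\) are illustrative choices, not values from the lecture):
set.seed(1)
alpha  <- 2
K      <- 50                                       # truncation level (finite approximation)
beta_k <- rbeta(K, 1, alpha)                       # stick-breaking proportions
pi_k   <- beta_k * cumprod(c(1, 1 - beta_k[-K]))   # pi_k = beta_k * prod_{j<k} (1 - beta_j)
theta_k <- rnorm(K)                                # atoms drawn from G0 = N(0, 1)
# G is approximated by the discrete distribution placing mass pi_k at theta_k
sum(pi_k)                                          # close to 1 for moderately large K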
11.2.4 1.4 DP Mixture Model
For data \(y_1,\ldots,y_n\): \[
y_i \mid \theta_i \sim F(\theta_i), \qquad
\theta_i \mid G \sim G, \qquad
G \sim \text{DP}(\alpha, G_0).
\]
Marginally, this induces clustering because multiple \(y_i\) can share the same \(\theta_i\).
11.2.5 1.5 Chinese Restaurant Process (CRP)
Equivalent generative process:
Customer 1 starts a new table.
Customer \(i\) joins an existing table \(k\) with probability \(n_k / (\alpha + i - 1)\), where \(n_k\) is the number of customers already seated at table \(k\),
or starts a new one with probability \(\alpha / (\alpha + i - 1)\).
This describes how clusters grow adaptively as data arrive.
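A minimal R sketch of the CRP as a generative simulation (the value \(\alpha = 1\) and the helper name crp_sim are illustrative):
set.seed(2)
crp_sim <- function(n, alpha) {
  tables <- integer(n)
  tables[1] <- 1                                   # customer 1 starts table 1
  counts <- 1                                      # customers per table
  for (i in 2:n) {
    probs <- c(counts, alpha) / (alpha + i - 1)    # existing tables, then a new table
    k <- sample(length(probs), 1, prob = probs)
    if (k > length(counts)) counts <- c(counts, 0) # open a new table
    counts[k] <- counts[k] + 1
    tables[i] <- k
  }
  tables
}
table(crp_sim(100, alpha = 1))                     # cluster sizes for 100 customers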
11.2.6 1.6 Example — Simulated DP Mixture of Normals
set.seed(10)
library(dirichletprocess)

# Generate mixture data
y <- c(rnorm(50, -3, 0.5), rnorm(50, 3, 0.5))

# Fit a Dirichlet Process Gaussian Mixture
dp <- DirichletProcessGaussian(y)
dp <- Fit(dp, its = 2000)

# Cluster assignments
plot(dp) + ggplot2::ggtitle("Dirichlet Process Gaussian Mixture")
Interpretation: the DP mixture automatically discovers clusters without specifying their number in advance.
11.2.7 1.7 Practical Notes
The posterior number of clusters depends on \(\alpha\) and on how well separated the data are (a quick prior calculation is given after this list).
Inference is typically carried out via Gibbs sampling or a truncated variational approximation.
Extensions: Hierarchical DP, DP regression, and DP topic models.
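As a rough guide to the role of \(\alpha\): under the CRP prior, the expected number of occupied tables (clusters) after \(n\) observations is \[
E[K_n] = \sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1} \approx \alpha \log\!\left(1 + \frac{n}{\alpha}\right),
\] so the prior number of clusters grows only logarithmically with the sample size.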
11.3 Lecture 2 — Gaussian Processes for Regression
11.3.1 2.1 Motivation
A Gaussian Process (GP) defines a prior directly over functions, enabling flexible nonlinear regression without specifying a parametric form.
11.3.2 2.2 Definition
A GP is a collection of random variables \(f(x)\) such that every finite subset has a joint multivariate normal distribution: \[
f(x) \sim \text{GP}(m(x), k(x,x')),
\] where
- \(m(x) = E[f(x)]\): mean function,
- \(k(x,x') = \text{Cov}(f(x),f(x'))\): covariance (kernel) function.
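To build intuition for this prior, here is a minimal R sketch that draws sample functions from a zero-mean GP with a squared-exponential kernel (the kernel choice and hyperparameter values are illustrative, not specified in the notes):
set.seed(3)
x_grid <- seq(-3, 3, length.out = 100)
sqexp_kernel <- function(x1, x2, ell = 1, tau = 1) {
  tau^2 * exp(-outer(x1, x2, "-")^2 / (2 * ell^2))   # k(x, x') = tau^2 exp(-(x - x')^2 / (2 ell^2))
}
K <- sqexp_kernel(x_grid, x_grid) + diag(1e-8, length(x_grid))   # jitter for numerical stability
R <- chol(K)                                                     # K = t(R) %*% R
f_draws <- t(R) %*% matrix(rnorm(3 * length(x_grid)), ncol = 3)  # three draws from the GP prior
matplot(x_grid, f_draws, type = "l", lty = 1,
        xlab = "x", ylab = "f(x)", main = "Draws from a GP prior")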
11.3.3 2.3 GP Regression Model
For data \((x_i, y_i)\): \[
y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2).
\] Posterior of \(f(x)\) given \(y\): \[
f_* \mid y, X, X_* \sim \mathcal{N}(\bar{f}_*, \text{Cov}(f_*)),
\] where the mean and covariance are computed using kernel matrices.
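For reference, assuming a zero mean function and writing \(K = k(X, X)\), \(K_* = k(X, X_*)\), and \(K_{**} = k(X_*, X_*)\), these are \[
\bar{f}_* = K_*^\top (K + \sigma^2 I)^{-1} y, \qquad
\text{Cov}(f_*) = K_{**} - K_*^\top (K + \sigma^2 I)^{-1} K_*.
\]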
11.3.5 2.5 Example — Gaussian Process Regression in R
library(GPfit)
set.seed(11)

x <- seq(-3, 3, length.out = 50)
y <- sin(x) + rnorm(50, sd = 0.2)

# GPfit expects the design matrix scaled to [0, 1], supplied as a column matrix
x01 <- (x - min(x)) / (max(x) - min(x))
gp_model <- GP_fit(X = matrix(x01, ncol = 1), Y = y)
plot(gp_model)
# Predict on a fine grid in the same [0, 1] scale used for fitting
pred_grid01 <- seq(0, 1, length.out = 200)
pred_grid <- min(x) + pred_grid01 * (max(x) - min(x))   # back to the original x scale
pred <- predict(gp_model, xnew = matrix(pred_grid01, ncol = 1))
yhat <- as.numeric(pred$Y_hat)
se <- sqrt(as.numeric(pred$MSE))   # MSE is a variance vector; no diag()

plot(x, y, pch = 19, col = "#00000055",
     main = "Gaussian Process Regression", xlab = "x", ylab = "y")
lines(pred_grid, yhat, lwd = 2, col = "darkorange")
lines(pred_grid, yhat + 2 * se, lty = 2, col = "gray40")
lines(pred_grid, yhat - 2 * se, lty = 2, col = "gray40")
Figure: Gaussian Process Regression Fit
Interpretation: the GP posterior mean tracks the true function smoothly, with uncertainty quantified by the dashed \(\pm 2\) standard-error bands.
11.3.6 2.6 GP vs DP: Comparison
| Aspect | Dirichlet Process | Gaussian Process |
|---|---|---|
| Domain | Distributions / clusters | Functions |
| Output | Discrete clustering | Continuous regression |
| Flexibility | Unknown number of components | Infinite function space |
| Typical Use | Density estimation, mixture modeling | Nonlinear regression, spatial data |
11.3.7 2.7 Practical Considerations
GP computational cost is \(O(n^3)\); use sparse or inducing-point approximations for large \(n\).
Choice of kernel determines function smoothness and inductive bias.
In practice, kernel hyperparameters (e.g., the length-scale \(\ell\) and scale \(\tau\)) are learned by maximizing the marginal likelihood.
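For reference, the objective maximized in that last step is the log marginal likelihood, which for a zero-mean GP with kernel matrix \(K_\theta\) (depending on the hyperparameters \(\theta\)) is \[
\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^\top (K_\theta + \sigma^2 I)^{-1} y
- \tfrac{1}{2} \log \bigl| K_\theta + \sigma^2 I \bigr|
- \tfrac{n}{2} \log 2\pi,
\] which trades off data fit against model complexity.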
11.4 Homework 10
Conceptual
Compare the roles of \(\alpha\) in DP and the kernel parameters in GP.
Explain the difference between parametric and nonparametric Bayesian models.
Computational
Simulate data from two Gaussian clusters and fit a DP mixture model using dirichletprocess.
Fit a GP regression to noisy sinusoidal data using GPfit.
Plot both model fits and discuss flexibility.
Reflection
When might nonparametric models be overkill?
How could hierarchical extensions of DP or GP handle grouped data?
11.5 Key Takeaways
| Concept | Summary |
|---|---|
| Dirichlet Process | Prior over distributions enabling infinite mixture models. |
| Stick-Breaking Construction | Represents the DP as a weighted infinite discrete mixture. |
| Chinese Restaurant Process | Intuitive clustering interpretation of the DP. |
| Gaussian Process | Defines a prior over functions for regression and smoothing. |
| Common Feature | Model complexity grows with the data; no fixed parameter dimension. |
Next Week: Bayesian Time Series and State-Space Models — dynamic modeling and sequential inference using Bayesian methods.