5  Gibbs Sampler

Leading objectives:

  • understand why posterior approximation is needed beyond conjugate models
  • learn how to approximate posteriors using discrete grids and Gibbs sampling
  • understand how Gibbs sampling uses full conditional distributions to generate dependent posterior samples

5.1 Introduction

For many multi-parameter Bayesian models, the joint posterior distribution does not belong to a standard family (e.g., the exponential family) and is therefore difficult to sample from directly. However, it is often the case that sampling from the full conditional distribution of each parameter is straightforward.

In such situations, posterior approximation can be carried out using the Gibbs sampler, an iterative Markov chain Monte Carlo (MCMC) algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution. Here, we introduce the Gibbs sampler in the context of the normal model with a semi-conjugate prior, and study how well it approximates the posterior distribution.

5.2 A Semi-conjugate Prior Distribution

For the normal model, we may express our uncertainty about the population mean \(\theta\) as depending on the sampling variance \(\sigma^2\) via

\[ \theta \mid \sigma^2 \sim N\!\left(\mu_0,\; \frac{\sigma^2}{\kappa_0}\right). \]

This formulation ties the prior variance of \(\theta\) to the sampling variability of the data, and \(\kappa_0\) can be interpreted as the number of prior observations on which the prior mean \(\mu_0\) is based.

In some settings this dependence is reasonable, but in others we may wish to specify prior uncertainty about \(\theta\) independently of \(\sigma^2\), so that \[ p(\theta, \sigma^2) = p(\theta)\,p(\sigma^2). \]

One such specification is the following semiconjugate prior distribution: \[ \theta \sim \text{Normal}(\mu_0, \tau_0^2), \qquad \frac{1}{\sigma^2} \sim \text{Gamma}\!\left(\frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right). \]
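To see the independence built into this prior, one can simulate from it directly. A minimal base-R sketch, using the hyperparameter values adopted later in this chapter (\(\mu_0 = 1.9\), \(\tau_0^2 = 0.95^2\), \(\nu_0 = 1\), \(\sigma_0^2 = 0.01\)):

```r
## Draws from the semiconjugate prior: theta and the precision 1/sigma^2
## are sampled independently of each other.
set.seed(1)
S <- 5000
mu0 <- 1.9; tau0_sq <- 0.95^2   # prior mean and variance of theta
nu0 <- 1;   s20     <- 0.01     # prior parameters for 1/sigma^2

theta_prior  <- rnorm(S, mu0, sqrt(tau0_sq))
prec_prior   <- rgamma(S, shape = nu0 / 2, rate = nu0 * s20 / 2)
sigma2_prior <- 1 / prec_prior
```

Unlike the conjugate prior, the sampled values of \(\theta\) carry no information about \(\sigma^2\): their sample correlation is near zero.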

Posterior Distribution of \(\theta \mid \sigma^2\)

Suppose \[ Y_1, \dots, Y_n \mid \theta, \sigma^2 \;\overset{i.i.d.}{\sim}\; \text{Normal}(\theta, \sigma^2). \]

Then, conditional on \(\sigma^2\) and the observed data \(y_1, \dots, y_n\), the posterior distribution of \(\theta\) is \[ \theta \mid \sigma^2, y_1, \dots, y_n \;\sim\; \text{Normal}(\mu_n, \tau_n^2), \] where \[ \mu_n = \frac{\mu_0 / \tau_0^2 + n \bar{y} / \sigma^2} {1 / \tau_0^2 + n / \sigma^2}, \qquad \tau_n^2 = \left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right)^{-1}. \]

This conditional posterior distribution will form one step of the Gibbs sampler.
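As a sketch, the conditional posterior parameters \(\mu_n\) and \(\tau_n^2\) can be computed for any fixed \(\sigma^2\). The data and hyperparameters below are those used later in this chapter; the plugged-in value \(\sigma^2 = 0.017\) (the sample variance) is purely illustrative:

```r
## Conditional posterior N(mu_n, tau_n^2) of theta for a fixed sigma^2.
theta_conditional <- function(y, mu0, tau0_sq, sigma_sq) {
  n        <- length(y)
  tau_n_sq <- 1 / (1 / tau0_sq + n / sigma_sq)
  mu_n     <- (mu0 / tau0_sq + n * mean(y) / sigma_sq) * tau_n_sq
  c(mean = mu_n, var = tau_n_sq)
}

y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)  # midge data
theta_conditional(y, mu0 = 1.9, tau0_sq = 0.95^2, sigma_sq = 0.017)
```

The posterior mean is a precision-weighted average of \(\mu_0\) and \(\bar y\); with \(n = 9\) and a small \(\sigma^2\), it sits very close to \(\bar y = 1.804\).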

Note: Key Takeaway
  • The joint posterior of \((\theta, \sigma^2)\) is not available in closed form.
  • The full conditional distributions of \(\theta \mid \sigma^2, y\) and \(\sigma^2 \mid \theta, y\) are available in standard forms.
  • This structure makes the Gibbs sampler a natural and efficient tool for posterior approximation.

In the next section, we derive the full conditional distribution of \(\sigma^2\) and combine the two conditional updates into a complete Gibbs sampling algorithm.

In the conjugate case where \(\tau_0^2\) is proportional to \(\sigma^2\), we showed that the marginal posterior distribution \[ p(\sigma^2 \mid y_1,\ldots,y_n) \] is an inverse-gamma distribution. In this setting, Monte Carlo samples of \((\theta,\sigma^2)\) from the joint posterior distribution can be obtained by the following two-step procedure:

  1. Sample \[ \sigma^{2(s)} \sim p(\sigma^2 \mid y_1,\ldots,y_n), \] which is an inverse-gamma distribution.

  2. Sample \[ \theta^{(s)} \sim p(\theta \mid \sigma^{2(s)}, y_1,\ldots,y_n), \] which is a normal distribution.

This approach works because both distributions (the marginal posterior of \(\sigma^2\) and the conditional posterior of \(\theta\) given \(\sigma^2\)) are standard and easy to sample from.
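The two steps can be sketched in base R. The posterior parameters below (`a_n`, `b_n`, `mu_n`, `kappa_n`) are hypothetical stand-ins for the conjugate-case quantities \(\nu_n/2\), \(\nu_n\sigma_n^2/2\), \(\mu_n\), and \(\kappa_n\) from the previous chapter:

```r
## Two-step Monte Carlo sampling from the joint posterior (conjugate case).
set.seed(1)
S <- 10000
a_n  <- 5;   b_n     <- 0.1  # hypothetical inverse-gamma shape and rate
mu_n <- 1.8; kappa_n <- 10   # hypothetical posterior mean and sample size

sigma2_mc <- 1 / rgamma(S, a_n, rate = b_n)            # step 1: marginal of sigma^2
theta_mc  <- rnorm(S, mu_n, sqrt(sigma2_mc / kappa_n)) # step 2: theta | sigma^2
```

Each pair \((\theta^{(s)}, \sigma^{2(s)})\) is an independent draw from the joint posterior, which is what makes the conjugate case so convenient.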

However, when \(\tau_0^2\) is not proportional to \(\sigma^2\), the marginal posterior distribution of the precision \[ \frac{1}{\sigma^2} \] is not a gamma distribution, nor any other standard distribution from which we can easily sample. As a result, direct Monte Carlo sampling from the marginal posterior is no longer straightforward, motivating the need for alternative approximation methods.

5.3 Discrete Approximations

5.3.1 Posterior Density Ratios

Let \[ \tilde{\sigma}^2 = \frac{1}{\sigma^2} \] denote the precision. Recall that the posterior distribution of \((\theta, \tilde{\sigma}^2)\) is equal to the joint distribution \[ p(\theta, \tilde{\sigma}^2, y_1,\dots,y_n), \] divided by \(p(y_1,\dots,y_n)\), which does not depend on the parameters. Therefore, the relative posterior probabilities of two parameter values \((\theta_1, \tilde{\sigma}_1^2)\) and \((\theta_2, \tilde{\sigma}_2^2)\) are directly computable: \[ \frac{p(\theta_1, \tilde{\sigma}_1^2 \mid y_1,\dots,y_n)} {p(\theta_2, \tilde{\sigma}_2^2 \mid y_1,\dots,y_n)} = \frac{p(\theta_1, \tilde{\sigma}_1^2, y_1,\dots,y_n)} {p(\theta_2, \tilde{\sigma}_2^2, y_1,\dots,y_n)}. \]
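A sketch of such a ratio computation, using the midge data and the hyperparameters adopted later in this section; the two parameter values being compared are arbitrary illustrations:

```r
## Unnormalized posterior = joint density p(theta, prec, y); ratios of
## these values equal ratios of posterior densities.
joint_dens <- function(theta, prec, y,
                       mu0 = 1.9, tau0_sq = 0.95^2, nu0 = 1, s20 = 0.01) {
  dnorm(theta, mu0, sqrt(tau0_sq)) *
    dgamma(prec, shape = nu0 / 2, rate = nu0 * s20 / 2) *
    prod(dnorm(y, theta, sd = 1 / sqrt(prec)))
}

y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)
ratio <- joint_dens(1.80, 60, y) / joint_dens(1.60, 60, y)
```

Since 1.80 is much closer to \(\bar y\) than 1.60, the ratio is far greater than one, even though neither density is normalized.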

5.3.2 Joint Distribution

The joint density can be written as \[ \begin{aligned} p(\theta, \tilde{\sigma}^2, y_1,\dots,y_n) &= p(\theta, \tilde{\sigma}^2)\, p(y_1,\dots,y_n \mid \theta, \tilde{\sigma}^2) \\ &= \text{Normal}(\theta \mid \mu_0, \tau_0^2) \times \text{Gamma}\!\left(\tilde{\sigma}^2 \mid \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right) \times \prod_{i=1}^n \text{Normal}\!\left(y_i \mid \theta, \frac{1}{\tilde{\sigma}^2}\right). \end{aligned} \]

All components of this joint density are standard distributions and therefore easy to evaluate numerically.
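One practical caveat: for larger \(n\), the product of many normal densities can underflow to zero in floating point, so in practice the joint density is usually evaluated on the log scale. A sketch:

```r
## Log joint density: sums of log densities avoid numerical underflow.
log_joint <- function(theta, prec, y,
                      mu0 = 1.9, tau0_sq = 0.95^2, nu0 = 1, s20 = 0.01) {
  dnorm(theta, mu0, sqrt(tau0_sq), log = TRUE) +
    dgamma(prec, shape = nu0 / 2, rate = nu0 * s20 / 2, log = TRUE) +
    sum(dnorm(y, theta, sd = 1 / sqrt(prec), log = TRUE))
}

y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)
log_joint(1.80, 60, y)
```

Grid probabilities can then be recovered stably by subtracting the maximum log value before exponentiating and normalizing.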

5.3.3 Discrete Posterior Approximation

A discrete approximation to the posterior distribution is obtained by constructing a grid over the parameter space and evaluating relative posterior probabilities on that grid.

Specifically:

  • choose grids \(\{\theta_1,\dots,\theta_G\}\) and \(\{\tilde{\sigma}_1^2,\dots,\tilde{\sigma}_H^2\}\) consisting of evenly spaced parameter values;

  • evaluate \(p(\theta_g, \tilde{\sigma}_h^2, y_1,\dots,y_n)\) for each grid point \((\theta_g, \tilde{\sigma}_h^2)\);

  • assign posterior probabilities proportional to these values:

\[ p(\theta_g, \tilde{\sigma}_h^2 \mid y_1,\dots,y_n) \;\propto\; p(\theta_g, \tilde{\sigma}_h^2, y_1,\dots,y_n). \]

This discrete approximation can then be normalized and used to compute posterior summaries such as means, variances, and credible regions.

5.3.4 Remarks

  • Discrete approximations are conceptually simple and transparent.
  • They are feasible only in low-dimensional parameter spaces.
  • For higher-dimensional models, simulation-based methods such as the Gibbs sampler become essential.

5.3.5 Discrete Approximations

Let \(\{\theta_1,\ldots,\theta_G\}\) and \(\{\tilde\sigma_1^2,\ldots,\tilde\sigma_H^2\}\) be discrete grids for \(\theta\) and the precision \(\tilde\sigma^2 = 1/\sigma^2\), respectively.
A discrete approximation to the joint posterior distribution is defined by

\[ p_D(\theta_k,\tilde\sigma_l^2 \mid y_1,\ldots,y_n) = \frac{p(\theta_k,\tilde\sigma_l^2 \mid y_1,\ldots,y_n)} {\sum_{g=1}^G \sum_{h=1}^H p(\theta_g,\tilde\sigma_h^2 \mid y_1,\ldots,y_n)}. \]

Using Bayes’ rule, this can be written equivalently as

\[ p_D(\theta_k,\tilde\sigma_l^2 \mid y_1,\ldots,y_n) = \frac{p(\theta_k,\tilde\sigma_l^2,y_1,\ldots,y_n)} {\sum_{g=1}^G \sum_{h=1}^H p(\theta_g,\tilde\sigma_h^2,y_1,\ldots,y_n)}. \]

This defines a valid joint probability distribution over \[ \theta \in \{\theta_1,\ldots,\theta_G\}, \qquad \tilde\sigma^2 \in \{\tilde\sigma_1^2,\ldots,\tilde\sigma_H^2\}, \] since the probabilities sum to one. In fact, if the joint prior distribution is discrete on this grid, then \(p_D\) is exactly the posterior distribution.
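The definition above is easy to verify numerically. Below is a sketch on a deliberately tiny grid, using the midge data and the hyperparameters from this chapter; it also shows how \(p_D\), once normalized, can be sampled from directly:

```r
## p_D on a small grid: normalize the joint density values, check that
## they sum to one, then draw (theta, prec) pairs from the discrete pmf.
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)
theta.grid <- seq(1.6, 2.0, length.out = 5)
prec.grid  <- seq(10, 150, length.out = 4)

pD <- outer(theta.grid, prec.grid, Vectorize(function(th, pr)
  dnorm(th, 1.9, 0.95) * dgamma(pr, shape = 1/2, rate = 0.01/2) *
    prod(dnorm(y, th, sd = 1 / sqrt(pr)))))
pD <- pD / sum(pD)                                   # now a valid pmf

set.seed(1)
idx   <- sample(length(pD), 1000, replace = TRUE, prob = pD)
draws <- cbind(theta = theta.grid[row(pD)[idx]],
               prec  = prec.grid[col(pD)[idx]])
```

Sampling grid cells with probabilities \(p_D\) yields Monte Carlo draws from the discrete approximation, foreshadowing the posterior summaries computed below.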

5.3.5.1 Application: midge data

We now apply this approximation to the midge data from the previous chapter.
Recall that the data consist of \(n=9\) observations with sample mean \(\bar y = 1.804\) and sample variance \(s^2 = 0.017\).

In the conjugate model, the prior variance of \(\theta\) is proportional to the sampling variance \(\sigma^2 / \kappa_0\). When the sampling variance is very small, this can undesirably force the prior uncertainty about \(\theta\) to be small as well. The semiconjugate prior removes this constraint.

Recall that we previously suggested a prior mean and standard deviation of \[ \mu_0 = 1.9, \qquad \tau_0 = 0.95, \] placing most of the prior mass on \(\theta > 0\).
For \(\sigma^2\), we take prior parameters \[ \nu_0 = 1, \qquad \sigma_0^2 = 0.01. \]

5.3.5.2 Grid-based approximation

The R implementation evaluates the joint posterior distribution \[ p(\theta,\tilde\sigma^2 \mid y_1,\ldots,y_n) \] on a \(100 \times 100\) grid of evenly spaced parameter values, with \[ \theta \in \{1.505, 1.510, \ldots, 1.995, 2.000\} \] and precision \[ \tilde\sigma^2 \in \{1.75, 3.50, \ldots, 173.25, 175.00\}. \]

The first panel of Figure 6.1 shows the resulting discrete approximation to the joint posterior distribution of \((\theta,\sigma^2)\).

Marginal and conditional posterior distributions can then be obtained by simple summation. For example, the marginal posterior distribution of \(\theta\) is

\[ p_D(\theta_k \mid y_1,\ldots,y_n) = \sum_{h=1}^H p_D(\theta_k,\tilde\sigma_h^2 \mid y_1,\ldots,y_n). \]

The resulting discrete approximations to the marginal posterior distributions of \(\theta\) and \(\sigma^2\) are shown in the second and third panels of Figure 6.1.

library(ggplot2)
library(dplyr)
library(patchwork)
library(scales)

## -----------------------------
## Data and semiconjugate prior
## -----------------------------
y  <- c(1.64,1.70,1.72,1.74,1.82,1.82,1.82,1.90,2.08)
n  <- length(y)

mu0     <- 1.9
tau0_sq <- 0.95^2
nu0     <- 1
s20     <- 0.01

## -----------------------------
## Grids: theta and precision  (tilde sigma^2 = 1/sigma^2)
## -----------------------------
G <- 100
H <- 100

mean.grid <- seq(1.505, 2.00, length.out = G)
prec.grid <- seq(1.75, 175,  length.out = H)   # this IS \tilde{\sigma}^2

post.grid <- matrix(NA_real_, nrow = G, ncol = H)

## -----------------------------
## Discrete joint posterior on the grid
## p(theta, prec | y) ∝ p(theta) p(prec) ∏ N(y_i | theta, 1/prec)
## -----------------------------
for (g in 1:G) {
  for (h in 1:H) {
    post.grid[g, h] <-
      dnorm(mean.grid[g], mu0, sqrt(tau0_sq)) *
      dgamma(prec.grid[h], shape = nu0/2, rate = s20*nu0/2) *
      prod(dnorm(y, mean.grid[g], sd = 1 / sqrt(prec.grid[h])))
  }
}
post.grid <- post.grid / sum(post.grid)

## -----------------------------
## Data frame + marginals
## -----------------------------
post_df <- expand.grid(theta = mean.grid, prec = prec.grid)
post_df$prob <- as.vector(post.grid)

theta_marg <- post_df %>%
  group_by(theta) %>%
  summarise(prob = sum(prob), .groups = "drop")

prec_marg <- post_df %>%
  group_by(prec) %>%
  summarise(prob = sum(prob), .groups = "drop")

## -----------------------------
## Plots (Figure 6.1 style)
## -----------------------------
p_joint <- ggplot(post_df, aes(theta, prec, fill = prob)) +
  geom_raster(interpolate = TRUE) +
  scale_fill_gradient(low = "white", high = "black",
                      trans = "sqrt", labels = label_number()) +
  labs(x = expression(theta),
       y = expression(tilde(sigma)^2)) +
  theme_classic() +
  theme(legend.position = "none")

p_theta <- ggplot(theta_marg, aes(theta, prob)) +
  geom_line(linewidth = 1) +
  labs(x = expression(theta),
       y = expression(p(theta~"|"~y[1:n]))) +
  theme_classic()

p_prec <- ggplot(prec_marg, aes(prec, prob)) +
  geom_line(linewidth = 1) +
  labs(x = expression(tilde(sigma)^2),
       y = expression(p(tilde(sigma)^2~"|"~y[1:n]))) +
  theme_classic()

p_joint + p_theta + p_prec
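Posterior summaries then follow by weighting grid values by their probabilities. A self-contained base-R sketch, recomputing the marginal of \(\theta\) on the same grid (the resulting values are grid approximations, not exact posterior quantities):

```r
## Posterior mean and variance of theta from the discrete approximation.
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)
theta.g <- seq(1.505, 2.00, length.out = 100)
prec.g  <- seq(1.75, 175,  length.out = 100)

joint <- outer(theta.g, prec.g, Vectorize(function(th, pr)
  dnorm(th, 1.9, 0.95) * dgamma(pr, shape = 1/2, rate = 0.01/2) *
    prod(dnorm(y, th, sd = 1 / sqrt(pr)))))
joint <- joint / sum(joint)

theta_marg_D <- rowSums(joint)                       # p_D(theta_k | y)
theta_mean_D <- sum(theta.g * theta_marg_D)          # grid posterior mean
theta_var_D  <- sum((theta.g - theta_mean_D)^2 * theta_marg_D)
```

Because the data dominate this weak prior, the grid posterior mean of \(\theta\) lands close to the sample mean \(\bar y = 1.804\).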

5.4 Markov Chain Monte Carlo

TBA


This chapter borrows material from Chapter 4 of Hoff (2009) and Chapter 4 of the Computational Methods in Statistics course.