understand why posterior approximation is needed beyond conjugate models
learn how to approximate posteriors using discrete grids and Gibbs sampling
understand how Gibbs sampling uses full conditional distributions to generate dependent posterior samples
5.1 Introduction
For many multi-parameter Bayesian models, the joint posterior distribution does not belong to a standard family (e.g., the exponential family) and is therefore difficult to sample from directly. However, it is often the case that sampling from the full conditional distribution of each parameter is straightforward.
In such situations, posterior approximation can be carried out using the Gibbs sampler, an iterative Markov chain Monte Carlo (MCMC) algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution. Here, we introduce the Gibbs sampler in the context of the normal model with a semi-conjugate prior, and study how well it approximates the posterior distribution.
5.2 A Semi-conjugate Prior Distribution
For the normal model, we may model our uncertainty about the population mean \(\theta\) as depending on the sampling variance \(\sigma^2\) via \[
\theta \mid \sigma^2 \sim \text{Normal}\!\left(\mu_0, \frac{\sigma^2}{\kappa_0}\right).
\]
This formulation ties the prior variance of \(\theta\) to the sampling variability of the data, and \(\mu_0\) can be interpreted as the mean of \(\kappa_0\) prior observations from the population.
In some settings this dependence is reasonable, but in others we may wish to specify prior uncertainty about \(\theta\) independently of \(\sigma^2\), so that \[
p(\theta, \sigma^2) = p(\theta)\,p(\sigma^2).
\]
One such specification is the following semiconjugate prior distribution: \[
\theta \sim \text{Normal}(\mu_0, \tau_0^2),
\qquad
\frac{1}{\sigma^2} \sim \text{Gamma}\!\left(\frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right).
\]
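Because \(\theta\) and \(\sigma^2\) are independent under this prior, it is easy to simulate from it directly. A minimal Python sketch, using the prior values suggested later in this chapter for the midge data (\(\mu_0 = 1.9\), \(\tau_0 = 0.95\), \(\nu_0 = 1\), \(\sigma_0^2 = 0.01\)); variable names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior settings from the midge example later in the chapter.
mu0, tau0_sq = 1.9, 0.95**2
nu0, s20 = 1.0, 0.01

S = 10_000
# theta ~ Normal(mu0, tau0^2), independent of sigma^2
theta = rng.normal(mu0, np.sqrt(tau0_sq), size=S)
# 1/sigma^2 ~ Gamma(nu0/2, rate = nu0*s20/2); NumPy's gamma uses a scale
# parameter, so the rate is inverted.
prec = rng.gamma(nu0 / 2, 2 / (nu0 * s20), size=S)
sigma2 = 1.0 / prec
```

Note that the draws of `theta` and `sigma2` are generated separately, which is exactly the independence \(p(\theta, \sigma^2) = p(\theta)\,p(\sigma^2)\) stated above.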
Then, conditional on \(\sigma^2\) and the observed data \(y_1, \dots, y_n\), the posterior distribution of \(\theta\) is \[
\theta \mid \sigma^2, y_1, \dots, y_n
\;\sim\;
\text{Normal}(\mu_n, \tau_n^2),
\] where \[
\mu_n
=
\frac{\mu_0 / \tau_0^2 + n \bar{y} / \sigma^2}
{1 / \tau_0^2 + n / \sigma^2},
\qquad
\tau_n^2
=
\left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right)^{-1}.
\]
This conditional posterior distribution will form one step of the Gibbs sampler.
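The update formulas for \(\mu_n\) and \(\tau_n^2\) translate directly into code. A minimal Python sketch; the function name and interface are our own, not from the text:

```python
import numpy as np

def theta_full_conditional(sigma2, y, mu0, tau0_sq):
    """Parameters (mu_n, tau_n^2) of p(theta | sigma^2, y) in the
    semiconjugate normal model."""
    n, ybar = len(y), float(np.mean(y))
    # Posterior precision is the sum of prior and data precisions.
    tau_n_sq = 1.0 / (1.0 / tau0_sq + n / sigma2)
    # Posterior mean is the precision-weighted average of mu0 and ybar.
    mu_n = tau_n_sq * (mu0 / tau0_sq + n * ybar / sigma2)
    return mu_n, tau_n_sq

# One Gibbs-style draw of theta given the current sigma^2:
rng = np.random.default_rng(0)
mu_n, tau_n_sq = theta_full_conditional(0.02, [1.7, 1.8, 1.9], 1.9, 0.95**2)
theta_draw = rng.normal(mu_n, np.sqrt(tau_n_sq))
```

With a very diffuse prior (\(\tau_0^2\) large), \(\mu_n\) approaches \(\bar{y}\) and \(\tau_n^2\) approaches \(\sigma^2 / n\), as the formulas suggest.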
Key Takeaway
The joint posterior of \((\theta, \sigma^2)\) is not available in closed form.
The full conditional distributions of \(\theta \mid \sigma^2, y\) and \(\sigma^2 \mid \theta, y\) are available in standard forms.
This structure makes the Gibbs sampler a natural and efficient tool for posterior approximation.
In the next section, we derive the full conditional distribution of \(\sigma^2\) and combine the two conditional updates into a complete Gibbs sampling algorithm.
In the conjugate case where \(\tau_0^2\) is proportional to \(\sigma^2\), we showed that the marginal posterior distribution \[
p(\sigma^2 \mid y_1,\ldots,y_n)
\] is an inverse-gamma distribution. In this setting, Monte Carlo samples of \((\theta,\sigma^2)\) from the joint posterior distribution can be obtained by the following two-step procedure:
Sample \[
\sigma^{2(s)} \sim p(\sigma^2 \mid y_1,\ldots,y_n),
\] which is an inverse-gamma distribution.
Sample \[
\theta^{(s)} \sim p(\theta \mid \sigma^{2(s)}, y_1,\ldots,y_n),
\] which is a normal distribution.
This approach works because both the marginal posterior of \(\sigma^2\) and the conditional posterior of \(\theta\) are standard distributions that are easy to sample from.
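The two-step procedure above can be sketched in a few lines of Python. The posterior hyperparameter values below (\(\nu_n\), \(\sigma_n^2\), \(\mu_n\), \(\kappa_n\)) are placeholders for illustration only, not derived from any data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior hyperparameters, for illustration only.
nu_n, sigma2_n = 10.0, 0.02   # p(sigma^2 | y) is inverse-gamma(nu_n/2, nu_n*sigma2_n/2)
mu_n, kappa_n = 1.8, 10.0     # p(theta | sigma^2, y) is Normal(mu_n, sigma^2/kappa_n)

S = 5000
# Step 1: sigma^2 ~ inverse-gamma, drawn by sampling the precision from a
# gamma distribution and inverting (NumPy's gamma takes a scale = 1/rate).
sigma2 = 1.0 / rng.gamma(nu_n / 2, 2 / (nu_n * sigma2_n), size=S)
# Step 2: theta | sigma^2 ~ normal, one draw per sampled sigma^2.
theta = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))
```

Each pair \((\theta^{(s)}, \sigma^{2(s)})\) is an independent draw from the joint posterior, so ordinary Monte Carlo summaries apply.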
However, when \(\tau_0^2\) is not proportional to \(\sigma^2\), the marginal posterior distribution of the precision \[
\frac{1}{\sigma^2}
\] is not a gamma distribution, nor any other standard distribution from which we can easily sample. As a result, direct Monte Carlo sampling from the marginal posterior is no longer straightforward, motivating the need for alternative approximation methods.
5.3 Discrete Approximations
5.3.1 Posterior Density Ratios
Let \[
\tilde{\sigma}^2 = \frac{1}{\sigma^2}
\] denote the precision. Recall that the posterior distribution of \((\theta, \tilde{\sigma}^2)\) is equal to the joint distribution \[
p(\theta, \tilde{\sigma}^2, y_1,\dots,y_n),
\] divided by \(p(y_1,\dots,y_n)\), which does not depend on the parameters. Therefore, the relative posterior probabilities of two parameter values \((\theta_1, \tilde{\sigma}_1^2)\) and \((\theta_2, \tilde{\sigma}_2^2)\) are directly computable: \[
\frac{p(\theta_1, \tilde{\sigma}_1^2 \mid y_1,\dots,y_n)}
{p(\theta_2, \tilde{\sigma}_2^2 \mid y_1,\dots,y_n)}
=
\frac{p(\theta_1, \tilde{\sigma}_1^2, y_1,\dots,y_n)}
{p(\theta_2, \tilde{\sigma}_2^2, y_1,\dots,y_n)}.
\]
5.3.2 Joint Distribution
The joint density can be written as \[
\begin{aligned}
p(\theta, \tilde{\sigma}^2, y_1,\dots,y_n)
&= p(\theta, \tilde{\sigma}^2)\, p(y_1,\dots,y_n \mid \theta, \tilde{\sigma}^2) \\
&= \text{Normal}(\theta \mid \mu_0, \tau_0^2)
\times \text{Gamma}\!\left(\tilde{\sigma}^2 \mid \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right)
\times \prod_{i=1}^n \text{Normal}\!\left(y_i \mid \theta, \frac{1}{\tilde{\sigma}^2}\right).
\end{aligned}
\]
All components of this joint density are standard distributions and therefore easy to evaluate numerically.
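Since every factor is standard, the log joint density can be evaluated in a few lines of Python. This sketch (the function name and argument order are our own) also shows how a posterior density ratio follows by exponentiating a difference of log joints, with the unknown normalizing constant cancelling:

```python
import numpy as np
from math import lgamma, log, pi

def log_joint(theta, prec, y, mu0, tau0_sq, nu0, s20):
    """log p(theta, prec, y) for the semiconjugate model, with prec = 1/sigma^2."""
    a, b = nu0 / 2.0, nu0 * s20 / 2.0
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Normal(theta | mu0, tau0^2) prior
    lp = -0.5 * (log(2 * pi * tau0_sq) + (theta - mu0) ** 2 / tau0_sq)
    # Gamma(prec | nu0/2, nu0*s20/2) prior on the precision
    lp += a * log(b) - lgamma(a) + (a - 1) * log(prec) - b * prec
    # Normal likelihood of y given theta and variance 1/prec
    lp += 0.5 * n * log(prec / (2 * pi)) - 0.5 * prec * ((y - theta) ** 2).sum()
    return lp

# Relative posterior probability of two parameter values:
ratio = np.exp(
    log_joint(1.8, 50.0, [1.7, 1.9], 1.9, 0.9, 1.0, 0.01)
    - log_joint(1.6, 50.0, [1.7, 1.9], 1.9, 0.9, 1.0, 0.01)
)
```

Working on the log scale avoids numerical underflow when the likelihood involves many observations.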
5.3.3 Discrete Posterior Approximation
A discrete approximation to the posterior distribution is obtained by constructing a grid over the parameter space and evaluating relative posterior probabilities on that grid.
Specifically:
choose grids \(\{\theta_1,\dots,\theta_G\}\) and \(\{\tilde{\sigma}_1^2,\dots,\tilde{\sigma}_H^2\}\) consisting of evenly spaced parameter values;
evaluate \(p(\theta_g, \tilde{\sigma}_h^2, y_1,\dots,y_n)\) for each grid point \((\theta_g, \tilde{\sigma}_h^2)\);
assign posterior probabilities proportional to these values, normalized so that they sum to one.
This discrete approximation can then be normalized and used to compute posterior summaries such as means, variances, and credible regions.
5.3.4 Remarks
Discrete approximations are conceptually simple and transparent.
They are feasible only in low-dimensional parameter spaces.
For higher-dimensional models, simulation-based methods such as the Gibbs sampler become essential.
5.3.5 Discrete Approximations
Let \(\{\theta_1,\ldots,\theta_G\}\) and \(\{\tilde\sigma_1^2,\ldots,\tilde\sigma_H^2\}\) be discrete grids for \(\theta\) and \(\sigma^2\), respectively.
A discrete approximation to the joint posterior distribution is defined by \[
p_D(\theta_g, \tilde\sigma_h^2 \mid y_1,\ldots,y_n)
=
\frac{p(\theta_g, \tilde\sigma_h^2, y_1,\ldots,y_n)}
{\sum_{g'=1}^{G} \sum_{h'=1}^{H} p(\theta_{g'}, \tilde\sigma_{h'}^2, y_1,\ldots,y_n)}.
\]
This defines a valid joint probability distribution over \[
\theta \in \{\theta_1,\ldots,\theta_G\}, \qquad
\sigma^2 \in \{\tilde\sigma_1^2,\ldots,\tilde\sigma_H^2\},
\] since the probabilities sum to one. In fact, if the joint prior distribution is discrete on this grid, then \(p_D\) is exactly the posterior distribution.
5.3.5.1 Application: midge data
We now apply this approximation to the midge data from the previous chapter.
Recall that the data consist of \(n=9\) observations with sample mean \(\bar y = 1.804\) and sample variance \(s^2 = 0.017\).
In the conjugate model, the prior variance of \(\theta\) is proportional to the sampling variance \(\sigma^2 / \kappa_0\). When the sampling variance is very small, this can undesirably force the prior uncertainty about \(\theta\) to be small as well. The semiconjugate prior removes this constraint.
Recall that we previously suggested a prior mean and standard deviation of \[
\mu_0 = 1.9, \qquad \tau_0 = 0.95,
\] placing most of the prior mass on \(\theta > 0\).
For \(\sigma^2\), we take prior parameters \[
\nu_0 = 1, \qquad \sigma_0^2 = 0.01.
\]
5.3.5.2 Grid-based approximation
The R implementation evaluates the joint posterior distribution \[
p(\theta,\sigma^2 \mid y_1,\ldots,y_n)
\] on a \(100 \times 100\) grid of evenly spaced parameter values, with \[
\theta \in \{1.505, 1.510, \ldots, 1.995, 2.00\}
\] and \[
\tilde\sigma^2 = \frac{1}{\sigma^2} \in \{1.75, 3.5, \ldots, 173.25, 175.0\}.
\]
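This grid evaluation can be sketched in Python using the summary statistics given above (\(n = 9\), \(\bar y = 1.804\), \(s^2 = 0.017\)) and the priors for the midge data. The code works on the log scale with the precision grid, keeps only terms that vary over the grid (constants cancel after normalization), and uses the identity \(\sum_i (y_i - \theta)^2 = (n-1)s^2 + n(\bar y - \theta)^2\):

```python
import numpy as np

# Summary statistics and prior settings from the text.
n, ybar, s2 = 9, 1.804, 0.017
mu0, tau0_sq, nu0, s20 = 1.9, 0.95**2, 1.0, 0.01

theta_g = np.linspace(1.505, 2.0, 100)   # grid for theta, spacing 0.005
prec_h = np.linspace(1.75, 175.0, 100)   # grid for the precision 1/sigma^2

T, P = np.meshgrid(theta_g, prec_h, indexing="ij")

# log p(theta, prec, y) up to an additive constant:
# normal prior on theta + gamma prior on the precision + normal likelihood.
logpost = (
    -0.5 * (T - mu0) ** 2 / tau0_sq
    + (nu0 / 2 - 1) * np.log(P) - (nu0 * s20 / 2) * P
    + 0.5 * n * np.log(P) - 0.5 * P * ((n - 1) * s2 + n * (ybar - T) ** 2)
)

# Normalize on the grid: subtract the max for stability, exponentiate, rescale.
post = np.exp(logpost - logpost.max())
post /= post.sum()                        # discrete approximation p_D

theta_marg = post.sum(axis=1)             # marginal of theta by summing over prec
theta_mean = (theta_g * theta_marg).sum() # posterior summary from the grid
```

Marginals for the precision follow the same way with `post.sum(axis=0)`, and credible intervals can be read off the cumulative sums of the marginal weights.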
The first panel of Figure 6.1 shows the resulting discrete approximation to the joint posterior distribution of \((\theta,\sigma^2)\).
Marginal and conditional posterior distributions can then be obtained by simple summation. For example, the marginal posterior distribution of \(\theta\) is \[
p_D(\theta_g \mid y_1,\ldots,y_n)
=
\sum_{h=1}^{H} p_D(\theta_g, \tilde\sigma_h^2 \mid y_1,\ldots,y_n).
\]
The resulting discrete approximations to the marginal posterior distributions of \(\theta\) and \(\sigma^2\) are shown in the second and third panels of Figure 6.1.