Monte Carlo (MC) integration is a simulation-based method for approximating integrals using random sampling.
In numerical integration, methods such as the trapezoidal rule use a deterministic approach. MC integration, on the other hand, employs a non-deterministic approach: each realization provides a different outcome.
4.1 What is Monte Carlo Simulation?
4.1.1 History
Monte Carlo (MC) is a casino in Monaco, famous for its gambling and games of chance. The term “Monte Carlo” was coined in the 1940s by the physicist Stanislaw Ulam while he was working on nuclear weapons projects at Los Alamos.
Figure: the Monte Carlo Casino (image from Wikipedia)
Monte Carlo methods are mainly used in three distinct problem classes:
optimization
numerical integration
generating draws from a probability distribution.
They can also be used to model phenomena with significant uncertainty in inputs, such as calculating the risk of a nuclear power plant failure. Monte Carlo methods are often implemented using computer simulations, and they can provide approximate solutions to problems that are otherwise intractable or too complex to analyze mathematically.
4.1.2 Key Steps of MC methods
Monte Carlo methods vary, but tend to follow a particular pattern:
Define a domain of possible inputs.
Generate inputs randomly from a probability distribution over the domain.
Perform a deterministic computation of the outputs.
Aggregate the results.
For example, consider a quadrant (circular sector) inscribed in a unit square. Given that the ratio of their areas is \(\pi/4\), the value of \(\pi\) can be approximated using the Monte Carlo method:
Draw a square, then inscribe a quadrant within it.
Uniformly scatter a given number of points over the square.
Count the number of points inside the quadrant, i.e. having a distance from the origin of less than 1.
The ratio of the inside-count and the total-sample-count is an estimate of the ratio of the two areas, π/4. Multiply the result by 4 to estimate π.
Figure: Monte Carlo estimation of π (image from Wikipedia)
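The following R sketch illustrates these steps (the number of points and the seed are arbitrary choices, not taken from the text):
set.seed(777)
n <- 1E5
# scatter n points uniformly over the unit square
x <- runif(n)
y <- runif(n)
# a point lies inside the quadrant if its distance from the origin is less than 1
inside <- x^2 + y^2 < 1
# the proportion of hits estimates pi/4, so multiply by 4
pi_hat <- 4 * mean(inside)
pi_hat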
4.2 Basic Monte Carlo Integration
To approximate the integral of a function \(g(x)\) over an interval \([a, b]\), we express the integral as an expectation and estimate that expectation by a sample mean. We begin with the unit interval.
Consider the problem of estimating \(\theta = \int_0^1 g(x)dx\). If \(X_1,\dots , X_m\sim\operatorname{Unif}(0,1)\), then the MC estimator is given by:
\[\hat{\theta}=\bar{g}_m(X)=\frac{1}{m}\sum_{i=1}^m g(X_i).\] By the Strong Law of Large Numbers (SLLN), \(\bar{g}_m(X)\) converges to \(\mathbb{E}[g(X)]=\theta\) as \(m\to\infty\) with probability 1. The simple MC estimator is also unbiased, i.e., \(\mathbb{E}[\bar{g}_m(X)]=\theta\).
Compute an MC estimate of \[
\theta = \int_0^1 \exp(-x)dx,
\] and compare the estimate with the theoretical value.
library(tibble)
set.seed(777)
n <- 1E3
x <- runif(n)
# simulated estimator
theta_hat <- exp(-x) |> mean()
# theoretical value
theta_true <- 1 - exp(-1)
# put them in a tibble
(results <- tibble(
  Method = c("Simulated", "Theoretical"),
  Value = c(theta_hat, theta_true)
))
# A tibble: 2 × 2
Method Value
<chr> <dbl>
1 Simulated 0.641
2 Theoretical 0.632
To simulate \(\int_a^b g(t)dt\), use a change of variables so that the limits of integration become \(0\) and \(1\). This can be done through a linear transformation of the variable \(t\): \(y:=\frac{t-a}{b-a}\). Then \(t=a+(b-a)y\) and \(dt=(b-a)dy\). Thus, we have \[
\int_a^b g(t)dt = (b-a)\int_0^1 g(a+(b-a)y)dy.
\] Alternatively, instead of \(\operatorname{Unif}(0,1)\), we can use another density \(f\) supported on \([a, b]\). One instance is \(f(t)=\frac{1}{b-a}\), the \(\operatorname{Unif}(a,b)\) density, which gives \[
\int_a^b g(t)\,dt = \int_a^b \frac{g(t)}{f(t)}\,f(t)\,dt = (b-a) \int_a^b g(t)\,\frac{1}{b-a}\,dt.
\] This is \((b-a)\) times the expectation of \(g(X)\), where \(X\sim \operatorname{Unif}(a,b)\). Therefore, the integral can be estimated by averaging \(g(\cdot)\) over a sample from \(\operatorname{Unif}(a,b)\) and multiplying by \(b-a\).
Compute an MC estimate of \[
\theta = \int_2^4 \exp(-x)dx,
\] and compare the estimate with the exact value of the integral.
library(tibble)
set.seed(777)
m <- 1E3
x <- runif(m, min = 2, max = 4)
# simulated estimator
theta_hat <- (exp(-x) |> mean()) * (4 - 2)
# theoretical value
theta_true <- exp(-2) - exp(-4)
# put into tibble
(results <- tibble(
  Method = c("Simulated", "Theoretical"),
  Value = c(theta_hat, theta_true)
))
# A tibble: 2 × 2
Method Value
<chr> <dbl>
1 Simulated 0.120
2 Theoretical 0.117
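For comparison, here is a minimal sketch of the change-of-variables approach above applied to the same integral, averaging \(g(a+(b-a)y)\) over \(y\sim\operatorname{Unif}(0,1)\) and multiplying by \(b-a\) (the seed is an arbitrary choice; the estimate should land near the values above):
set.seed(777)
m <- 1E3
a <- 2; b <- 4
y <- runif(m)                                      # y ~ Unif(0, 1)
# theta = (b - a) * E[ g(a + (b - a)Y) ] with g(t) = exp(-t)
theta_hat2 <- (b - a) * mean(exp(-(a + (b - a) * y)))
theta_hat2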
To summarize, the simple Monte Carlo estimator of the integral \(\theta = \int_a^b g(x)dx\) is computed by generating \(x_1,\dots , x_m\sim\operatorname{Unif}(a,b)\) and taking \(\hat{\theta}=(b-a)\,\frac{1}{m}\sum_{i=1}^m g(x_i)\). As a further example, consider estimating the standard normal cdf
\[
\Phi(x)=\int_{-\infty}^x \frac{1}{\sqrt{2 \pi}} e^{-t^2 / 2} d t
\]
Note that we cannot apply the algorithm above directly, because the limits of integration cover an unbounded interval. However, we can break the problem into two cases, (i) \(x \ge 0\) and (ii) \(x < 0\), and use the symmetry of the normal density to handle the second case. The problem is then to estimate \(\theta = \int_0^x e^{-t^2/2}\, dt\) for \(x > 0\). This could be done by generating random \(\operatorname{Unif}(0,x)\) numbers, but it would mean changing the parameters of the uniform distribution for each value of the cdf required. Suppose that we prefer an algorithm that always samples from \(\operatorname{Unif}(0,1)\). This can be accomplished by a change of variables. Making the substitution \(y = t/x\), we have \(dt = x\, dy\) and
\[
\theta=\int_0^1 x e^{-(x y)^2 / 2} d y
\] Thus, \(\theta = \mathbb{E}_Y[x\exp(-(xY)^2/2)]\), where the rv \(Y\sim \operatorname{Unif}(0,1)\). Generate iid \(\operatorname{Unif}(0,1)\) random numbers \(u_1,\dots,u_m\), and compute \[\hat{\theta}=\overline{g_m(u)}=\frac{1}{m} \sum_{i=1}^m x e^{-\left(u_i x\right)^2 / 2}.
\]
The sample mean \(\hat{\theta}\) converges to \(\mathbb{E}[\hat{\theta}] = \theta\) as \(m\to \infty\). If \(x > 0\), the estimate of \(\Phi(x)\) is \(1/2 + \hat{\theta}/\sqrt{2\pi}\). If \(x < 0\), compute \(\Phi(x) = 1 - \Phi(-x)\).
x <- seq(.1, 2.5, length = 10)
m <- 10000
u <- runif(m)
cdf <- numeric(length(x))
for (i in 1:length(x)) {
  g <- x[i] * exp(-(u * x[i])^2 / 2)
  cdf[i] <- mean(g) / sqrt(2 * pi) + 0.5
}
Phi <- pnorm(x)
print(round(rbind(x, cdf, Phi), 3))
Notice that it would have been simpler to generate \(\operatorname{Unif}(0, x)\) random variables and skip the transformation. In fact, the integrand of the previous example is itself a density function, and we can generate random variables from this density. This provides a more direct approach to estimating the integral.
Let \(I(\cdot)\) be the indicator function, and \(Z\sim N(0,1)\). Then for any constant \(x\) we have \(\mathbb{E}[I(Z \le x)] = P(Z \le x) = \Phi(x)\), the standard normal cdf evaluated at \(x\).
Generate a random sample \(z_1, \dots , z_m\sim N(0,1)\). Then the sample mean and its expected value are \[\hat{\theta} = \frac{1}{m} \sum_{i=1}^m I(z_i \le x)\] and \[\mathbb{E}[\hat{\theta}] = P(Z \le x) = \Phi(x).\]
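A minimal R sketch of this indicator-based estimator, reusing the grid of \(x\) values and sample size from the previous example (the seed is an arbitrary choice):
set.seed(777)
x <- seq(.1, 2.5, length = 10)
m <- 10000
z <- rnorm(m)                                 # z_1, ..., z_m ~ N(0, 1)
# for each x, the estimate is the proportion of z_i at or below x
cdf <- sapply(x, function(xi) mean(z <= xi))
Phi <- pnorm(x)
print(round(rbind(x, cdf, Phi), 3))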
In this example, compared with the previous example, it appears that we have better agreement with pnorm() in the upper tail, but worse agreement near the center.
Note
Summarizing, if \(f(x)\) is a probability density function supported on a set \(A\) (that is, \(f(x) \ge 0\) for all \(x\in\mathbb{R}\) and \(\int_A f(x)\,dx = 1\)), then to estimate the integral \[\theta = \int_A g(x)f(x)\,dx,\] generate a random sample \(x_1,\dots,x_m\) from the distribution with density \(f(x)\), and compute the sample mean \[\hat{\theta} = \frac{1}{m}\sum_{i=1}^m g(x_i).\] Then \(\hat{\theta}\overset{p}{\to}\theta\) as \(m\to\infty\).
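For instance, to estimate \(\theta = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \mathbb{E}[X^2] = 1\), where \(f\) is the standard normal density (an illustrative example, not one from the text):
set.seed(777)
m <- 1E4
x <- rnorm(m)            # random sample from the density f (standard normal)
theta_hat <- mean(x^2)   # sample mean of g(x) = x^2
theta_hat                # should be close to the true value, 1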
Next, consider the standard error of \(\hat{\theta} = \frac{1}{m}\sum_{i=1}^m g(x_i)\).
Recall that \(\mathbb{V}ar\{\hat{\theta}\}=\sigma^2/m\), where \(\sigma^2 = \mathbb{V}ar\{g(X)\}\). When the distribution of \(X\) is unknown, we substitute for \(F_X\) the empirical distribution \(F_m\) of the sample \(x_1, \dots , x_m\). The variance of \(\hat{\theta}\) can then be estimated by \[
\frac{\hat{\sigma}^2}{m} = \frac{1}{m^2}\sum_{i=1}^m [g(x_i) - \bar{g}(x)]^2.
\]
Note that \[\frac{1}{m}\sum_{i=1}^m[g(x_i)-\bar{g}(x)]^2\] is the plug-in estimate of \(\mathbb{V}ar\{g(X)\}\). That is, it is the variance of \(U\), where \(U\) is uniformly distributed on the set of replicates \(\{g(x_i)\}\). The corresponding estimated standard error of \(\hat{\theta}\) is \[
\widehat{se}(\hat{\theta}) = \frac{\hat{\sigma}}{\sqrt{m}} = \frac{1}{m}\left\{\sum_{i=1}^m [ g(x_i)- \bar{g}(x)]^2\right\}^{1/2}.
\] The CLT implies \[
\frac{\hat{\theta}-E[\hat{\theta}]}{\sqrt{\operatorname{Var} \hat{\theta}}} \overset{D}{\to} N(0,1),
\] as \(m\to\infty\). Hence, if \(m\) is sufficiently large, \(\hat{\theta}\) is approximately normal with mean \(\theta\). The large-sample, approximately normal distribution of \(\hat{\theta}\) can be used to put confidence limits or error bounds on the MC estimate of the integral, and to check for convergence.
Compute the 95% confidence interval for \(\Phi(2)\) and \(\Phi(2.5)\).
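A sketch of one way to do this for \(\Phi(2)\), using the indicator estimator and the plug-in variance estimate above (the seed is arbitrary, so the exact numbers will differ slightly from those quoted below):
set.seed(777)
m <- 10000
z <- rnorm(m)
g <- as.numeric(z <= 2)               # Bernoulli outcomes I(z_i <= 2)
theta_hat <- mean(g)                  # point estimate of Phi(2)
v_hat <- mean((g - theta_hat)^2) / m  # plug-in estimate of Var(theta_hat)
# approximate 95% confidence interval from the CLT
theta_hat + c(-1, 1) * 1.96 * sqrt(v_hat)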
The probability \(P(I(Z \le 2) = 1)\) is \(\Phi(2)\approx 0.977\). Here \(\hat{\theta}\) is the sample proportion of 1’s in \(m = 10000\) Bernoulli trials with success probability \(p\approx 0.977\), so the variance of \(\hat{\theta}\) is \((0.977)(1 - 0.977)/10000 \approx 2.223\times 10^{-6}\). The MC estimate of the variance, \(2.228\times 10^{-6}\), is quite close to this value.
Q: What about \(\Phi(2.5)\)?
4.3 Variance and Efficiency
We have seen that an MC approach to estimating the integral \(\int_a^b g(x)dx\) is to represent the integral as the expected value of a function of a uniform random variable. That is, suppose \(X\sim \operatorname{Unif}(a,b)\); then \(f(x) = (b-a)^{-1}\) for \(x\in[a,b]\) and 0 otherwise, and \[
\begin{aligned}
\theta & =\int_a^b g(x) d x \\
& =(b-a) \int_a^b g(x) \frac{1}{b-a} d x=(b-a) E[g(X)]
\end{aligned}
\]
Recall that the sample-mean MC estimator of the integral \(\theta\) is \(\hat{\theta}=(b-a)\bar{g}(X)\), where \(\bar{g}(X)=\frac{1}{m}\sum_{i=1}^m g(X_i)\) and \(X_1,\dots,X_m\sim\operatorname{Unif}(a,b)\).
The sample mean \(\bar{g}(X)\) has expected value \(\mathbb{E}[g(X)]=\theta/(b-a)\) and variance \(\mathbb{V}ar\{\bar{g}(X)\}=\mathbb{V}ar\{g(X)\}/m\). By the CLT, \(\bar{g}(X)\) is approximately normal for large \(m\). Therefore, the MC estimator \(\hat{\theta}=(b-a)\bar{g}(X)\) is approximately normal with mean \(\theta\) and variance \((b-a)^2\,\mathbb{V}ar\{g(X)\}/m\).
The “hit-or-miss” approach to MC integration also uses a sample mean to estimate the integral, but the sample mean is taken over a different sample and therefore this estimator has a different variance than the one we have above.
Figure: hit-or-miss Monte Carlo integration (image from UBC)
Suppose \(f(x)\) is the density of a random variable \(X\).
The “hit-or-miss” approach to estimating \[
F(x) = \int_{-\infty}^x f(t)\,dt
\]
is as follows:
Generate a random sample \(X_1,\dots,X_m\) from the distribution of \(X\).
For each observation compute \(Y_i = g(X_i) := I(X_i \le x)\). Note that the random variable \(Y = g(X)\) has the \(\text{Binomial}(1, p)\) distribution, where the success probability is \[
p := P(X \leq x) = F(x).
\]
The transformed sample \(Y_1, \ldots, Y_m\) consists of the outcomes of \(m\) independent, identically distributed Bernoulli trials.
The estimator \(\hat{F}(x)\) is the sample proportion \[
\hat{p} = \frac{y}{m},
\]
where \(y\) is the total number of successes observed in \(m\) trials.
The variance of \(\hat{F}(x)\) can be estimated by \[
\frac{\hat{p}(1-\hat{p})}{m} = \frac{\hat{F}(x)\big(1-\hat{F}(x)\big)}{m}.
\]
The maximum variance occurs when \(F(x) = \tfrac{1}{2}\), so a conservative estimate of the variance of \(\hat{F}(x)\) is \[
\frac{1}{4m}.
\]
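As a brief sketch, here is the hit-or-miss estimator of \(F(1)=1-e^{-1}\) for \(X\sim\operatorname{Exp}(1)\), together with its estimated variance and the conservative bound (the distribution, the point \(x=1\), and the seed are illustrative choices, not from the text):
set.seed(777)
m <- 1E4
x <- 1
X <- rexp(m, rate = 1)               # random sample from the distribution of X
F_hat <- mean(X <= x)                # hit-or-miss estimate of F(x)
var_hat <- F_hat * (1 - F_hat) / m   # estimated variance of F_hat(x)
var_max <- 1 / (4 * m)               # conservative bound, attained when F(x) = 1/2
c(F_hat = F_hat, true = 1 - exp(-1), var_hat = var_hat, var_max = var_max)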
4.3.1 Efficiency
Suppose we have two (unbiased) estimators \(\hat{\theta}_1\) and \(\hat{\theta}_2\) for \(\theta\). We say \(\hat{\theta}_1\) is statistically more efficient than \(\hat{\theta}_2\) if \[\mathbb{V}ar(\hat{\theta}_1) < \mathbb{V}ar(\hat{\theta}_2).\] What if the variance of \(\hat{\theta}_i\) is unknown or hard to calculate?
We can substitute the sample estimate of the variance for each estimator.
Note that the variance can always be reduced by increasing the number of replicates, so computational efficiency is also relevant.
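As an illustration (an assumed example, not one from the text), the sample-mean and hit-or-miss estimators of \(\theta=\int_0^1 e^{-x}dx\) seen earlier can be compared by estimating their variances from replicated simulations:
set.seed(777)
m <- 1E3     # sample size used by each estimate
R <- 1000    # number of replicated estimates
theta1 <- theta2 <- numeric(R)
for (r in 1:R) {
  u <- runif(m)
  theta1[r] <- mean(exp(-u))   # sample-mean estimator of int_0^1 exp(-x) dx
  X <- rexp(m, rate = 1)
  theta2[r] <- mean(X <= 1)    # hit-or-miss estimator of the same integral
}
# the (unbiased) estimator with the smaller variance is the more efficient one
c(var_sample_mean = var(theta1), var_hit_or_miss = var(theta2))
Both estimators are unbiased for \(\theta=1-e^{-1}\), so the comparison reduces to their variances; in this setting the sample-mean estimator should show the smaller one.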
Reference used:
Chapter 6 of Rizzo, M. L. (2007). Statistical Computing with R. CRC Press, Boca Raton.