8  Density Estimation

Up to now, we have discussed statistical methods for estimating parameters, testing hypotheses, and making predictions from observed data. In many real-world applications, however, we need to estimate the underlying probability distribution of a dataset without assuming a specific parametric form. This is where density estimation comes into play.

Density estimation is a fundamental technique in statistics for estimating the probability density function (PDF) of a random variable directly from data. Keep in mind that a density does not always exist; throughout this chapter we assume the underlying distribution is continuous, so that a PDF is well defined.

There are two main types of density estimation methods:

  1. Parametric Density Estimation: In this approach, we assume that the data follow a specific parametric distribution (e.g., normal, exponential, etc.) and estimate the parameters of that distribution using methods such as Maximum Likelihood Estimation (MLE) or the Method of Moments.

  2. Non-Parametric Density Estimation: This approach does not assume any specific parametric form for the underlying distribution. Instead, it estimates the density directly from the data using techniques such as Kernel Density Estimation (KDE) or Histogram-based methods.
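
To make the contrast concrete, here is a minimal sketch of both approaches on the same sample (for illustration we assume the data really do come from a normal distribution; the variable names are ours):

# Simulated sample
set.seed(123)
x <- rnorm(500, mean = 2, sd = 1.5)

# Parametric: assume a normal model and estimate its parameters
mu_hat    <- mean(x)  # MLE of the mean
sigma_hat <- sd(x)    # close to the MLE of the sd (denominator n - 1, not n)

# Non-parametric: estimate the density directly with KDE
kde <- density(x)

plot(kde, main = "Parametric vs. Non-Parametric Estimate", lwd = 2)
curve(dnorm(x, mean = mu_hat, sd = sigma_hat), add = TRUE,
      col = "red", lwd = 2, lty = 2)
legend("topright", legend = c("KDE", "Fitted normal"),
       col = c("black", "red"), lty = c(1, 2), lwd = 2)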

8.1 Kernel Density Estimation (KDE)

Kernel Density Estimation (KDE) is a popular non-parametric method for estimating the probability density function of a random variable. The basic idea behind KDE is to place a smooth kernel function (e.g., Gaussian, Epanechnikov, etc.) at each data point and then sum these kernels to obtain a smooth estimate of the density.

The KDE at a point \(x\) is given by: \[ \hat{f}(x) = \frac{1}{n h} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right), \] where:

  • \(\hat{f}(x)\) is the estimated density at point \(x\),
  • \(n\) is the number of data points,
  • \(h\) is the bandwidth (smoothing parameter),
  • \(K\) is the kernel function,
  • \(X_i\) are the observed data points.
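
The formula translates directly into a few lines of R. The sketch below (manual_kde is our own illustrative helper, using a Gaussian kernel and a fixed bandwidth) reproduces what base R's density() computes:

# Direct translation of the KDE formula, with K = dnorm (Gaussian kernel)
manual_kde <- function(x, data, h) {
  sapply(x, function(x0) mean(dnorm((x0 - data) / h)) / h)  # (1/(n h)) * sum K((x - X_i)/h)
}

set.seed(123)
X    <- rnorm(100)
grid <- seq(-4, 4, length.out = 200)
plot(grid, manual_kde(grid, X, h = 0.4), type = "l",
     xlab = "x", ylab = "Density", main = "Hand-Rolled KDE")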

Properties of KDE:

  • The choice of kernel function \(K\) affects the smoothness of the estimated density. Common choices include Gaussian, Epanechnikov, and Uniform kernels.

  • The bandwidth \(h\) is a crucial parameter that controls the trade-off between bias and variance in the density estimate. A smaller bandwidth leads to a more detailed estimate (lower bias, higher variance), while a larger bandwidth results in a smoother estimate (higher bias, lower variance).
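
A quick Monte Carlo sketch makes this trade-off visible. We repeatedly estimate the standard normal density at x = 0 (true value dnorm(0) ≈ 0.399) with a very small and a very large bandwidth; the values 0.05 and 1.5 are illustrative choices:

set.seed(123)
est <- replicate(500, {
  x <- rnorm(200)
  c(small = mean(dnorm((0 - x) / 0.05)) / 0.05,  # h = 0.05
    large = mean(dnorm((0 - x) / 1.5)) / 1.5)    # h = 1.5
})
rowMeans(est) - dnorm(0)  # bias: near zero for h = 0.05, strongly negative for h = 1.5
apply(est, 1, var)        # variance: large for h = 0.05, tiny for h = 1.5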

8.1.1 Example of Using KDE in R

# Generate sample data
set.seed(123)
data <- rnorm(1000, mean = 0, sd = 1)
# Perform Kernel Density Estimation
kde <- density(data, bw = "nrd0")  # Using default bandwidth selection
# Plot the results
plot(kde, main = "Kernel Density Estimation", xlab = "Value", ylab = "Density")

# Compare the KDE with a histogram of the same data
plot(kde, main = "Kernel Density Estimation", xlab = "Value", ylab = "Density", col = "blue", lwd = 2)
hist(data, probability = TRUE, breaks = 30, col = rgb(0.8, 0.2, 0.2, 0.5), add = TRUE)
legend("topright", legend = c("KDE", "Histogram"), col = c("blue", rgb(0.8, 0.2, 0.2, 0.5)), lty = c(1, NA), pch = c(NA, 15))

8.2 Properties of the Kernel

A kernel function \(K\) must satisfy the following properties:

  1. Non-negativity: \(K(u) \geq 0\) for all \(u\).
  2. Normalization: The integral of the kernel function over its entire domain must equal 1: \[ \int_{-\infty}^{\infty} K(u) \, du = 1. \]
  3. Symmetry: The kernel function should be symmetric around zero: \[ K(u) = K(-u). \]
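
These requirements are easy to check numerically; as a sketch, for the Epanechnikov kernel (defined in the next subsection):

# Numerical check of the three kernel properties
K <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
all(K(seq(-2, 2, by = 0.01)) >= 0)  # non-negativity: TRUE
integrate(K, -1, 1)$value           # normalization over the support [-1, 1]: ~1
K(0.3) == K(-0.3)                   # symmetry: TRUE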

8.2.1 Common Kernel Functions

  • Gaussian Kernel: \[ K(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2} \]
  • Epanechnikov Kernel: \[ K(u) = \frac{3}{4}(1 - u^2) \text{ for } |u| \leq 1, \text{ and } 0 \text{ otherwise.} \]
  • Uniform Kernel: \[ K(u) = \frac{1}{2} \text{ for } |u| \leq 1, \text{ and } 0 \text{ otherwise.} \]
library(ggplot2)
library(dplyr)
library(tidyr)

# --- Kernel functions ---
gaussian_kernel <- function(u) {
  (1 / sqrt(2 * pi)) * exp(-0.5 * u^2)
}

epanechnikov_kernel <- function(u) {
  ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
}

uniform_kernel <- function(u) {
  ifelse(abs(u) <= 1, 0.5, 0)
}

# Grid of points for kernels
x_kernel <- seq(-4, 4, length.out = 400)

# Data frame for kernel functions
df_kernel <- tibble(
  x = x_kernel,
  Gaussian = gaussian_kernel(x_kernel),
  Epanechnikov = epanechnikov_kernel(x_kernel),
  Uniform = uniform_kernel(x_kernel)
) |>
  pivot_longer(cols = -x, names_to = "Kernel", values_to = "value") |>
  mutate(Type = "Kernel function")

# --- Density estimation using different kernels ---

set.seed(123)
sample_data <- rnorm(300, mean = 0, sd = 1)  # some example data

dens_gauss <- density(sample_data, kernel = "gaussian", n = 400)
dens_epa   <- density(sample_data, kernel = "epanechnikov", n = 400)
# 'Uniform' ≈ 'rectangular' kernel in base::density()
dens_uni   <- density(sample_data, kernel = "rectangular", n = 400)

df_dens <- bind_rows(
  tibble(x = dens_gauss$x, value = dens_gauss$y, Kernel = "Gaussian"),
  tibble(x = dens_epa$x,   value = dens_epa$y,   Kernel = "Epanechnikov"),
  tibble(x = dens_uni$x,   value = dens_uni$y,   Kernel = "Uniform")
) |>
  mutate(Type = "Density estimate")

# --- Combine and plot ---

df_all <- bind_rows(df_kernel, df_dens)

ggplot(df_all, aes(x = x, y = value)) +
  geom_line(linewidth = 1) +
  facet_grid(Type ~ Kernel, scales = "free_y") +
  labs(
    x = "x or u",
    y = "",
    title = "Kernel Functions and Corresponding Density Estimates"
  ) +
  theme_bw(base_size = 14)

8.2.2 Bandwidth Selection

Choosing an appropriate bandwidth is critical for obtaining a good density estimate. Several methods exist for bandwidth selection, including:

  • Rule of Thumb: A simple method based on the standard deviation and sample size.

  • Cross-Validation: A data-driven approach that minimizes the integrated squared error.

  • Plug-in Methods: These methods estimate the optimal bandwidth based on the data’s characteristics.
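
Base R ships a selector from each of these families; a quick comparison on one sample (all are functions in the stats package):

set.seed(123)
x <- rnorm(500)
bw.nrd0(x)  # Silverman's rule of thumb (the default used by density())
bw.nrd(x)   # Scott's rule-of-thumb variant
bw.ucv(x)   # unbiased (least-squares) cross-validation
bw.SJ(x)    # Sheather-Jones plug-in selector

Any of these can be passed to density() as a number (e.g., bw = bw.SJ(x)) or selected by name (e.g., bw = "SJ").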

8.2.3 Example of Bandwidth Selection in R

# Generate sample data
set.seed(123)
data <- rnorm(1000, mean = 0, sd = 1)
# Perform KDE with different bandwidths
kde_small_bw <- density(data, bw = 0.1)
kde_large_bw <- density(data, bw = 1)
# Plot the results
plot(kde_small_bw, main = "KDE with Different Bandwidths", xlab = "Value", ylab = "Density", col = "red", lwd = 2)
lines(kde_large_bw, col = "green", lwd = 2)
legend("topright", legend = c("Small Bandwidth (0.1)", "Large Bandwidth (1)"), col = c("red", "green"), lty = 1)

8.2.4 Other Kernel Functions

  • Epanechnikov Kernel
  • Uniform Kernel
  • Triangular Kernel

8.2.5 Example of Different Kernel Functions in R

# Generate sample data
set.seed(123)
data <- rnorm(1000, mean = 0, sd = 1)
# Perform KDE with different kernel functions
kde_gaussian <- density(data, kernel = "gaussian")
kde_epanechnikov <- density(data, kernel = "epanechnikov")
kde_triangular <- density(data, kernel = "triangular")
# Plot the results
plot(kde_gaussian, main = "KDE with Different Kernel Functions", xlab = "Value", ylab = "Density", col = "blue", lwd = 2)
lines(kde_epanechnikov, col = "orange", lwd = 2)
lines(kde_triangular, col = "purple", lwd = 2)
legend("topright", legend = c("Gaussian Kernel", "Epanechnikov Kernel", "Triangular Kernel"), col = c("blue", "orange", "purple"), lty = 1)

## Repeat the kernel comparison with a common fixed bandwidth (bw = 0.2)
kde_gaussian_bw <- density(data, kernel = "gaussian", bw = 0.2)
kde_epanechnikov_bw <- density(data, kernel = "epanechnikov", bw = 0.2)
kde_triangular_bw <- density(data, kernel = "triangular", bw = 0.2)
# Plot the results
plot(kde_gaussian_bw, main = "KDE with Different Kernel Functions (bw=0.2)", xlab = "Value", ylab = "Density", col = "blue", lwd = 2)
lines(kde_epanechnikov_bw, col = "orange", lwd = 2)
lines(kde_triangular_bw, col = "purple", lwd = 2)
legend("topright", legend = c("Gaussian Kernel", "Epanechnikov Kernel", "Triangular Kernel"), col = c("blue", "orange", "purple"), lty = 1)

## Show how a single (Gaussian) kernel behaves under three different bandwidths
kde_gaussian_bw1 <- density(data, kernel = "gaussian", bw = 0.1)
kde_gaussian_bw2 <- density(data, kernel = "gaussian", bw = 0.5)
kde_gaussian_bw3 <- density(data, kernel = "gaussian", bw = 1)
# Plot the results
plot(kde_gaussian_bw1, main = "Gaussian Kernel with Different Bandwidths", xlab = "Value", ylab = "Density", col = "red", lwd = 2)
lines(kde_gaussian_bw2, col = "green", lwd = 2)
lines(kde_gaussian_bw3, col = "blue", lwd = 2)
legend("topright", legend = c("Bandwidth = 0.1", "Bandwidth = 0.5", "Bandwidth = 1"), col = c("red", "green", "blue"), lty = 1)

## Show the underlying data: histogram with the three KDE overlays
hist(data, probability = TRUE, breaks = 30, col = rgb(0.8, 0.2, 0.2, 0.5), main = "Histogram with KDE Overlay", xlab = "Value", ylab = "Density")
lines(kde_gaussian_bw1, col = "red", lwd = 2)
lines(kde_gaussian_bw2, col = "green", lwd = 2)
lines(kde_gaussian_bw3, col = "blue", lwd = 2)
legend("topright", legend = c("Bandwidth = 0.1", "Bandwidth = 0.5", "Bandwidth = 1"), col = c("red", "green", "blue"), lty = 1)

8.3 Other Density Estimation Methods

Besides KDE, there are other methods for density estimation, including:

  • Histogram-based Methods: Dividing the data range into bins and counting the number of observations in each bin to estimate the density.
  • Nearest Neighbor Methods: Estimating the density from the distance to the k-th nearest observation (see the sketch after this list).
  • Spline-based Methods: Using spline functions to create smooth density estimates.
  • Mixture Models: Modeling the data as a mixture of several distributions (e.g., Gaussian Mixture Models).
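
As one example, the classical k-nearest-neighbor estimator sets \(\hat{f}(x) = \frac{k}{2 n d_k(x)}\), where \(d_k(x)\) is the distance from \(x\) to its \(k\)-th nearest observation. A minimal sketch (the helper knn_density and the choice k = 20 are ours, for illustration):

# k-NN density estimate: f_hat(x) = k / (2 * n * d_k(x))
knn_density <- function(x, data, k = 20) {
  n <- length(data)
  sapply(x, function(x0) {
    dk <- sort(abs(data - x0))[k]  # distance to the k-th nearest observation
    k / (2 * n * dk)
  })
}

set.seed(123)
X    <- rnorm(500)
grid <- seq(-3, 3, length.out = 200)
plot(grid, knn_density(grid, X), type = "l",
     xlab = "Value", ylab = "Density", main = "k-NN Density Estimate (k = 20)")

Note that this estimator is not itself a proper density (it need not integrate to 1) and tends to be rough in the tails, which is one reason KDE is usually preferred in one dimension.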
