2 Week 1 Overview

In this first week, we introduce the mathematical language and statistical framework used throughout the course. Our focus is on vector and matrix notation, random vectors, expectation and covariance operators, and the basic form of the linear regression model.

2.1 Learning Objectives

By the end of this week, students should be able to:

  • use standard matrix notation for linear statistical models;
  • distinguish between scalars, vectors, matrices, random variables, and random vectors;
  • compute expectations and covariance matrices for random vectors;
  • interpret the linear regression model in matrix form;
  • understand why projection and least squares will play a central role in this course.

2.2 Reading

Recommended reading for this week:

  • Seber and Lee, Chapter 1:
    • 1.1 Notation
    • 1.2 Statistical Models
    • 1.3 Linear Regression Models
    • 1.4 Expectation and Covariance Operators
  • Optional preview:
    • Chapter 2: Multivariate Normal Distribution

2.3 Why Linear Statistical Analysis?

Linear statistical analysis is one of the central foundations of graduate statistics. Many methods that at first look different are built on the same underlying structure:

  • regression,
  • analysis of variance,
  • analysis of covariance,
  • prediction,
  • model comparison,
  • and parts of generalized linear modelling.

A major goal of this course is to see these topics under a unified framework.

2.4 A unifying point of view

A large part of the course can be summarized by the model

\[ Y = X\beta + \varepsilon, \]

where

  • \(Y\) is a response vector,
  • \(X\) is a design matrix,
  • \(\beta\) is an unknown parameter vector,
  • \(\varepsilon\) is a random error vector.

This compact expression contains a great deal of statistical structure. Over the semester, we will study how to estimate \(\beta\), quantify uncertainty, test hypotheses, diagnose model failures, and make predictions.

2.5 Basic Notation

2.6 Scalars, vectors, and matrices

We use the following conventions throughout the course:

  • scalars are written in lowercase italic letters, such as \(a\), \(b\), \(n\);
  • vectors are written in bold lowercase letters, such as \(\mathbf{x}\), \(\mathbf{y}\);
  • matrices are written in bold uppercase letters, such as \(\mathbf{X}\), \(\mathbf{A}\);
  • random variables are often written in uppercase letters, such as \(Y\);
  • realizations of random variables are written in lowercase letters, such as \(y\).

A column vector in \(\mathbb{R}^n\) is written as

\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}. \]

An \(n \times p\) matrix is written as

\[ \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}. \]

2.7 Transpose, inverse, and rank

If \(\mathbf{A}\) is a matrix, then:

  • \(\mathbf{A}^\top\) denotes its transpose;
  • \(\mathbf{A}^{-1}\) denotes its inverse, when it exists;
  • \(\mathrm{rank}(\mathbf{A})\) denotes its rank;
  • \(\mathbf{I}_n\) denotes the \(n \times n\) identity matrix.

2.8 Inner product and norm

For vectors \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^n\), the inner product is

\[ \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^n x_i y_i. \]

The Euclidean norm is

\[ \|\mathbf{x}\| = \sqrt{\mathbf{x}^\top \mathbf{x}}. \]

These ideas are fundamental because least squares estimation is based on minimizing squared Euclidean distance.
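As a quick illustration, these quantities map directly onto one line of R each; the vectors below are arbitrary examples, not part of the course data.

```r
x <- c(1, 2, 2)   # arbitrary example vectors
y <- c(3, 0, 4)

# inner product: sum of elementwise products
inner <- sum(x * y)          # equivalently: drop(t(x) %*% y)
inner                        # 1*3 + 2*0 + 2*4 = 11

# Euclidean norm: square root of the inner product of x with itself
norm_x <- sqrt(sum(x * x))
norm_x                       # sqrt(1 + 4 + 4) = 3
```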

2.9 Random Vectors

2.9.1 Definition

A random vector is a vector whose entries are random variables. For example,

\[ \mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \]

is an \(n\)-dimensional random vector.

In linear models, the response is naturally treated as a random vector.

2.10 Mean vector

The mean vector of \(\mathbf{Y}\) is

\[ \mathbb{E}[\mathbf{Y}] = \begin{bmatrix} \mathbb{E}[Y_1] \\ \mathbb{E}[Y_2] \\ \vdots \\ \mathbb{E}[Y_n] \end{bmatrix}. \]

We often write

\[ \mathbb{E}[\mathbf{Y}] = \boldsymbol{\mu}. \]

2.11 Covariance matrix

The covariance matrix of \(\mathbf{Y}\) is

\[ \mathrm{Var}(\mathbf{Y}) = \mathbb{E}\left[ (\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})^\top \right]. \]

Equivalently,

\[ \mathrm{Var}(\mathbf{Y}) = \mathbb{E}[\mathbf{Y}\mathbf{Y}^\top] - \boldsymbol{\mu}\boldsymbol{\mu}^\top. \]

If

\[ \mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}, \]

then

\[ \mathrm{Var}(\mathbf{Y}) = \begin{bmatrix} \mathrm{Var}(Y_1) & \mathrm{Cov}(Y_1,Y_2) & \mathrm{Cov}(Y_1,Y_3) \\ \mathrm{Cov}(Y_2,Y_1) & \mathrm{Var}(Y_2) & \mathrm{Cov}(Y_2,Y_3) \\ \mathrm{Cov}(Y_3,Y_1) & \mathrm{Cov}(Y_3,Y_2) & \mathrm{Var}(Y_3) \end{bmatrix}. \]
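A covariance matrix of this form can be estimated from simulated draws. The sketch below uses an arbitrarily chosen dependence structure (Y2 depends on Y1; Y3 is independent of both) and shows that `cov()` returns a symmetric matrix with variances on the diagonal and covariances off it.

```r
set.seed(1)
n  <- 100000
Y1 <- rnorm(n)
Y2 <- Y1 + rnorm(n)   # correlated with Y1: Var = 2, Cov(Y1, Y2) = 1
Y3 <- rnorm(n)        # independent of the others
Y  <- cbind(Y1, Y2, Y3)

# sample covariance matrix: estimates Var(Y) entrywise
S <- cov(Y)
round(S, 1)
# diagonal near (1, 2, 1); S[1,2] near 1; S[1,3] and S[2,3] near 0
```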

Expectation and Covariance Operators

2.11.1 Linearity of expectation

If \(\mathbf{A}\) is a constant matrix and \(\mathbf{b}\) is a constant vector, then

\[ \mathbb{E}[\mathbf{A}\mathbf{Y} + \mathbf{b}] = \mathbf{A}\mathbb{E}[\mathbf{Y}] + \mathbf{b}. \]

This is one of the most useful identities in the course.

2.11.2 Covariance of linear transformations

If \(\mathbf{A}\) is a constant matrix, then

\[ \mathrm{Var}(\mathbf{A}\mathbf{Y}) = \mathbf{A}\,\mathrm{Var}(\mathbf{Y})\,\mathbf{A}^\top. \]

More generally, if \(\mathbf{Y}\) and \(\mathbf{Z}\) are random vectors, then

\[ \mathrm{Cov}(\mathbf{A}\mathbf{Y}, \mathbf{B}\mathbf{Z}) = \mathbf{A}\,\mathrm{Cov}(\mathbf{Y},\mathbf{Z})\,\mathbf{B}^\top. \]

These formulas are essential for deriving the variance of estimators later.
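The identity \(\mathrm{Var}(\mathbf{A}\mathbf{Y}) = \mathbf{A}\,\mathrm{Var}(\mathbf{Y})\,\mathbf{A}^\top\) can be checked by Monte Carlo simulation; in the sketch below both \(\mathbf{A}\) and \(\boldsymbol{\Sigma}\) are made up, chosen only so that \(\boldsymbol{\Sigma}\) is positive definite.

```r
set.seed(42)
A     <- matrix(c(1, -1,
                  2,  1), nrow = 2, byrow = TRUE)   # arbitrary constant matrix
Sigma <- matrix(c(2, 1,
                  1, 3), nrow = 2)                  # Var(Y), positive definite

# theoretical covariance of AY
theory <- A %*% Sigma %*% t(A)

# simulate: each row of Y has covariance Sigma, each row of Z is (AY)'
n <- 200000
Y <- matrix(rnorm(2 * n), ncol = 2) %*% chol(Sigma)
Z <- Y %*% t(A)
empirical <- cov(Z)

theory
round(empirical, 2)   # agrees with theory up to simulation error
```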

2.11.3 Special case: independent components

If \(Y_1,\dots,Y_n\) are independent and each has variance \(\sigma^2\), then

\[ \mathrm{Var}(\mathbf{Y}) = \sigma^2 \mathbf{I}_n. \]

This is the most common starting assumption in classical linear regression.

2.12 Statistical Models

A statistical model is a set of probability distributions that may plausibly describe the data-generating mechanism.

2.12.1 General idea

Suppose we observe data \(y\) from a random quantity \(Y\). A model introduces assumptions about the distribution of \(Y\), often indexed by an unknown parameter \(\theta\).

For example:

\[ Y \sim N(\mu, \sigma^2) \]

with unknown parameters \(\mu\) and \(\sigma^2\).

In regression, the model is not only about the marginal distribution of the response but also about how the mean changes with explanatory variables.

2.12.2 Deterministic part and random part

A useful way to think about a statistical model is:

\[ \text{data} = \text{systematic part} + \text{random part}. \]

For linear regression, this becomes

\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i. \]

Here,

  • \(\beta_0 + \beta_1 x_i\) is the systematic part;
  • \(\varepsilon_i\) is the random part.

2.13 The Linear Regression Model

2.13.1 Simple linear regression

The simplest regression model is

\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,\dots,n. \]

Typical assumptions are

\[ \mathbb{E}[\varepsilon_i] = 0, \qquad \mathrm{Var}(\varepsilon_i) = \sigma^2, \qquad \mathrm{Cov}(\varepsilon_i,\varepsilon_j)=0 \text{ for } i \ne j. \]

2.13.2 Matrix form

We can write the model compactly as

\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \]

where

\[ \mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \qquad \mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \qquad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}. \]

This notation will allow us to treat simple regression, multiple regression, ANOVA, and ANCOVA in one common language.
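In R, a design matrix of this form can be built either by hand or with `model.matrix`, which adds the intercept column automatically; the predictor values below are made up for illustration.

```r
x <- c(0.5, 1.0, 1.5, 2.0)   # made-up predictor values

# by hand: a column of ones followed by the predictor
X_hand <- cbind(1, x)

# from a model formula: R constructs the same matrix
X_mm <- model.matrix(~ x)

X_mm   # first column is the intercept, second is x
```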

2.13.3 Mean and variance under the model

If

\[ \mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0} \qquad \text{and} \qquad \mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}_n, \]

then

\[ \mathbb{E}[\mathbf{Y}] = \mathbf{X}\boldsymbol{\beta} \]

and

\[ \mathrm{Var}(\mathbf{Y}) = \sigma^2 \mathbf{I}_n. \]

These are immediate consequences of the expectation and covariance rules above.
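Both consequences can be verified by simulation. In this sketch the coefficients, predictor values, and \(\sigma\) are all made up; the empirical mean and covariance of many simulated responses should be close to \(\mathbf{X}\boldsymbol{\beta}\) and \(\sigma^2 \mathbf{I}_n\).

```r
set.seed(7)
beta  <- c(1, 2)            # made-up true coefficients
x     <- c(0, 1, 2, 3)
X     <- cbind(1, x)
mu    <- drop(X %*% beta)   # E[Y] = X beta = (1, 3, 5, 7)

sigma <- 0.5
reps  <- 50000
# each column is one simulated Y = X beta + eps with Var(eps) = sigma^2 I
Ys <- replicate(reps, mu + rnorm(length(x), sd = sigma))

rowMeans(Ys)          # close to (1, 3, 5, 7)
round(cov(t(Ys)), 2)  # close to 0.25 * I
```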

2.13.4 Geometry Preview

A central idea in linear regression is projection.

The fitted values \(\hat{\mathbf{Y}}\) will later be obtained by projecting \(\mathbf{Y}\) onto the column space of \(\mathbf{X}\).

2.13.5 Column space

The column space of \(\mathbf{X}\) is

\[ \mathcal{C}(\mathbf{X}) = \{ \mathbf{X}\boldsymbol{\beta} : \boldsymbol{\beta} \in \mathbb{R}^p \}. \]

This is the set of all mean vectors that the model can represent.

2.14 Why projection matters

The least squares estimator \(\hat{\boldsymbol{\beta}}\) minimizes

\[ \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|^2 \]

over all \(\boldsymbol{\beta}\). Geometrically, the fitted vector \(\mathbf{X}\hat{\boldsymbol{\beta}}\) is the point of \(\mathcal{C}(\mathbf{X})\) closest to \(\mathbf{Y}\).

So regression is not only algebra. It is also geometry.

3 Worked Example by Hand

Suppose we observe the following data:

\[ \begin{array}{c|cccc} x_i & 0 & 1 & 2 & 3 \\ \hline y_i & 1 & 3 & 3 & 5 \end{array} \]

Then

\[ \mathbf{Y} = \begin{bmatrix} 1 \\ 3 \\ 3 \\ 5 \end{bmatrix}, \qquad \mathbf{X} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}. \]

We will learn next week that the least squares estimator is

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{Y}. \]

For now, the main point is to understand how the model is written and how the data are represented in matrix form.
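Although the estimator is formally introduced next week, it is easy to apply the formula directly to this small dataset as a preview; `solve()` is used rather than an explicit matrix inverse.

```r
Y <- c(1, 3, 3, 5)
X <- cbind(1, c(0, 1, 2, 3))   # intercept column plus x

# beta_hat = (X'X)^{-1} X'Y, computed by solving the normal equations
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)
drop(beta_hat)   # intercept 1.2, slope 1.2, matching the lm() output in Section 4
```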

4 R Demonstration

x <- c(0, 1, 2, 3)
y <- c(1, 3, 3, 5)

fit <- lm(y ~ x)
summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
   1    2    3    4 
-0.2  0.6 -0.6  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   1.2000     0.5292   2.268   0.1515  
x             1.2000     0.2828   4.243   0.0513 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6325 on 2 degrees of freedom
Multiple R-squared:    0.9, Adjusted R-squared:   0.85 
F-statistic:    18 on 1 and 2 DF,  p-value: 0.05132

# Inspecting the design matrix
model.matrix(fit)
  (Intercept) x
1           1 0
2           1 1
3           1 2
4           1 3
attr(,"assign")
[1] 0 1

# Fitted values and residuals
fitted(fit)
  1   2   3   4 
1.2 2.4 3.6 4.8 
residuals(fit)
   1    2    3    4 
-0.2  0.6 -0.6  0.2 

# quick plot
plot(x, y, pch = 19, xlab = "x", ylab = "y")
abline(fit, lwd = 2)

5 In-Class Discussion Questions

  1. Why is it helpful to write regression models in matrix form rather than only scalar notation?
  2. What does the covariance matrix tell us that separate variances do not?
  3. What is the interpretation of the column space of \(\mathbf{X}\)?
  4. In what sense is regression a projection problem?

6 Practice Problems

Conceptual

  1. Explain the difference between a random variable and a random vector.
  2. Explain why \(\mathrm{Var}(\mathbf{Y})\) must be a symmetric matrix.
  3. Give an example of a statistical model outside regression.

Computational

Let

\[ \mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} \]

with

\[ \mathbb{E}[\mathbf{Y}] = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \qquad \mathrm{Var}(\mathbf{Y}) = \begin{bmatrix} 4 & 1 \\ 1 & 9 \end{bmatrix}. \]

Let

\[ \mathbf{A} = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} 3 \\ -1 \end{bmatrix}. \]

Compute:

  1. \(\mathbb{E}[\mathbf{A}\mathbf{Y} + \mathbf{b}]\)
  2. \(\mathrm{Var}(\mathbf{A}\mathbf{Y})\)

Regression setup

For the model

\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \]

write out the matrices \(\mathbf{Y}\), \(\mathbf{X}\), \(\boldsymbol{\beta}\), and \(\boldsymbol{\varepsilon}\) for \(n=5\) observations.

6.1 Suggested Homework

Complete the following:

  • review matrix multiplication and transpose rules;
  • derive \(\mathbb{E}[\mathbf{A}\mathbf{Y}+\mathbf{b}]\) from first principles;
  • derive \(\mathrm{Var}(\mathbf{A}\mathbf{Y})\) using the definition of covariance;
  • write the simple linear regression model in matrix form for a dataset of your choice;
  • fit a simple regression in R and report:
    • the estimated coefficients,
    • fitted values,
    • residuals,
    • and a scatterplot with the fitted line.

6.2 Summary

This week introduced the notation and basic probabilistic tools needed for the rest of the course. We defined random vectors, mean vectors, covariance matrices, and the matrix form of the linear regression model. These ideas will support everything that follows.

Next week, we will study least squares estimation and the geometry of projection in more detail.

For optional review of matrix algebra, see Chapter 14.