 Figure 1: As the correlation between regressors increases, the OLS method becomes unstable. In the limit $|corr(x_{i}, x_{j})| \rightarrow 1$, the OLS objective function is no longer strictly convex and the OLS solution is no longer unique.

In ordinary least squares (OLS) regression, for a given $X: \mathbb{R}^{p} \rightarrow \mathbb{R}^{n}$ and $y \in \mathbb{R}^{n}$, we minimize, over $\beta \in \mathbb{R}^{p}$, the sum of squared residuals

\begin{equation} S(\beta) = \left\lVert y - X\beta \right\rVert_{2}^{2} \end{equation}

We first use a simple example in R to illustrate the problem with estimating the unknown parameters $\beta$ by the OLS method when the regressors are highly correlated.

Suppose we have the model $y \sim \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2}$; more specifically, let

\begin{equation} \beta_{0} = 3, \quad \beta_{1} = \beta_{2} = 1. \end{equation}

and let the sample contain $n = 100$ observations

n <- 100


We then introduce some highly correlated regressors

set.seed(42)
x1  <- rnorm(n)
x2  <- rnorm(n, mean = x1, sd = 0.01)


with a correlation coefficient of almost 1

cor(x1, x2)
 0.999962365268769


Let’s run the OLS method 1000 times, drawing a fresh response $y$ each time, to get a sense of the effect of the highly correlated regressors

x <- as.matrix(cbind(intr = 1, x1, x2))
nsim <- 1000
betas <- sapply(1:nsim, function(i) {
  # draw a fresh response from the true model y = 3 + x1 + x2 + noise
  y <- rnorm(n, mean = 3 + x1 + x2, sd = 1)
  # OLS estimate via the normal equations: (X'X)^{-1} X'y
  xx <- solve(t(x) %*% x)
  beta.ols <- as.vector(xx %*% t(x) %*% y)
  return(beta.ols)
})


The estimator for $\beta$, obtained by the OLS method, is still unbiased

round(apply(betas, 1, mean), 3)
 2.996 1.006 0.993


But the variance becomes very large

round(apply(betas, 1, sd), 3)
  0.101 11.110 11.103


The estimated coefficients can become very large, and some can even have the wrong sign

round(betas[, 1], 3)
  3.002 -7.673  9.529


The problem can be seen by drawing a contour plot of the objective function $S(\beta)$, with the intercept fixed at its true value $\beta_{0} = 3$

# draw one response from the true model; it is needed by ssr() below
y <- rnorm(n, mean = 3 + x1 + x2, sd = 1)

# sum of squared residuals as a function of the coefficient vector b
ssr <- function(b) {
  return(sum((y - x %*% b)^2))
}

xlen <- 100
ylen <- 100
xgrid <- seq(-10.1, 10.1, length.out = xlen)
ygrid <- seq(-10.1, 10.1, length.out = ylen)
zvals <- matrix(NA, nrow = xlen, ncol = ylen)
for (i in 1:xlen) {
  for (j in 1:ylen) {
    # intercept fixed at 3, vary beta1 and beta2 over the grid
    zvals[i, j] <- ssr(c(3, xgrid[i], ygrid[j]))
  }
}

contour(x = xgrid, y = ygrid, z = zvals,
        levels = c(1e3, 3e3, 6e3, 1e4))


As shown in Figure 1, as the correlation between the regressors increases, the matrix $X$ becomes nearly singular and the OLS method becomes unstable.

In the limit

\begin{equation} |corr(x_{i}, x_{j})| \rightarrow 1 \end{equation}

the dimension of the column space decreases

\begin{equation} \text{rank}(X) < p \end{equation}

the objective function $S(\beta)$ is no longer strictly convex, and the OLS problem has infinitely many solutions.
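
A quick numerical check of this near-singularity is the condition number of $X$, the ratio of its largest to its smallest singular value. The sketch below uses base R's kappa() on the design matrix x built above; for nearly collinear regressors the condition number is huge (the exact value depends on the simulated data).

# condition number of the design matrix: the ratio of the largest to the
# smallest singular value, which explodes as the regressors become collinear
kappa(x, exact = TRUE)
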

Thinking about this in a different, more intuitive way: we would like to estimate the coefficient $\beta_{1}$ as the influence of $x_{1}$ on $y$, separate from the influence of $x_{2}$. Since the regressors $x_{1}$ and $x_{2}$ are highly correlated, they vary together, and the coefficient $\beta_{1}$ is difficult to estimate.
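
We can see this in the simulation above: the data identify the sum $\beta_{1} + \beta_{2}$, the coefficient of the common direction $x_{1} \approx x_{2}$, much better than either coefficient alone. A quick check on the betas matrix (exact numbers will vary with the simulation):

# spread of the individual slope estimates vs. the spread of their sum
sd(betas[2, ])               # beta1 alone: large
sd(betas[3, ])               # beta2 alone: large
sd(betas[2, ] + betas[3, ])  # beta1 + beta2: small, the sum is well determined
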

The OLS method does not solve these problems, as it only minimizes the sum of squared residuals, i.e. the objective function $S(\beta)$.

### Why does this happen?

In general, we have a linear model

\begin{equation} \label{eq: model} y = X \beta + \epsilon \end{equation}

and assume that the errors $\epsilon$ satisfy the Gauss–Markov assumption

\begin{equation} \mathbb{E}[\epsilon \epsilon^{T}] = \sigma^{2} I \end{equation}

Then the OLS estimator is

\begin{equation} \label{eq: betahat} \hat{\beta} = (X^{T}X)^{-1} X^{T} y. \end{equation}
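
As a quick sanity check, the closed-form estimator \ref{eq: betahat} agrees with R's built-in lm(); the sketch below reuses the simulated regressors and one response y drawn above.

# closed-form OLS estimate vs. the estimate from lm() on the same data
beta.hat <- solve(t(x) %*% x) %*% t(x) %*% y
cbind(closed.form = as.vector(beta.hat),
      lm = coef(lm(y ~ x1 + x2)))
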

From \ref{eq: model} and \ref{eq: betahat} we get

\begin{equation} \hat{\beta} - \beta = (X^{T}X)^{-1} X^{T} \epsilon \end{equation}

therefore, the covariance matrix for $\hat{\beta}$ is

\begin{align} \mathbb{E}[(\hat{\beta} - \beta) (\hat{\beta} - \beta)^{T}] \nonumber &= \mathbb{E}[\left((X^{T}X)^{-1} X^{T} \epsilon\right) \left((X^{T}X)^{-1} X^{T} \epsilon\right)^{T}] \nonumber \\[1em] &= \mathbb{E}[(X^{T}X)^{-1} X^{T} \epsilon \epsilon^{T} X (X^{T}X)^{-1}] \nonumber \\[1em] &= (X^{T}X)^{-1} X^{T} \mathbb{E}[\epsilon \epsilon^{T}] X (X^{T}X)^{-1} \nonumber \\[1em] &= \sigma^{2} (X^{T}X)^{-1} \label{eq: variance} \end{align}

where we took into account that the matrix $(X^{T}X)$ is symmetric and assumed that $X$ is not stochastic and therefore independent of $\epsilon$. The variance of the coefficient $\hat{\beta}_{k}$ is the $(k, k)$-th element of the covariance matrix.
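
We can compare formula \ref{eq: variance} with the simulation above. Since $\sigma^{2} = 1$ there, the empirical covariance of the simulated estimates should be close to $(X^{T}X)^{-1}$ (a rough check, up to simulation noise):

# empirical covariance of the 1000 simulated OLS estimates
cov(t(betas))
# theoretical covariance sigma^2 (X'X)^{-1}, with sigma = 1
solve(t(x) %*% x)
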

The expected squared distance between the estimator $\hat{\beta}$ and the actual $\beta$ is

\begin{align} \mathbb{E}[(\hat{\beta} - \beta)^{T} (\hat{\beta} - \beta)] \nonumber &= \mathbb{E}[((X^{T}X)^{-1} X^{T} \epsilon)^{T} ((X^{T}X)^{-1} X^{T} \epsilon)] \nonumber \\[1em] &= \mathbb{E}[\epsilon^{T} X (X^{T}X)^{-1} (X^{T}X)^{-1} X^{T} \epsilon] \nonumber \\[1em] &= \mathbb{E}[\text{tr}(\epsilon^{T} X (X^{T}X)^{-1} (X^{T}X)^{-1} X^{T} \epsilon)] \nonumber \\[1em] &= \mathbb{E}[\text{tr}(\epsilon \epsilon^{T} X (X^{T}X)^{-1} (X^{T}X)^{-1} X^{T})] \nonumber \\[1em] &= \sigma^{2} \text{tr}( X (X^{T}X)^{-1} (X^{T}X)^{-1} X^{T} ) \nonumber \\[1em] &= \sigma^{2} \text{tr}( X^{T} X (X^{T}X)^{-1} (X^{T}X)^{-1}) \nonumber \\[1em] &= \sigma^{2} \text{tr}((X^{T}X)^{-1}) \label{eq: distance} \end{align}

where we took into account that the squared distance is a scalar and therefore equal to its own trace, and that the expectation and the trace commute. We then used the fact that the trace is invariant with respect to cyclic permutations

\begin{equation} \text{tr}(ABCD) = \text{tr}(DABC) \end{equation}

From \ref{eq: variance} and \ref{eq: distance}, we see that both the variance of the estimator and the distance of the estimator from the actual $\beta$ depend on the matrix $(X^{T}X)^{-1}$.
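
The distance formula \ref{eq: distance} can also be checked against the simulation; with $\sigma^{2} = 1$ and true coefficients $(3, 1, 1)$, the two quantities below should be close (up to simulation noise):

# average squared distance of the simulated estimates from the true beta
mean(colSums((betas - c(3, 1, 1))^2))
# theoretical value sigma^2 tr((X'X)^{-1}), with sigma = 1
sum(diag(solve(t(x) %*% x)))
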

The reason why the variance of the estimator and its distance from the actual $\beta$ become large can be shown conveniently with the help of the singular value decomposition. Let

\begin{equation} X = U \Sigma V^{T} \end{equation}

be the singular value decomposition of $X$, where the diagonal of $\Sigma$ contains the singular values

\begin{equation} \sigma_{1} \geq \sigma_{2} \geq \dots \geq \sigma_{p} > 0 \end{equation}

then

\begin{align} (X^{T}X)^{-1} &= (V \Sigma^{T} \Sigma V^{T})^{-1} \nonumber \\[1em] &= (V^{T})^{-1} (\Sigma^{T} \Sigma)^{-1} V^{-1} \nonumber \\[1em] &= V (\Sigma^{T} \Sigma)^{-1} V^{T} \nonumber \\[1em] &= \sum_{j = 1}^{p} \frac{1}{\sigma_{j}^{2}} v_{j} v_{j}^{T} \label{eq: svd} \end{align}
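
We can verify \ref{eq: svd} numerically for our design matrix using base R's svd(); note that the smallest singular value is already tiny because x1 and x2 are nearly collinear (a sketch, exact values depend on the simulated data):

s <- svd(x)
# singular values: the last one is close to zero
s$d
# reconstruct (X'X)^{-1} from the right singular vectors and singular values,
# then compare with the direct inverse
recon <- s$v %*% diag(1 / s$d^2) %*% t(s$v)
all.equal(recon, solve(t(x) %*% x), check.attributes = FALSE)
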

In the limit

\begin{equation} | corr(x_{i}, x_{j}) | \rightarrow 1 \end{equation}

the matrix $X$ becomes singular and the smallest singular value vanishes

\begin{equation} \sigma_{p} \rightarrow 0 \end{equation}

and, from \ref{eq: svd}, also

\begin{equation} (X^{T}X)^{-1} \rightarrow \infty \end{equation}

Therefore, from \ref{eq: variance} and \ref{eq: distance}, both the variance of the OLS estimator and its distance from the actual $\beta$ go to infinity.