Jekyll2022-05-31T11:44:46+00:00https://markolalovic.com/blog/feed.xmlMarko LalovicTechnical blog on machine learning, programming and other math-related topics.Lasso dual2021-08-01T09:34:39+00:002021-08-01T09:34:39+00:00https://markolalovic.com/blog/lasso-dual<div class="images"> <img src="/blog/assets/posts/lasso-dual/transformation-3d.svg" /> <div class="label"> <strong>Figure 1:</strong> Illustration when $n=p=3$ of primal and dual admissible sets $C$ and $D$. Solution to (1) can be determined by a projection of $y$. Furthermore, if $A^{-1}$ exists, then the primal solution can be expressed as a function of $y$. </div> </div> <p>Let $A: \mathbb{R}^{p} \rightarrow \mathbb{R}^{n}$ and $y \in \mathbb{R}^{n}$. Consider the $\ell_{1}$-penalized least-squares problem</p> $\newcommand{\norm}[1]{\left\lVert#1\right\rVert} \begin{equation} \min_{x \in \mathbb{R}^{p}} \frac{1}{2} \norm{Ax - y}_{2}^{2} + \alpha \norm{x}_{1}. \label{eq:min-problem} \end{equation}$ <p>This type of penalized regression is called the Lasso; see Tibshirani’s original paper [<a href="http://statweb.stanford.edu/~tibs/lasso/lasso.pdf" target="_blank">1</a>].</p> <p>In this post, we first derive the dual problem, then show that the solution $x^{*}$ can be determined with the help of a projection operator. Under the assumption that $A^{-1}$ exists, we can further express the solution $x^{*}$ with the help of a mapping $\left(A^{T}\right)^{-1}$ from the dual space as a function of $y$; see <strong>Figure 1</strong>.</p> <p>Some nice and non-obvious properties of the solution $x^{*}$ follow from the geometry of the dual formulation. For example, the Lasso solution is non-expansive as a function of $y$.
This is not obvious and would probably be hard to show without the dual formulation.</p> <!-- Section 1 --> <h2 id="formulation-of-the-dual-problem">Formulation of the dual problem</h2> <p>To derive the dual problem, we can introduce a dummy variable $z \in \mathbb{R}^{n}$</p> $\begin{equation}\nonumber z = Ax \end{equation}$ <p>and reformulate the minimization problem in \eqref{eq:min-problem} as a constrained problem</p> \begin{align*} \label{eq:primal-problem} \tag{P} &amp;\underset{z \in \mathbb{R}^{n}, x \in \mathbb{R}^{p}}{\text{minimize}} \quad \frac{1}{2} \norm{y - z}_{2}^{2} + \alpha \norm{x}_{1} \\ &amp;\text{subject to} \quad z = Ax \end{align*} <p>Then we can construct the Lagrangian by introducing the dual variable $p \in \mathbb{R}^{n}$ (containing $n$ Lagrange multipliers)</p> $\begin{equation}\nonumber L(x, z, p) = \frac{1}{2} \norm{y - z}_{2}^{2} + \alpha \norm{x}_{1} + p^{T} (z - Ax) \end{equation}$ <p>The dual objective function is</p> $\begin{equation}\nonumber g(p) = \min_{z \in \mathbb{R}^{n}, x \in \mathbb{R}^{p}} \left\lbrace \frac{1}{2} \norm{y - z}_{2}^{2} + \alpha \norm{x}_{1} + p^{T} (z - Ax) \right\rbrace \end{equation}$ <p>We can split the terms depending on $z$ and $x$ and minimize each part separately</p> \begin{align}\nonumber g(p) &amp;= \min_{z \in \mathbb{R}^{n}, x \in \mathbb{R}^{p}} \left\lbrace \frac{1}{2} \norm{y}_{2}^{2} - y^{T}z + \frac{1}{2} \norm{z}_{2}^{2} + p^{T}z + \alpha \norm{x}_{1} - p^{T}Ax \right\rbrace \\[.3em] &amp;= \min_{z \in \mathbb{R}^{n}}\nonumber \left\lbrace \frac{1}{2} \norm{y}_{2}^{2} - (y - p)^{T}z + \frac{1}{2} \norm{z}_{2}^{2} \right\rbrace + \min_{x \in \mathbb{R}^{p}} \left\lbrace \alpha \norm{x}_{1} - (A^{T}p)^{T}x \right\rbrace \end{align} <p>We can use the stationarity condition, which says that at the optimal point, the subgradient of $L(x, z, p)$ with respect to $x$ and $z$ must contain 0.</p> <p>For the first part, since $L(x, z, p)$ is differentiable in $z$, the subgradient with respect to
$z$ equals the gradient. By taking $\frac{\partial}{\partial z} L(x, z, p)$ and setting it to $0$, we get the stationarity condition</p> $\begin{equation}\nonumber z = y - p^{*} \end{equation}$ <p>Plugging this into the first part, we get</p> \begin{align}\nonumber &amp;\frac{1}{2}\norm{y}_{2}^{2} - (y - p^{*})^{T}(y - p^{*}) + \frac{1}{2}\norm{y - p^{*}}_{2}^{2} \\[.3em] &amp;= \frac{1}{2}\norm{y}_{2}^{2}\nonumber - \frac{1}{2}\norm{y - p^{*}}_{2}^{2} \end{align} <p>For the second part, because $\alpha \norm{x}_1$ is a non-differentiable function of $x$, we need to compute the subdifferential $\partial (\alpha \norm{x}_1)$.</p> <p>By using the rules for the subgradient of the maximum we can derive, see [<a href="https://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf" target="_blank">2</a>], that the $\partial \norm{x}_{1}$ can be expressed as</p> $\begin{equation}\nonumber \partial (\norm{x}_{1}) = \lbrace g : \norm{g}_{\infty} \leq 1, g^{T} x = \norm{x}_{1} \rbrace \end{equation}$ <p>and using the rule for scalar multiplication</p> $\begin{equation}\nonumber \partial (\alpha\norm{x}_{1}) = \lbrace g : \norm{g}_{\infty} \leq \alpha, g^{T} x = \alpha\norm{x}_{1} \rbrace \end{equation}$ <p>Thus, we get the stationarity condition</p> $\begin{equation}\nonumber g = A^{T}p \in \partial \alpha \norm{x}_{1} \end{equation}$ <p>when</p> \begin{align} \nonumber \norm{A^{T}p}_{\infty} \leq \alpha \\[.3em] \alpha \norm{x}_{1} - (A^{T}p)^{T}x = 0\nonumber \end{align} <p>Therefore, the dual problem is</p> \begin{align*} \label{eq:dual-problem} \tag{D} &amp;\max_{p \in \mathbb{R}^{n}} \frac{1}{2}\norm{y}_{2}^{2} - \frac{1}{2}\norm{y - p}_{2}^{2} \\[.3em] &amp;\text{subject to} \norm{A^{T}p}_{\infty} \leq \alpha \end{align*} <!-- Section 2 --> <h2 id="solution-of-the-dual-problem">Solution of the dual problem</h2> <p>The solution $p^{*}$ of the dual problem can be determined with the help of a projection operator.</p> <p>Looking at the dual problem formulation 
\eqref{eq:dual-problem}, we see that we can omit $\frac{1}{2}\norm{y}_{2}^{2}$, since this is constant. Then if we multiply it by 2 and flip the sign, we get an equivalent dual problem</p> \begin{align*} \label{eq:dual-problem-prime} \tag{D'} &amp;\min_{p \in \mathbb{R}^{n}} \norm{y - p}_{2}^{2} \\[.3em] &amp;\text{subject to} \quad \norm{A^{T}p}_{\infty} \leq \alpha \end{align*} <p>A <em>projection operator</em> $P_{C}$ can be defined as</p> \DeclareMathOperator*{\argmin}{arg\,min} \begin{align}\nonumber P_{C} \text{: } &amp; \mathbb{R}^{n} \rightarrow C \\[.3em] &amp; y \mapsto P_{C}(y) := \argmin_{p \in C} \norm{y - p}_{2} \nonumber \end{align} <p>where $C \subset \mathbb{R}^{n}$ is a closed and convex set. Looking at \eqref{eq:dual-problem-prime}, we see that $p^{*}$ is a projection</p> $\begin{equation}\nonumber p^{*} = P_{C}(y) \end{equation}$ <p>where $C$ is equal to</p> $\begin{equation}\nonumber C = \lbrace p \in \mathbb{R}^{n} : \norm{A^{T}p}_{\infty} \leq \alpha \rbrace \end{equation}$ <p>We notice that $C$ is indeed closed and convex.
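</p> <p>As a quick sanity check (my own illustration, not part of the derivation): when $A = I$, the set $C$ is just the box $\lbrace p : \norm{p}_{\infty} \leq \alpha \rbrace$, and the projection $P_{C}$ is coordinate-wise clipping. We can verify numerically that clipping beats every other feasible point:</p>

```python
import numpy as np

# Assumption (my own example): A = I, so C = {p : ||p||_inf <= alpha} is a box
# and the projection P_C(y) is coordinate-wise clipping of y onto [-alpha, alpha].
rng = np.random.default_rng(0)
alpha = 1.0
y = np.array([2.0, -0.3, 0.7])

p_star = np.clip(y, -alpha, alpha)  # candidate for P_C(y)

# P_C(y) minimizes ||y - p||_2 over C: no random feasible q should do better.
for _ in range(1000):
    q = rng.uniform(-alpha, alpha, size=3)  # a point q in C
    assert np.linalg.norm(y - p_star) <= np.linalg.norm(y - q) + 1e-12
```

<p>(With the primal recovery formula derived in the next section, $y - P_{C}(y)$ for $A = I$ is exactly the well-known soft-thresholding solution of the Lasso.)</p> <p>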
For example, let $n=p=2$ and</p> $\begin{equation}\nonumber A^{T} = \begin{pmatrix} a_{11} &amp; a_{21}\\ a_{12} &amp; a_{22} \end{pmatrix} \end{equation}$ <p>then</p> $\begin{equation}\nonumber C = \lbrace -\alpha \leq a_{11} p_{1} + a_{12} p_{2} \leq \alpha \rbrace \cap \lbrace -\alpha \leq a_{21} p_{1} + a_{22} p_{2} \leq \alpha \rbrace \end{equation}$ <p>Or we can express $C$ as</p> $\begin{equation}\nonumber C = (A^{T})^{-1}(D) \end{equation}$ <p>where $D$ is equal to</p> $\begin{equation}\nonumber D = \lbrace d \in \mathbb{R}^{p} : \norm{d}_{\infty} \leq \alpha \rbrace \end{equation}$ <p>Again, for $p=2$</p> $\begin{equation}\nonumber D = \lbrace -\alpha \leq d_{1} \leq \alpha \rbrace \cap \lbrace -\alpha \leq d_{2} \leq \alpha \rbrace \end{equation}$ <div class="images"> <img src="/blog/assets/posts/lasso-dual/transformation-2d.svg" /> <div class="label"> <strong>Figure 2:</strong> Illustration when $n=p=2$ of primal and dual admissible sets $C$ and $D$. </div> </div> <!-- Section 3 --> <h2 id="solution-of-the-primal-problem">Solution of the primal problem</h2> <p>From the optimality condition of the dual problem, we can derive the primal solution under the assumption that $A^{-1}$ exists.</p> <p>During the formulation of the dual problem, we introduced the dummy variable $z = Ax$ and derived the stationarity condition $z = y - p^{*}$. From this, we get that every solution $x^{*}$ of \eqref{eq:primal-problem} should satisfy</p> $\begin{equation}\nonumber A x^{*} = y - p^{*} \end{equation}$ <p>where $p^{*}$ is a solution of the dual problem \eqref{eq:dual-problem}. Therefore, if $A^{-1}$ exists, the primal solution is</p> $\begin{equation}\nonumber x^{*} = A^{-1} \left( y - P_{C}(y) \right) \end{equation}$ <p>We notice that, since $P_{C}$ is a projection onto the convex set $C$, it follows that $x^{*}$ is also non-expansive as a function of $y$; see <strong>Figure 3(a)</strong> in contrast to <strong>Figure 3(b)</strong> showing projection onto a non-convex set $N$.
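</p> <p>These relations are easy to check numerically. The sketch below is my own code, not from the post: it solves a small instance of \eqref{eq:min-problem} with ISTA (proximal gradient descent, one standard Lasso solver), forms $p^{*} = y - Ax^{*}$, and verifies the two stationarity conditions derived above.</p>

```python
import numpy as np

# Sketch under my own assumptions: n = p = 3, invertible A, ISTA as the solver.
rng = np.random.default_rng(1)
A = np.eye(3) + 0.3 * rng.standard_normal((3, 3))
y = rng.standard_normal(3)
alpha = 0.1

def soft(v, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

step = 1.0 / np.linalg.norm(A, 2) ** 2  # step size 1 / ||A||_2^2
x = np.zeros(3)
for _ in range(20000):  # ISTA iterations
    x = soft(x - step * A.T @ (A @ x - y), step * alpha)

p = y - A @ x  # dual solution p* = y - Ax*, so x* = A^{-1}(y - P_C(y))
assert np.max(np.abs(A.T @ p)) <= alpha + 1e-5          # p* lies in C
assert abs(alpha * np.abs(x).sum() - (A.T @ p) @ x) < 1e-5  # complementary slackness
```

<p>With both checks passing, $x$ is (numerically) the Lasso solution and $p$ the projection $P_{C}(y)$.</p> <p>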
This is not obvious and would probably be hard to show without the dual formulation.</p> <table> <thead> <tr> <th style="text-align: center"><img src="/blog/assets/posts/lasso-dual/convex.svg" alt="convex.svg" /></th> <th style="text-align: center"><img src="/blog/assets/posts/lasso-dual/bean.svg" alt="bean.svg" /></th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><span class="subfig">(a) Convex set $C$.</span></td> <td style="text-align: center"><span class="subfig">(b) Non-convex set $N$.</span></td> </tr> </tbody> </table> <div class="images"> <div class="label"> <strong>Figure 3:</strong> Lasso solution $x^{*}$ is non-expansive as a function of $y$. </div> </div> <h2 id="references">References</h2> <p> R. Tibshirani, “Regression Shrinkage and Selection via the Lasso”, (1996), Journal of the Royal Statistical Society, Series B, Vol. 58, No. 1, pages 267-288 <a target="_blank" href="https://webdoc.agsci.colostate.edu/koontz/arec-econ535/papers/Tibshirani%20(JRSS-B%201996).pdf">https://webdoc.agsci.colostate.edu/koontz/arec-econ535/papers/Tibshirani%20(JRSS-B%201996).pdf</a></p> <p> S. Boyd and L. Vandenberghe, “Subgradients Notes”, (2008), Stanford University <a target="_blank" href="https://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf">https://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf</a></p>Marko LalovicTopological features applied to the MNIST data set2020-09-25T20:26:36+00:002020-09-25T20:26:36+00:00https://markolalovic.com/blog/tda-digits<p>Persistent homology is a fascinating mathematical tool that continues to be studied, developed, and applied. The purpose of this tutorial is to give a friendly introduction to using persistent homology that does not require substantial knowledge of topological methods.
To illustrate the use of persistent homology in machine learning we apply it to the MNIST data set of handwritten digits. A very similar approach can be applied to any point cloud data and can be generalized to higher dimensions.</p> <div class="images"> <img src="/blog/assets/posts/tda-digits/intro-figure.svg" /> <div class="label"> <strong>Figure 1:</strong> Illustration of main ideas: <strong>(A)</strong> a sample of handwritten digits, <strong>(B)</strong> the extracted graph structure and <strong>(C)</strong> a torus with two cycles on its surface. </div> </div> <p>The main problem we are trying to solve is how to extract the topological features that can be used as an input to standard machine learning algorithms. We will use a similar approach as described in [<a href="https://arxiv.org/abs/1304.0530" target="_blank">1</a>].</p> <p>From each image, we first construct a graph, where pixels of the image correspond to vertices of the graph and we add edges between adjacent pixels; see <strong>Figure 1A</strong> and <strong>1B</strong>. We then extract 0- and 1-dimensional topological features called <em>Betti numbers</em> denoted by $\beta_{0}$ and $\beta_{1}$. For example, a torus has one connected component so $\beta_{0} = 1$, and two cycles or loops so $\beta_{1} = 2$; see <strong>Figure 1C</strong>.</p> <p>A pure topological classification cannot distinguish between individual numbers, as the numbers are topologically too similar. For example, numbers 6 and 9 are topologically the same if we use this style for writing numbers. Persistent homology, however, gives us more information.</p> <div class="images"> <img src="/blog/assets/posts/tda-digits/anim-compressed.gif" /> <div class="label"> <strong>Figure 2:</strong> Applying this technique on <strong>(A)</strong> an image of a handwritten digit 8, <strong>(B)</strong> extracting the graph structure and <strong>(C)</strong> the resulting barcodes as we sweep across the image. 
</div> </div> <p>(The pauses in the animation will be explained below; you can also play with this example interactively: <a href="http://markolalovic.com/tda-digits/" target="_blank">here</a>.)</p> <p>We define a filtration on the vertices of the graph corresponding to the image pixels, adding vertices and edges as we sweep across the image in the vertical or horizontal direction; see <strong>Figure 2A</strong> and <strong>2B</strong>. This adds spatial information to the topological features. For example, though 6 and 9 both have a single loop, it will appear at different locations in the filtration. We then compute the persistent homology given the simplex stream from the filtration to get the finite set of intervals called <em>Betti barcodes</em>; see <strong>Figure 2C</strong>.</p> <p>Finally, from each $k$-dimensional barcode we extract 4 features based on the invariants discussed in [<a href="https://arxiv.org/abs/1304.0530" target="_blank">1</a>]. For each of the 4 sweep directions (top, bottom, right, left) and each dimension (0 and 1), we compute these 4 features, giving a total of 32 features per image. On the extracted features from a set of images, we then apply a support vector machine (SVM) learning algorithm to classify the images.</p> <p><strong>Overview.</strong> First, we give some motivation and put the problem in a broader context of Topological Data Analysis and Computational Topology. Next, we explain the extraction of topological features on the simple example shown in <strong>Figure 2</strong>. Finally, we present and evaluate the empirical classification results on a subset of the MNIST database.</p> <p>The aim is to demonstrate the classification potential of the technique and not to outperform the existing models.
For a more interesting example of using this technique on a clinical data set to classify hepatic lesions, see [<a href="https://arxiv.org/abs/1304.0530" target="_blank">1</a>].</p> <p>I have made all <a href="https://github.com/markolalovic/tda-digits" target="_blank">scripts</a> publicly available, including a processed version of the dataset. I also use the freely available computational topology package <a href="https://github.com/mrzv/dionysus" target="_blank">Dionysus 2</a> for the computation of persistent homology.</p> <h2 id="introduction">Introduction</h2> <p>Topology applied to real-world data sets via persistent homology has begun to find applications in machine learning, including deep learning [<a href="https://arxiv.org/pdf/1905.12200.pdf" target="_blank">2</a>]. Topology is mainly used in a pre-processing step to provide robust features for learning. Our data is often a finite set of noisy samples from some underlying space. The developed topological techniques mostly deal with point clouds, i.e., finite sets of data points in space. Point clouds are typically captured by a variety of imaging devices, such as MRI or CT scanners. With the greater availability of such devices, this type of data is being generated at an increasing rate. The data sets are often also very noisy and contain a lot of missing information, especially biological data sets. Our ability to analyze this data, both in terms of the amount and the nature of the data, is clearly out of step with the data we generate [<a href="https://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/S0273-0979-09-01249-X.pdf" target="_blank">3</a>]. Topology can be used to make a useful contribution to the analysis of such data sets and it is especially helpful in studying them qualitatively.</p> <h3 id="terminology">Terminology</h3> <ul> <li><em>Topology</em> is a branch of mathematics that deals with qualitative geometric information.
This includes the classification of loops and higher-dimensional surfaces.</li> <li><em>Topological data analysis</em> and <em>computational topology</em> deal with the study of topology using a computer.</li> <li><em>Persistent homology</em> is an algebraic method for discerning topological features of data.</li> <li><em>Connected component</em> (or connected cluster of points) is a 0-dimensional feature and <em>cycle</em> (or <em>loop</em>) is a 1-dimensional feature.</li> <li><em>Simplicial complex</em> is a set composed of points, line segments, triangles, and their n-dimensional counterparts.</li> <li><em>Filtration</em> is the sequence of simplicial complexes, with an inclusion map from each simplicial complex to the next.</li> <li><em>Barcode</em> is a visual representation of the persistence of the topological features. Longer bars represent significant features of the data. Shorter bars are due to irregularities or noise.</li> </ul> <h2 id="methods">Methods</h2> <h3 id="extracting-the-graph-structure">Extracting the graph structure</h3> <p>The pre-processing steps to expose the digits topology, shown for the image of a handwritten digit 8 in <strong>Figure 3</strong>, are the following:</p> <ol> <li>Load the MNIST image of a handwritten digit.</li> <li>Produce a binary image by thresholding.</li> <li>Reduce the image to a skeleton of 1-pixel width using the popular <a href="https://github.com/linbojin/Skeletonization-by-Zhang-Suen-Thinning-Algorithm" target="_blank">Zhang-Suen</a> thinning algorithm.</li> </ol> <table> <thead> <tr> <th style="text-align: center"><img src="/blog/assets/posts/tda-digits/1_original-image.png" alt="1_original-image.png" /></th> <th style="text-align: center"><img src="/blog/assets/posts/tda-digits/2_binary-image.png" alt="2_binary-image.png" /></th> <th style="text-align: center"><img src="/blog/assets/posts/tda-digits/3_skeleton.png" alt="3_skeleton.png" /></th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><span 
class="subfig">(1) Original image</span></td> <td style="text-align: center"><span class="subfig">(2) Binary image</span></td> <td style="text-align: center"><span class="subfig">(3) Skeleton</span></td> </tr> </tbody> </table> <div class="images"> <div class="label"> <strong>Figure 3:</strong> Pre-processing steps. </div> </div> <p>To construct the graph $G$ from the 1-pixel width skeleton, we treat the pixels of the skeleton as vertices of $G$ and add the edges between adjacent vertices and then remove all created cycles of length 3. Intuitively, we connect the points while trying not to create new topological features. The result for the image of a handwritten digit 8 is in <strong>Figure 4</strong>. The resulting graph in this example consists of one connected component with two cycles, so no new topological features were added.</p> <div class="images"> <img src="/blog/assets/posts/tda-digits/graph.svg" alt="graph" width="300" /> <div class="label"> <strong>Figure 4:</strong> Extracted graph $G$ as we sweep across the skeleton of a handwritten digit 8. </div> </div> <h3 id="extracting-the-topological-features">Extracting the topological features</h3> <p>We construct a simplex stream for computing the persistent homology using the following filtration on the graph $G$ embedded in the plane. We are adding the vertices and edges of graph $G$ as we sweep across the plane. From the simplex stream, we compute the persistent homology to get the Betti barcodes. The <a href="https://github.com/mrzv/dionysus" target="_blank">Dionysus 2</a> package was used for the computation of persistent homology; see the <a href="https://github.com/markolalovic/tda-digits" target="_blank">Source code</a> and package documentation [<a href="https://mrzv.org/software/dionysus2/" target="_blank">4</a>] for details.</p> <p><strong>Figure 5</strong> shows the result of this technique applied on the image of a handwritten digit 8. 
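</p> <p>Before looking at the result, note that the core bookkeeping behind these barcodes can be sketched without any external library. The code below is my own pure-Python union-find sketch, not the Dionysus 2 code used in the repository: vertices appear at their sweep value, an edge appears at the larger value of its endpoints, a merge kills the younger component (the elder rule), and an edge inside one component gives birth to a cycle.</p>

```python
import math

def sweep_barcodes(heights, edges):
    """0- and 1-dimensional barcodes of a graph under a sweep filtration.

    heights[v] is the sweep value at which vertex v appears; an edge
    appears at the larger of its endpoint values.  (My own sketch of the
    bookkeeping; the post itself uses Dionysus 2 for this step.)
    """
    simplices = [(h, 0, v) for v, h in enumerate(heights)]
    simplices += [(max(heights[u], heights[v]), 1, (u, v)) for u, v in edges]
    simplices.sort(key=lambda s: (s[0], s[1]))  # vertices before edges

    parent, birth = {}, {}  # union-find forest with component birth values

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    b0, b1 = [], []
    for value, dim, data in simplices:
        if dim == 0:
            parent[data], birth[data] = data, value  # a new component is born
        else:
            ru, rv = find(data[0]), find(data[1])
            if ru == rv:
                b1.append((value, math.inf))  # edge closes a loop
            else:
                young, old = sorted((ru, rv), key=lambda r: birth[r])[::-1]
                b0.append((birth[young], value))  # elder rule: younger dies
                parent[young] = old
    b0 += [(birth[r], math.inf) for r in parent if find(r) == r]
    return b0, b1

# A diamond-shaped loop: vertices at heights 0, 1, 1, 2, four edges.
b0, b1 = sweep_barcodes([0, 1, 1, 2], [(0, 1), (0, 2), (1, 3), (2, 3)])
```

<p>On the diamond, this gives one infinite $\beta_{0}$ bar born at height 0 and one $\beta_{1}$ bar born at height 2, when the loop closes.</p> <p>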
In this example, we sweep across the plane upward in the vertical direction $y \in (-\infty, \infty)$.</p> <div class="images"> <img src="/blog/assets/posts/tda-digits/betti-barcodes.svg" /> <div class="label"> <strong>Figure 5:</strong> Result of applying this technique on <strong>(A)</strong> an image of a handwritten digit 8, <strong>(B)</strong> extracted embedded graph and <strong>(C)</strong> the resulting Betti barcodes at $y = 22$. </div> </div> <p>The Betti 0 barcode $\beta_{0}$ consists of one interval $[2, \infty)$, which clearly shows the single connected component with the birth time of 2, when the first vertex of $G$ is added.</p> <p>The Betti 1 barcode $\beta_{1}$ consists of two intervals $[10, \infty)$ and $[18, \infty)$, with birth times of 10 and 18 corresponding to the births of the two cycles. The birth time of a cycle is the value of $y$ at which the loop closes in the embedded graph $G$ as we sweep to the top.</p> <p>Finally, we extract features from each $k$-dimensional Betti barcode as follows. Denote the endpoints of the barcode intervals by</p> $x_{1}, y_{1}, ..., x_{n}, y_{n}$ <p>where $x_{i}$ represents the beginning and $y_{i}$ the end of each interval. From the endpoints, we compute 4 <em>features</em> based on the invariants discussed in [<a href="https://arxiv.org/abs/1304.0530" target="_blank">1</a>], which take into account all of the bars' lengths and endpoints:</p> \begin{align*} f_{1}&amp;=\sum_{i} x_{i} (y_{i} - x_{i}) \\ f_{2}&amp;=\sum_{i} (y_{\max} - y_{i})(y_{i} - x_{i}) \\ f_{3}&amp;=\sum_{i} (y_{\max} - y_{i})^{2} (y_{i} - x_{i})^{4} \\ f_{4}&amp;=\sum_{i} x_{i}^{2} (y_{i} - x_{i})^{4} \\ \end{align*} <p>For each of the 4 sweep directions (top, bottom, right, left) and for each $k$-dimensional Betti barcode, $k = 0, 1$, we compute the defined 4 features $f_{1}, \dots, f_{4}$.
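</p> <p>In code, the four invariants are a direct transcription of the formulas above (my own sketch; it assumes any infinite endpoints have already been truncated to the end of the sweep, so every interval is finite):</p>

```python
def barcode_features(intervals, y_max):
    """Compute f1..f4 from barcode endpoint pairs (x_i, y_i).

    Assumption (mine): infinite endpoints are truncated to the sweep's
    end value beforehand, so every interval is finite.
    """
    f1 = sum(x * (y - x) for x, y in intervals)
    f2 = sum((y_max - y) * (y - x) for x, y in intervals)
    f3 = sum((y_max - y) ** 2 * (y - x) ** 4 for x, y in intervals)
    f4 = sum(x ** 2 * (y - x) ** 4 for x, y in intervals)
    return [f1, f2, f3, f4]

print(barcode_features([(1, 3), (2, 4)], y_max=4))
# → [6, 2, 16, 80]
```

<p>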
This gives us a total of $4 \cdot 2 \cdot 4 = 32$ features per image.</p> <h2 id="empirical-results">Empirical results</h2> <p>The data set consisted of the extracted topological features of 10000 images of handwritten digits from the MNIST database. It was split 50:50 into a train and a test set, so that each had 5000 examples. An SVM with an RBF kernel was used for classification of the images based on the extracted topological features. The empirical classification results are as follows. Accuracy on the train set using 10-fold cross-validation was $0.88 (\pm 0.05)$. Accuracy on the test set was 0.89.</p> <p>We examine the common misclassifications.</p> <div class="images"> <img src="/blog/assets/posts/tda-digits/miss-2.png" width="500" /> <div class="label"> <strong>Figure 6:</strong> Examples of number 2 being mistaken for number 0. </div> </div> <p>There were 3 examples of the number 2 being mistaken for number 0, shown in <strong>Figure 6</strong>. The reason is that the number 2 was written with a loop that appears in a region close to the loop in number 0.</p> <div class="images"> <img src="/blog/assets/posts/tda-digits/miss-5.png" width="500" /> <div class="label"> <strong>Figure 7:</strong> Examples of number 5 being mistaken for number 2. </div> </div> <p>For number 5 we got the lowest F1 score of 0.75. It was misclassified as number 2 in 32 examples in the test set. The first three examples are shown in <strong>Figure 7</strong>. This was expected, since these two numbers are topologically the same, with no topological features (e.g. loops) appearing in different regions.</p> <div class="images"> <img src="/blog/assets/posts/tda-digits/miss-8.png" width="500" /> <div class="label"> <strong>Figure 8:</strong> Examples of number 8 being mistaken for number 4. </div> </div> <p>Number 8 was misclassified as number 4 in 21 examples from the test set. The first three examples are shown in <strong>Figure 8</strong>.
We see the stylistic problems that caused the misclassifications. The top loop of number 8 was not closed, which made it topologically more similar to number 4 written with a loop.</p> <h2 id="source-code">Source code</h2> <p>The repository containing all Python scripts that I wrote for this tutorial, including the processed version of the data set, is available in the <a href="https://github.com/markolalovic/tda-digits" target="_blank">tda-digits</a> repository on GitHub.</p> <p>Its dependencies are:</p> <ul> <li>Python (2 or 3);</li> <li><a href="https://github.com/mrzv/dionysus" target="_blank">Dionysus 2</a> for computing persistent homology;</li> <li>Boost version 1.55 or higher for Dionysus 2;</li> <li>NumPy for loading data and computing;</li> <li>Scikit-learn for machine learning algorithms;</li> <li>Scikit-image for image pre-processing;</li> <li>Matplotlib for plotting;</li> <li>Networkx for plotting graphs.</li> </ul> <h2 id="references">References</h2> <p> A. Adcock, E. Carlsson, and G. Carlsson, “The Ring of Algebraic Functions on Persistence Bar Codes”, (2013), Homology, Homotopy and Applications <a target="_blank" href="https://arxiv.org/abs/1304.0530">https://arxiv.org/abs/1304.0530</a></p> <p> R. Bruel-Gabrielsson, B. J. Nelson, A. Dwaraknath, P. Skraba, L. J. Guibas, and G. Carlsson, “A Topology Layer for Machine Learning”, (2020), Proceedings of Machine Learning Research <a target="_blank" href="https://arxiv.org/pdf/1905.12200.pdf">https://arxiv.org/pdf/1905.12200.pdf</a></p> <p> G. Carlsson, “Topology and data”, (2009), Bulletin of the American Mathematical Society <a target="_blank" href="https://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/S0273-0979-09-01249-X.pdf">https://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/S0273-0979-09-01249-X.pdf</a></p> <p> Dmitriy Morozov, “Dionysus 2 documentation”.
<a target="_blank" href="https://mrzv.org/software/dionysus2/">https://mrzv.org/software/dionysus2/</a></p>Marko LalovicSome math behind water towers2020-06-15T20:26:36+00:002020-06-15T20:26:36+00:00https://markolalovic.com/blog/water-towers<div class="images"> <img src="/blog/assets/posts/water-towers/drawing.svg" /> <div class="label"> <strong>Figure 1:</strong> Two examples of arrangements of metal rings. </div> </div> <p>The structure of water towers reveals some math behind how they are built. <strong>Figure 1</strong> shows an illustration of two water towers reinforced with metal rings to hold the cylindrical structure of the water tanks together. Two examples of arrangements are shown. In <strong>Figure 1B</strong>, the metal rings are closer to each other at the bottom of the tank, where the pressure that the water exerts on the tank is the highest. But what is the optimal arrangement?</p> <p>Since I did not find a derivation anywhere on the internet, I will describe my solution here with this simple example.</p> <h2 id="example">Example</h2> <p>Say we have a wooden water tower that we want to reinforce with $K=5$ equal metal rings so that the wood does not give in to the water pressure. Let $d=3$ be the depth of the water tank.
Denote by $h_{k}$, for $k = 1, \dots, 5$, the height of each metal ring from the bottom of the tank.</p> <p><strong>Figure 1A</strong> shows a naïve approach, where the metal rings are spaced evenly at heights from the bottom of the tank</p> $h_{1} = 0.5, \quad h_{2} = 1, \quad h_{3} = 1.5, \quad h_{4} = 2, \quad h_{5} = 2.5$ <p>The pressure that the water exerts increases with depth and the metal rings at the bottom of the tank have to hold a lot more pressure than the metal rings at the top of the tank. Therefore the structural strength of the lower part of the tank is lower than that of the upper part of the tank. Hence, this arrangement of metal rings is a poor choice.</p> <h3 id="optimal-arrangement">Optimal Arrangement</h3> <p>To derive the optimal arrangement of the metal rings shown in <strong>Figure 1B</strong>, let’s assume that the water temperature in the tank is constant and the same everywhere. Also, the influence of depth on water density is negligible. Then the water pressure $p$ at depth $x$ is equal to</p> $p(x) = p_{0} + \rho \cdot g \cdot x$ <p>where $p_{0}$ is the pressure at the surface, $\rho$ is the density of water and $g$ is the gravitational acceleration. The important part to notice for this derivation is that the pressure exerted by the water itself increases linearly with depth</p> $p(x) - p_{0} = C \cdot x$ <p>for some constant $C &gt; 0$. We choose the constant $C = 2/9$, so that the resulting function integrates to one over the tank depth, and define the density</p> $f_{X}(x) = \frac{2}{9} \cdot x$ <p>of a continuously distributed random variable $X$ that represents the pressure in the tank, defined on the interval $[0, 3]$.
The corresponding distribution function is then</p> $F_{X}(x) = \frac{1}{9} \cdot x^{2}$ <p>Define the limits of the interval $[0, 3]$</p> $x_{1} = 0, \quad x_{6} = 3$ <p>and divide the interval $[0, 3]$ into 5 subintervals with boundaries:</p> $x_{2} = x_{0.2}, \quad x_{3} = x_{0.4}, \quad x_{4} = x_{0.6}, \quad x_{5} = x_{0.8}$ <p>where $x_{q}$ are quantiles, that is</p> $F_{X}(x_{q}) = q$ <p>The boundaries are marked as dashed red lines in <strong>Figure 1B</strong>. On each subinterval, the integral of the density is equal to 0.2. This method, of dividing a continuous distribution into subintervals of equal probability mass, is called <em>equal frequency discretization</em>.</p> <p>Finally, we place the metal rings so that each ring sits at the centroid of the density $f_{X}$ on its subinterval $k$. In the depth variable, the centroid of subinterval $k$ is</p> $c_{k} = \frac{\int_{x_{k}}^{x_{k+1}} x f_{X}(x) dx}{\int_{x_{k}}^{x_{k+1}} f_{X}(x) dx}$ <p>and the ring is placed at height $d - c_{k}$ from the bottom of the tank. This way, each metal ring needs to hold the same share of the pressure. Therefore the structural strength is the same everywhere and this arrangement of metal rings is optimal.</p> <p>In the optimal arrangement, the calculated heights of the metal rings from the bottom of the tank, sorted in increasing order, are approximately</p> $h_{1} = 0.16, \quad h_{2} = 0.49, \quad h_{3} = 0.88, \quad h_{4} = 1.36, \quad h_{5} = 2.11$ <p>The solution is shown in <strong>Figure 1B</strong>.</p> <div class="images"> <img src="/blog/assets/posts/water-towers/water-towers.jpg" /> <div class="label"> <strong>Figure 2:</strong> Water towers in New York. Image credit: © <a target="_blank" href="https://www.flickr.com/people/38782010@N00">takomabibelot</a> licensed <a target="_blank" href="https://creativecommons.org/licenses/by/2.0/">CC-BY 2.0</a>. </div> </div> <p>Using this procedure, we can compute the optimal arrangement for any number of metal rings.
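</p> <p>The computation can be reproduced in a few lines (my own sketch; the quantile boundaries and centroids have closed forms because $F_{X}(x) = x^{2}/d^{2}$):</p>

```python
def ring_heights(K, d):
    """Optimal ring heights (from the bottom) for a tank of depth d.

    Pressure density in the depth variable x: f(x) = 2x/d^2, so the
    quantile boundaries are x_q = d * sqrt(q); each ring sits at the
    centroid of f on its subinterval, converted from depth to height.
    """
    cuts = [d * (k / K) ** 0.5 for k in range(K + 1)]  # equal-frequency cuts
    heights = []
    for a, b in zip(cuts, cuts[1:]):
        centroid = (2.0 / 3.0) * (b**3 - a**3) / (b**2 - a**2)  # depth centroid
        heights.append(d - centroid)  # convert depth to height
    return sorted(heights)

print([round(h, 2) for h in ring_heights(5, 3)])
# → [0.16, 0.49, 0.88, 1.36, 2.11]
```

<p>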
For example, in <strong>Figure 2</strong>, the water tower in the front uses 11 metal rings.</p>Marko LalovicRidge regression2019-08-01T09:31:22+00:002019-08-01T09:31:22+00:00https://markolalovic.com/blog/ridge-regression<div class="images"> <img src="/blog/assets/posts/ridge-regression/problem.svg" /> <div class="label"> <strong>Figure 1:</strong> As the correlation between regressors increases the OLS method becomes unstable while the ridge regression method produces stable estimates regardless of the given data in $X$. </div> </div> <p>The ordinary least squares (OLS) method is not suitable to estimate the unknown parameters $\beta$ in the case of highly correlated regressors. As the correlation between regressors in $X$ increases, the OLS method becomes unstable. In the limit $|corr(x_{i}, x_{j})| \rightarrow 1$, the OLS objective function is no longer strictly convex and there are infinitely many solutions of the OLS problem. The matrix $X^{T}X$ becomes singular and both the variance of the estimator and the distance of the estimator to the actual $\beta$ go to infinity; see my previous <a href="https://markolalovic.com/blog/ols-regression">post</a> for more on this.</p> <p>Here, after introducing penalized regression, we derive the ridge regression estimator. Ridge regression is an effective approach to solve such problems. We show that, regardless of the data $X$, a unique solution to the ridge regression problem always exists. By adding the ridge (a vector of $\alpha$’s) on the diagonal of $X^{T}X$, the ridge regression method produces stable estimates of the coefficients in $\beta$.
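</p> <p>A minimal numerical sketch of this instability (my own toy example, not from the post): two almost perfectly correlated regressors blow up the OLS coefficients, while the ridge estimate stays small.</p>

```python
import numpy as np

# Toy data (my own example): the second column is an almost exact copy of
# the first, so X^T X is nearly singular.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0],
              [4.0, 4.0 + 1e-8]])
y = np.array([1.0, 2.0, 3.0, 4.1])
alpha = 1.0

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]            # huge, unstable
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(2),  # ridge on the diagonal
                             X.T @ y)

print(np.linalg.norm(beta_ols) > 1e3)    # True: the OLS coefficients explode
print(np.linalg.norm(beta_ridge) < 1.0)  # True: the ridge estimate stays small
```

<p>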
See <strong>Figure 1</strong> for illustration.</p> <p>We illustrate the method on a simple example in R, explain the role of the penalty function and finish with an analysis of the regularization parameter $\alpha$.</p> <h2 id="penalized-regression">Penalized regression</h2> <p>In <em>penalized regression</em>, for $n &gt; p$ and given $X: \mathbb{R}^{p} \rightarrow \mathbb{R}^{n}$ and $y \in \mathbb{R}^{n}$, we minimize the functional</p> $\begin{equation} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} J_{\alpha}(\beta) = \norm{ y - X\beta }_{2}^{2} + \alpha P(\beta) \end{equation}$ <p>over $\beta \in \mathbb{R}^{p}$, where</p> <ul> <li>$J_{\alpha}: \mathbb{R}^{p} \rightarrow \mathbb{R}$ is the objective function;</li> <li>$P: \mathbb{R}^{p} \rightarrow \mathbb{R}$ is a <em>penalty function</em> that penalizes unrealistic values in $\beta$;</li> <li>The parameter $\alpha &gt; 0$ controls the trade-off between the penalty and the goodness of fit (the loss term).</li> </ul> <p>The main idea that determines the choice of the penalty function is that we would prefer a simple model to a more complex one.</p> <p>There are many different possibilities for the penalty function $P$. For example, if we want a smoother fit, then $P$ is a measure of the curvature.</p> <p>In the case of correlated regressors, the estimated coefficients can become too large, and $P$ is a measure of the distance of the coefficients from the origin.
In this case, the main penalty function to consider is</p> $\begin{equation} P(\beta) = \norm{\beta}_{2}^{2} \end{equation}$ <p>This type of penalized regression is called <em>Ridge regression</em>; see the original paper [<a href="https://www.nrs.fs.fed.us/pubs/rn/rn_ne236.pdf" target="_blank">1</a>].</p> <h2 id="derivation-of-ridge-regression-estimator">Derivation of ridge regression estimator</h2> <p>Here, in order to simplify the derivation, we will assume that $X: \mathbb{R}^{p} \rightarrow \mathbb{R}^{n}$ is linear and continuous with full column rank $p$.</p> <p>The objective function we want to minimize, written in matrix form, is</p> \begin{align} \norm{ y - X \beta }_{2}^{2} + \alpha \norm{\beta}_{2}^2 &amp;= (y - X \beta)^{T} (y - X \beta) + \alpha \beta^{T} \beta \nonumber \\[1em] &amp;= y^{T}y - 2 y^{T} X \beta + \beta^{T} X^{T}X \beta + \alpha \beta^{T} \beta \end{align} <p>By taking the partial derivative with respect to $\beta$ and setting it to zero \begin{equation} -2 X^{T} y + 2 X^{T} X \hat{\beta} + 2 \alpha \hat{\beta} = 0 \end{equation}</p> <p>we get the regularized normal equation \begin{equation} (X^{T}X + \alpha I) \hat{\beta} = X^{T}y \end{equation}</p> <p>so we can express $\hat{\beta}$ as \begin{equation} \hat{\beta} = (X^{T}X + \alpha I)^{-1} X^{T} y \end{equation}</p> <p>Since $\text{rank}(X) = p$ \begin{equation} X z \neq 0 \quad \text{for each} \quad z \neq 0 \end{equation}</p> <p>for the Hessian \begin{equation} 2X^{T}X + 2 \alpha I \end{equation}</p> <p>it holds that</p> \begin{align} 2 z^{T} X^{T} X z + 2 \alpha z^{T} z &amp;= 2 (Xz)^{T} (Xz) + 2 \alpha z^{T} z \nonumber \\[1em] &amp;= 2 \norm{Xz}_{2}^{2} + 2 \alpha \norm{z}_{2}^{2} &gt; 0 \quad \text{for each} \quad z \neq 0 \label{eq: positive-definite} \end{align} <p>therefore the Hessian is positive definite, the stationary point is the unique minimizer, and we obtain the ridge regression estimator</p> <p>\begin{equation} \hat{\beta}_{\text{RR}} = (X^{T}X + \alpha I)^{-1} X^{T} y.
\label{eq: rr} \end{equation}</p> <h3 id="example">Example</h3> <p>We illustrate the ridge regression method for estimating the unknown parameters $\beta$ in the presence of correlated regressors on a simple example in R.</p> <p>Suppose we have a model $y \sim \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2}$; more specifically, let</p> <p>\begin{equation} \beta_{0} = 3, \quad \beta_{1} = \beta_{2} = 1, \end{equation}</p> <p>and let the sample contain 100 elements</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">100</span><span class="w"> </span></code></pre></div></div> <p>We then introduce some highly correlated regressors</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w"> </span><span class="n">x1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="n">x2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>with correlation coefficient almost 1</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span
class="n">cor</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">x2</span><span class="p">)</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.999962365268769</span><span class="w"> </span></code></pre></div></div> <p>into the model</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x2</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>and calculate the estimate $\hat{\beta}_{\text{RR}}$ for $\alpha = 0.3$</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">alpha</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0.3</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">int</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span 
class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">x2</span><span class="p">))</span><span class="w"> </span><span class="n">beta.ridge</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">xx</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="m">3</span><span class="p">))</span><span class="w"> </span><span class="nf">return</span><span class="p">(</span><span class="n">as.vector</span><span class="p">(</span><span class="n">xx</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="n">beta.ridge</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span 
class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">2.98537494896842</span><span class="w"> </span><span class="m">0.815120466450887</span><span class="w"> </span><span class="m">1.04146900239714</span><span class="w"> </span></code></pre></div></div> <h3 id="properties-of-ridge-regression-estimator">Properties of ridge regression estimator</h3> <ul> <li> <p>The unique solution \ref{eq: rr} of ridge regression $\hat{\beta}_{\text{RR}}$ always exists, since $X^{T}X + \alpha I$ always has full rank $p$.</p> </li> <li> <p>We can derive a relationship between the ridge and OLS estimators for the case when the matrix $X$ has orthonormal columns. Using $X^{T}X = I$ twice and since $\hat{\beta}_{\text{OLS}} = (X^{T}X)^{-1} X^{T} y$, we get the relation</p> </li> </ul> \begin{align} \hat{\beta}_{\text{RR}} &amp;= (X^{T}X + \alpha I)^{-1} X^{T} y \nonumber \\[1em] &amp;= (I + \alpha I)^{-1} X^{T} y \nonumber \\[1em] &amp;= (1 + \alpha)^{-1} I X^{T} y \nonumber \\[1em] &amp;= (1 + \alpha)^{-1} (X^{T}X)^{-1} X^{T} y \nonumber \\[1em] &amp;= (1 + \alpha)^{-1} \hat{\beta}_{\text{OLS}} \end{align} <ul> <li>The ridge regression estimator $\hat{\beta}_{\text{RR}}$ is biased since, for each value of $\alpha &gt; 0$, its expected value is not equal to $\beta$</li> </ul> \begin{align} \mathbb{E}[\hat{\beta}_{\text{RR}}] &amp;= \mathbb{E}[(X^{T}X + \alpha I)^{-1} X^{T} y] \nonumber \\[1em] &amp;= \mathbb{E}[(X^{T}X + \alpha I)^{-1} (X^{T}X) (X^{T}X)^{-1} X^{T} y] \nonumber \\[1em] &amp;= \mathbb{E}[(X^{T}X + \alpha I)^{-1} (X^{T}X) \hat{\beta}_{\text{OLS}}] \nonumber \\[1em] &amp;= (X^{T}X + \alpha I)^{-1} (X^{T}X) \mathbb{E}[\hat{\beta}_{\text{OLS}}] \nonumber \\[1em] &amp;= (X^{T}X + \alpha I)^{-1} (X^{T}X) \beta.
\end{align} <ul> <li>Also, as $\alpha \rightarrow 0$, the ridge estimator tends to the OLS estimator, which can easily be seen from</li> </ul> \begin{align} \lim_{\alpha \to 0} \hat{\beta}_{\text{RR}} &amp;= \lim_{\alpha \to 0} (X^{T}X + \alpha I)^{-1} (X^{T}X) \hat{\beta}_{\text{OLS}} \nonumber \\[1em] &amp;= (X^{T}X)^{-1} (X^{T}X) \hat{\beta}_{\text{OLS}} \nonumber \\[1em] &amp;= \hat{\beta}_{\text{OLS}} \end{align} <h2 id="the-role-of-the-penalty-function">The role of the penalty function</h2> <p>The role of the penalty function can be shown conveniently with the help of the singular value decomposition. Let</p> <p>\begin{equation} X = U \Sigma V^{T} \end{equation}</p> <p>be the singular value decomposition of $X$ where $\Sigma$ contains all the singular values</p> <p>\begin{equation} \sigma_{1} \geq \sigma_{2} \geq \dots \geq \sigma_{p} &gt; 0 \end{equation}</p> <p>The regularized normal equation \begin{align} ( X^{T} X + \alpha I ) \hat{\beta} = X^{T} y \end{align}</p> <p>can be rewritten as</p> \begin{align} (V \Sigma^{T} U^{T}U \Sigma V^{T} + \alpha I) \hat{\beta} = V \Sigma^{T} U^{T} y \end{align} <p>Then, since $U^{T}U = I$ and $VV^{T} = I$, we get</p> \begin{align} (V \Sigma^{T} \Sigma V^{T} + \alpha VV^{T}) \hat{\beta} &amp;= V \Sigma^{T} U^{T} y \nonumber \\[1em] V (\Sigma^{T} \Sigma + \alpha I) V^{T} \hat{\beta} &amp;= V \Sigma^{T} U^{T} y \end{align} <p>Furthermore, multiplying it by $V^{T}$ from the left and setting $z := V^{T} \hat{\beta}$, we get \begin{equation} (\Sigma^{T} \Sigma + \alpha I) z = \Sigma^{T} U^{T} y \end{equation}</p> <p>Therefore</p> <p>\begin{equation} z_{i} = \frac{\sigma_{i} (u_{i}^{T} y)}{\sigma_{i}^{2} + \alpha} \quad \text{for} \quad i = 1, \dots, p \end{equation}</p> <p>(If $X$ had rank $r &lt; p$, the minimum norm solution would additionally set $z_{i} = 0$ for $i = r + 1, \dots, p$; here all singular values are positive.)</p> <p>Finally, from $\hat{\beta} = V z$ and since $V$ is orthogonal</p> $\begin{equation} \norm{\hat{\beta}} = \norm{VV^{T} \hat{\beta}} =
\norm{V^{T}\hat{\beta}} = \norm{z} \end{equation}$ <p>we get</p> $\begin{equation} \hat{\beta}_{i} = \frac{\sigma_{i} (u_{i}^{T} y)}{\sigma_{i}^{2} + \alpha} v_{i} \label{eq: beta_i} \end{equation}$ <p>And from</p> $\begin{equation} \hat{\beta}_{i} \approx \begin{cases} 0, &amp; \text{if } \sigma_{i} \ll \alpha \\ \frac{u_{i}^{T} y}{\sigma_{i}}v_{i}, &amp; \text{if } \sigma_{i} \gg \alpha \end{cases} \end{equation}$ <p>we can see that the penalty function $\alpha \norm{\beta}_{2}^{2}$ acts as a filter since the contributions</p> <ul> <li>from $\sigma_{i}$ that are small relative to the regularization parameter $\alpha$ are almost eliminated;</li> <li>from $\sigma_{i}$ that are large relative to the regularization parameter $\alpha$ are left almost unchanged.</li> </ul> <p>By defining a filter</p> $\begin{equation} F_{\alpha}(\xi) = \frac{1}{\xi + \alpha} \end{equation}$ <p>the solution of ridge regression can be further expressed as</p> $\begin{equation} \hat{\beta}_{\text{RR}} = F_{\alpha}(X^{T}X) X^{T}y \end{equation}$ <h2 id="regularization-parameter-alpha">Regularization parameter $\alpha$</h2> <p>First, we notice that the norm of the ridge regression solution is monotonically decreasing in $\alpha$.</p> <p>To show this, let</p> $\begin{equation} \psi(\alpha) = \norm{\hat{\beta}_{\text{RR}}}_{2}^{2} \end{equation}$ <p>Then, from the derived equation for $\hat{\beta}_{i}$ in \ref{eq: beta_i}, and since the vectors $v_{i}$ are orthonormal, we have that</p> $\begin{equation} \psi(\alpha) = \sum_{i = 1}^{p} \frac{\sigma_{i}^{2} (u_{i}^{T} y)^{2}}{ (\sigma_{i}^{2} + \alpha)^{2} } \end{equation}$ <p>and taking the derivative with respect to $\alpha$</p> $\begin{equation} \psi'(\alpha) = -2 \sum_{i = 1}^{p} \frac{\sigma_{i}^{2} (u_{i}^{T} y)^{2}}{ (\sigma_{i}^{2} + \alpha)^{3} } &lt; 0 \end{equation}$ <p>Second, as $\alpha \rightarrow \infty$ the solution of ridge regression goes to $\boldsymbol{0}$, since</p> $\begin{equation} \lim_{\alpha \rightarrow \infty} \psi(\alpha) = \lim_{\alpha \rightarrow \infty} \sum_{i
= 1}^{p} \frac{\sigma_{i}^{2} (u_{i}^{T} y)^{2}}{ (\sigma_{i}^{2} + \alpha)^{2} } = 0 \end{equation}$ <p>In the limit $\alpha \rightarrow 0$, as shown before, the solution of ridge regression goes to the ordinary least squares solution. Furthermore, in this limit, if $\sigma_{p} \rightarrow 0$, so that $X$ is no longer of full column rank, then $\psi(\alpha) \rightarrow \infty$.</p> <p>We can plot how the estimates $\beta_{0}, \beta_{1}, \beta_{2}$ change depending on the value of the parameter $\alpha$ for the data of the <a href="#example">Example</a>; shown in <strong>Figure 2</strong> below</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">alphas</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">-5</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">0.1</span><span class="p">))</span><span class="w"> </span><span class="n">betas</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">alphas</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">alpha</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">beta.ridge</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="p">})</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">latex2exp</span><span class="p">)</span><span class="w"> </span><span class="c1"># 
for annotation</span><span class="w"> </span><span class="n">plot</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">alphas</span><span class="p">),</span><span class="w"> </span><span class="n">betas</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="o">=</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="o">=</span><span class="n">TeX</span><span class="p">(</span><span class="n">r</span><span class="s1">'($\log(\alpha)$)'</span><span class="p">),</span><span class="w"> </span><span class="n">ylab</span><span class="o">=</span><span class="n">TeX</span><span class="p">(</span><span class="n">r</span><span class="s1">'($\hat{\beta}$)'</span><span class="p">),</span><span class="w"> </span><span class="n">cex.lab</span><span class="o">=</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="o">=</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="o">=</span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">cex.sub</span><span class="o">=</span><span class="m">1.5</span><span class="p">)</span><span class="w"> </span><span class="n">lines</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">alphas</span><span 
class="p">),</span><span class="w"> </span><span class="n">betas</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="o">=</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"blue"</span><span class="p">)</span><span class="w"> </span><span class="n">lines</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">alphas</span><span class="p">),</span><span class="w"> </span><span class="n">betas</span><span class="p">[</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="o">=</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="o">=</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="s2">"black"</span><span class="p">)</span><span class="w"> </span><span class="n">legend</span><span class="p">(</span><span class="m">7.73</span><span class="p">,</span><span class="w"> </span><span class="m">3.12</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="n">r</span><span 
class="s1">'($\hat{\beta}_{1}(\alpha)$)'</span><span class="p">),</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="n">r</span><span class="s1">'($\hat{\beta}_{2}(\alpha)$)'</span><span class="p">),</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="n">r</span><span class="s1">'($\hat{\beta}_{3}(\alpha)$)'</span><span class="p">)),</span><span class="w"> </span><span class="n">col</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"black"</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="o">=</span><span class="nf">rep</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">lty</span><span class="o">=</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">1.5</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <div class="images"> <img src="/blog/assets/posts/ridge-regression/ridge-solution-path.svg" /> <div class="label"> <strong>Figure 2:</strong> The solution of ridge regression as a function of the regularization parameter $\alpha$. </div> </div> <p>The selection of $\alpha$ is usually done by cross-validation. This means that we randomly partition the data into $K$ equally sized sets. For some value of $\alpha$ we then build a model (calculate estimates for the coefficients) on the data from $K - 1$ sets (learning set) and test it on the rest of the data (test set). As the measure of fit on the test set, we use the mean squared error (MSE). 
We repeat this for each of the $K$ folds and average the test MSE over the folds. We then repeat the process for the remaining values of $\alpha$ and select the value of $\alpha$ for which this average is the smallest. Typical values for $K$ are $5$, $10$, and $n$ (the sample size).</p> <p>Let’s find the optimal value of the parameter $\alpha$ for the data of the <a href="#example">Example</a> using 10-fold cross-validation</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">K</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="n">folds</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cut</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">x</span><span class="p">)),</span><span class="w"> </span><span class="n">breaks</span><span class="o">=</span><span class="n">K</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="n">cv.matrix</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="o">=</span><span class="n">K</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="o">=</span><span class="nf">length</span><span class="p">(</span><span class="n">alphas</span><span class="p">))</span><span class="w"> </span><span class="n">mse</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">b</span><span class="p">,</span><span class="w"> 
</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nf">return</span><span class="p">(</span><span class="m">1</span><span class="o">/</span><span class="nf">length</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">sum</span><span class="p">((</span><span class="n">y</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">b</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">k</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">K</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">test.i</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">folds</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">k</span><span class="p">)</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">alphas</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span 
class="n">br</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">beta.ridge</span><span class="p">(</span><span class="n">alphas</span><span class="p">[</span><span class="n">j</span><span class="p">],</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="o">-</span><span class="n">test.i</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">y</span><span class="p">[</span><span class="o">-</span><span class="n">test.i</span><span class="p">])</span><span class="w"> </span><span class="n">cv.matrix</span><span class="p">[</span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mse</span><span class="p">(</span><span class="n">br</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="n">test.i</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">y</span><span class="p">[</span><span class="n">test.i</span><span class="p">])</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="n">avgs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">cv.matrix</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span><span class="w"> </span><span class="n">best.alpha</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">alphas</span><span class="p">[</span><span class="n">avgs</span><span class="w"> </span><span 
class="o">==</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">avgs</span><span class="p">)]</span><span class="w"> </span><span class="n">best.alpha</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.246596963941606</span><span class="w"> </span></code></pre></div></div> <h2 id="references">References</h2> <p> Hilt, Donald E.; Seegrist, Donald W. 1977. Ridge: a computer program for calculating ridge regression estimates. Research Note NE-236. Upper Darby, PA: U.S. Department of Agriculture, Forest Service, Northeastern Forest Experiment Station. 7p. <a target="_blank" href="https://www.nrs.fs.fed.us/pubs/rn/rn_ne236.pdf">https://www.nrs.fs.fed.us/pubs/rn/rn_ne236.pdf</a></p>Marko LalovicFigure 1: As the correlation between regressors increases the OLS method becomes unstable while the ridge regression method produces stable estimates regardless of the given data in $X$.Multicollinearity effect on OLS regression2019-07-14T20:41:01+00:002019-07-14T20:41:01+00:00https://markolalovic.com/blog/ols-regression<div class="images"> <img src="/blog/assets/posts/ols-regression/problem.svg" /> <div class="label"> <strong>Figure 1:</strong> As the correlation between regressors increases, the OLS method becomes unstable. In the limit $|corr(x_{i}, x_{j})| \rightarrow 1$, the OLS objective function is no longer strictly convex and the OLS solution is no longer unique. 
</div> </div> <p>In <em>ordinary least squares (OLS) regression</em>, for given $X: \mathbb{R}^{p} \rightarrow \mathbb{R}^{n}$ and $y \in \mathbb{R}^{n}$, we minimize, over $\beta \in \mathbb{R}^{p}$, the sum of squared residuals</p> $\newcommand{\norm}[1]{\left\lVert#1\right\rVert} \begin{equation} S(\beta) = \norm{ y - X\beta }_{2}^{2} \end{equation}$ <p>We first illustrate the problem with using the OLS method to estimate the unknown parameters $\beta$ in the case of highly correlated regressors, on a simple example in R.</p> <p>Suppose we have a model $y \sim \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2}$; more specifically, let</p> <p>\begin{equation} \beta_{0} = 3, \quad \beta_{1} = \beta_{2} = 1, \end{equation}</p> <p>and let the sample contain 100 elements</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">100</span><span class="w"> </span></code></pre></div></div> <p>We then introduce some highly correlated regressors</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w"> </span><span class="n">x1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="n">x2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span
class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">)</span><span class="w"> </span></code></pre></div></div> <p>with correlation coefficient almost 1</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cor</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">x2</span><span class="p">)</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.999962365268769</span><span class="w"> </span></code></pre></div></div> <p>Let’s run the OLS method 1000 times to get a sense of the effect of highly correlated regressors</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">intr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">x2</span><span class="p">))</span><span class="w"> </span><span class="n">nsim</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1000</span><span class="w"> </span><span class="n">betas</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">nsim</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span 
class="p">{</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x2</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="n">xx</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">beta.ols</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.vector</span><span class="p">(</span><span class="n">xx</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="nf">return</span><span class="p">(</span><span class="n">beta.ols</span><span class="p">)</span><span class="w"> </span><span class="p">})</span><span class="w"> </span></code></pre></div></div> <p>The estimator for $\beta$, obtained by the OLS method, is still unbiased</p> <div 
class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>round(apply(betas, 1, mean), 3)
[1] 2.996 1.006 0.993
</code></pre></div></div> <p>But the variance becomes too large</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>round(apply(betas, 1, sd), 3)
[1]  0.101 11.110 11.103
</code></pre></div></div> <p>The estimated coefficients can become far too large, and some can even have the wrong sign</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>round(betas[, 1], 3)
[1]  3.002 -7.673  9.529
</code></pre></div></div> <p>The problem can be seen by drawing a contour plot of the objective function $S(\beta)$</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># draw one sample of y for the contour plot
y &lt;- rnorm(n, mean = 3 + x1 + x2, sd = 1)
ssr &lt;- function(b) {
  return(sum((y - x %*% b)^2))
}
xlen &lt;- 100
ylen &lt;- 100
xgrid &lt;- seq(-10.1, 10.1, length.out = xlen)
ygrid &lt;- seq(-10.1, 10.1, length.out = ylen)
zvals &lt;- matrix(NA, ncol = xlen, nrow = ylen)
for (i in 1:xlen) {
  for (j in 1:ylen) {
    zvals[i, j] &lt;- ssr(c(3, xgrid[i], ygrid[j]))
  }
}
contour(x = xgrid, y = ygrid, z = zvals, levels = c(1e3, 3e3, 6e3, 1e4))
</code></pre></div></div> <p>As shown in <strong>Figure 1</strong>, as the correlation between the regressors increases, the matrix $X$ becomes nearly singular and the OLS method becomes unstable.</p> <p>In the limit</p> <p>\begin{equation} |corr(x_{i}, x_{j})| \rightarrow 1 \end{equation}</p> <p>the dimension of the column space decreases</p> <p>\begin{equation} \text{rank}(X) 
&lt; p \end{equation}</p> <p>the objective function $S(\beta)$ is no longer strictly convex, and there are infinitely many OLS solutions.</p> <p>To see this more intuitively: we would like to estimate the coefficient $\beta_{1}$ as the influence of $x_{1}$ on $y$, free of the influence of $x_{2}$. But since the regressors $x_{1}$ and $x_{2}$ are highly correlated, they vary together, and the coefficient $\beta_{1}$ is difficult to estimate.</p> <p>The OLS method does nothing to prevent these problems, as it only minimizes the sum of squared residuals, i.e. the objective function $S(\beta)$.</p> <h3 id="why-does-this-happen">Why does this happen?</h3> <p>In general, we have a linear model</p> <p>\begin{equation} \label{eq: model} y = X \beta + \epsilon \end{equation}</p> <p>and we assume the errors $\epsilon$ satisfy the Gauss–Markov assumption</p> <p>\begin{equation} \mathbb{E}[\epsilon \epsilon^{T}] = \sigma^{2} I. \end{equation}</p> <p>The OLS estimator is then</p> <p>\begin{equation} \label{eq: betahat} \hat{\beta} = (X^{T}X)^{-1} X^{T} y. \end{equation}</p> <p>From \ref{eq: model} and \ref{eq: betahat} we get</p> <p>\begin{equation} \hat{\beta} - \beta = (X^{T}X)^{-1} X^{T} \epsilon, \end{equation}</p> <p>so the covariance matrix of $\hat{\beta}$ is</p> \begin{align} \mathbb{E}[(\hat{\beta} - \beta) (\hat{\beta} - \beta)^{T}] \nonumber &amp;= \mathbb{E}[\left((X^{T}X)^{-1} X^{T} \epsilon\right) \left((X^{T}X)^{-1} X^{T} \epsilon\right)^{T}] \nonumber \\[1em] &amp;= \mathbb{E}[(X^{T}X)^{-1} X^{T} \epsilon \epsilon^{T} X (X^{T}X)^{-1}] \nonumber \\[1em] &amp;= (X^{T}X)^{-1} X^{T} \mathbb{E}[\epsilon \epsilon^{T}] X (X^{T}X)^{-1} \nonumber \\[1em] &amp;= \sigma^{2} (X^{T}X)^{-1} \label{eq: variance} \end{align} <p>where we used that the matrix $X^{T}X$ is symmetric, and assumed that $X$ is non-stochastic and independent of $\epsilon$. 
The variance of the coefficient $\hat{\beta}_{k}$ is the $(k, k)$-th element of this covariance matrix.</p> <p>The expected squared distance between the estimator $\hat{\beta}$ and the true $\beta$ is</p> \begin{align} \mathbb{E}[(\hat{\beta} - \beta)^{T} (\hat{\beta} - \beta)] \nonumber &amp;= \mathbb{E}[((X^{T}X)^{-1} X^{T} \epsilon)^{T} ((X^{T}X)^{-1} X^{T} \epsilon)] \nonumber \\[1em] &amp;= \mathbb{E}[\epsilon^{T} X (X^{T}X)^{-1} (X^{T}X)^{-1} X^{T} \epsilon] \nonumber \\[1em] &amp;= \mathbb{E}[\text{tr}(\epsilon^{T} X (X^{T}X)^{-1} (X^{T}X)^{-1} X^{T} \epsilon)] \nonumber \\[1em] &amp;= \mathbb{E}[\text{tr}(\epsilon \epsilon^{T} X (X^{T}X)^{-1} (X^{T}X)^{-1} X^{T})] \nonumber \\[1em] &amp;= \sigma^{2} \text{tr}(X (X^{T}X)^{-1} (X^{T}X)^{-1} X^{T}) \nonumber \\[1em] &amp;= \sigma^{2} \text{tr}(X^{T} X (X^{T}X)^{-1} (X^{T}X)^{-1}) \nonumber \\[1em] &amp;= \sigma^{2} \text{tr}((X^{T}X)^{-1}) \label{eq: distance} \end{align} <p>where we used that this distance is a scalar and therefore equal to its own trace, and that the trace is invariant under cyclic permutations</p> $\text{tr}(ABCD) = \text{tr}(DABC)$ <p>From \ref{eq: variance} and \ref{eq: distance} we see that both the variance of the estimator and its expected squared distance from the true $\beta$ depend on the matrix $(X^{T}X)^{-1}$.</p> <p>The reason why both become large can be shown conveniently with the help of the singular value decomposition. 
Let</p> <p>\begin{equation} X = U \Sigma V^{T} \end{equation}</p> <p>be the singular value decomposition of $X$, where $\Sigma$ contains the singular values</p> <p>\begin{equation} \sigma_{1} \geq \sigma_{2} \geq \dots \geq \sigma_{p} &gt; 0. \end{equation}</p> <p>Then, using the orthogonality of $V$,</p> \begin{align} (X^{T}X)^{-1} &amp;= (V \Sigma^{T} \Sigma V^{T})^{-1} \nonumber \\[1em] &amp;= (V^{T})^{-1} (\Sigma^{T} \Sigma)^{-1} V^{-1} \nonumber \\[1em] &amp;= V (\Sigma^{T} \Sigma)^{-1} V^{T} \nonumber \\[1em] &amp;= \sum_{j = 1}^{p} \frac{1}{\sigma_{j}^{2}} v_{j} v_{j}^{T} \label{eq: svd} \end{align} <p>In the limit</p> <p>\begin{equation} | corr(x_{i}, x_{j}) | \rightarrow 1 \end{equation}</p> <p>the matrix $X$ becomes singular and the smallest singular value vanishes</p> <p>\begin{equation} \sigma_{p} \rightarrow 0 \end{equation}</p> <p>and, from \ref{eq: svd}, the elements of $(X^{T}X)^{-1}$ blow up; in particular</p> <p>\begin{equation} \text{tr}\left((X^{T}X)^{-1}\right) = \sum_{j = 1}^{p} \frac{1}{\sigma_{j}^{2}} \rightarrow \infty. \end{equation}</p> <p>Therefore, from \ref{eq: variance} and \ref{eq: distance}, both the variance of the OLS estimator and its distance from the true $\beta$ go to infinity.</p>
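<p>As a numerical sanity check of \ref{eq: variance} (a self-contained sketch that re-creates the simulation from the example above, with $\sigma = 1$), the standard deviations observed across simulations should match the theoretical values $\sqrt{\text{diag}\left((X^{T}X)^{-1}\right)}$:</p>

```r
# Re-create the simulation from the example above (sigma = 1)
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n, mean = x1, sd = 0.01)
x  <- as.matrix(cbind(intr = 1, x1, x2))

xx <- solve(t(x) %*% x)  # (X^T X)^{-1}
betas <- sapply(1:1000, function(i) {
  y <- rnorm(n, mean = 3 + x1 + x2, sd = 1)
  as.vector(xx %*% t(x) %*% y)
})

# Theoretical sd from (eq: variance) vs. sd observed across simulations
round(rbind(theory = sqrt(diag(xx)), simulated = apply(betas, 1, sd)), 3)
```

<p>The two rows agree up to simulation noise.</p>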
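<p>The blow-up in \ref{eq: svd} can also be observed directly (again a self-contained sketch; the noise level s below is a knob introduced here to control how correlated the regressors are): as the correlation between $x_{1}$ and $x_{2}$ approaches 1, the smallest singular value of $X$ shrinks and $\text{tr}\left((X^{T}X)^{-1}\right) = \sum_{j} 1/\sigma_{j}^{2}$ explodes:</p>

```r
# As the noise added to x2 shrinks, corr(x1, x2) -> 1,
# the smallest singular value of X -> 0, and
# tr((X^T X)^{-1}) = sum_j 1 / sigma_j^2 blows up, as in (eq: svd)
set.seed(42)
n  <- 100
x1 <- rnorm(n)
for (s in c(1, 0.1, 0.01, 0.001)) {
  x2 <- rnorm(n, mean = x1, sd = s)
  sv <- svd(cbind(1, x1, x2))$d  # singular values of X
  cat(sprintf("s = %5.3f  cor = %.6f  sigma_min = %8.5f  tr = %14.2f\n",
              s, cor(x1, x2), min(sv), sum(1 / sv^2)))
}
```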