Direction of maximum variance

Let’s say we have a set of datapoints for which we wish to obtain a “compressed” representation using a line. The line that can best act as a representative for these data points must have the minimum reconstruction error.

For any datapoint $x_{i}$ and a representative line along the unit vector $w$, the representation of $x_{i}$ on $w$ is the projection of $x_{i}$ on $w$, i.e. $(x_{i}^{T} w) w$.
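As a small illustration of the projection formula, here is a minimal NumPy sketch. The function name `project_onto_line` and the sample vectors are made up for this example, and the function normalizes $w$ first because $(x_{i}^{T} w) w$ is the projection only when $w$ has unit length.

```python
import numpy as np

def project_onto_line(x, w):
    """Projection of x onto the line spanned by w (w is normalized first)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)      # make w a unit vector
    return (x @ w) * w             # (x^T w) w

x = np.array([3.0, 4.0])
w = np.array([1.0, 1.0])
print(project_onto_line(x, w))     # the point on the line closest to x
```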

To find such a $w$, we wish to obtain $\arg\min_{w} \text{Error}(\text{Line}, \text{Dataset})$ where,

$$
\begin{aligned}
\text{Error}(\text{Line}, \text{Dataset}) &= \frac{1}{n} \sum_{i=1}^{n} \text{length}^{2}\left(x_{i} - (x_{i}^{T} w) w\right) \\
&= \frac{1}{n} \sum_{i=1}^{n} (x_{i} - (x_{i}^{T} w) w)^{T} (x_{i} - (x_{i}^{T} w) w) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left[x_{i}^{T} x_{i} - (x_{i}^{T} w)^{2} - (x_{i}^{T} w)^{2} + (x_{i}^{T} w)^{2} w^{T} w\right] \\
&= \frac{1}{n} \sum_{i=1}^{n} \left[x_{i}^{T} x_{i} - (x_{i}^{T} w)^{2}\right] \quad (\text{if } w \text{ is a unit vector})
\end{aligned}
$$
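The simplification can be checked numerically. The sketch below uses synthetic data and two hypothetical helper functions, one for the first line of the derivation and one for the last. They should agree whenever $w$ is a unit vector and disagree otherwise.

```python
import numpy as np

def reconstruction_error(X, w):
    """Mean squared distance between each row x_i and (x_i^T w) w."""
    coeffs = X @ w                          # x_i^T w for every datapoint
    residues = X - np.outer(coeffs, w)      # x_i - (x_i^T w) w
    return np.mean(np.sum(residues ** 2, axis=1))

def simplified_error(X, w):
    """Last line of the derivation: mean of x_i^T x_i - (x_i^T w)^2."""
    return np.mean(np.sum(X ** 2, axis=1) - (X @ w) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))               # synthetic data, one datapoint per row
w = rng.normal(size=3)
w /= np.linalg.norm(w)                      # the last step needs w^T w = 1

print(np.isclose(reconstruction_error(X, w), simplified_error(X, w)))
```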

In the error term above, the $x_{i}^{T} x_{i}$ term is independent of $w$, so dropping it does not change which $w$ is optimal. Thus the error can be re-written as,

$$
\text{Error}(\text{Line}, \text{Dataset}) = \frac{1}{n} \sum_{i=1}^{n} -(x_{i}^{T} w)^{2}
$$

Minimizing this quantity is the same as maximizing its negation, so the optimization problem becomes a maximization instead.

$$
\arg\max_{w} \text{Error}(\text{Line}, \text{Dataset})
$$

$$
\begin{aligned}
\text{Error}(\text{Line}, \text{Dataset}) &= \frac{1}{n} \sum_{i=1}^{n} (x_{i}^{T} w)^{2} \\
&= \frac{1}{n} \sum_{i=1}^{n} (w^{T} x_{i}) (x_{i}^{T} w) \\
&= \frac{1}{n} \sum_{i=1}^{n} w^{T} (x_{i} x_{i}^{T}) w \\
&= w^{T} \left(\frac{1}{n} \sum_{i=1}^{n} x_{i} x_{i}^{T}\right) w \\
&= w^{T} Σ_{x} w
\end{aligned}
$$

Where $Σ_{x}$ is the covariance matrix of the datapoints $x_{1}, x_{2}, \dots, x_{n}$, treated as samples of a random vector $x$. If we recall the properties of the covariance matrix, $w^{T} Σ_{x} w$ is the variance of the scalar $w^{T} x$, so the objective above is the variance of the data projected onto $w$.
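The chain of equalities can be sanity-checked on synthetic, centered data. In this sketch the data, the random seed, and the variable names are all illustrative. It compares the mean squared projection, the quadratic form $w^{T} Σ_{x} w$, and NumPy's variance of the projected data.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated synthetic data
X = X - X.mean(axis=0)                    # center every feature at 0
n = X.shape[0]

sigma_x = (X.T @ X) / n                   # (1/n) sum_i x_i x_i^T
w = rng.normal(size=4)
w /= np.linalg.norm(w)                    # unit vector

mean_sq_projection = np.mean((X @ w) ** 2)   # (1/n) sum_i (x_i^T w)^2
quadratic_form = w @ sigma_x @ w             # w^T Sigma_x w
projected_variance = np.var(X @ w)           # variance of the projections (divides by n)

print(mean_sq_projection, quadratic_form, projected_variance)
```

All three printed numbers should agree up to floating-point rounding.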

Also notice that the formula we arrive at, $\frac{1}{n} \sum_{i=1}^{n} x_{i} x_{i}^{T}$, matches the covariance formula for a random vector only when the mean is 0. This means that centering the data is essential for variance maximization.
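To see why centering matters, the following sketch builds synthetic data whose mean is deliberately far from 0 (an assumption made for illustration) and compares $\frac{1}{n} \sum x_{i} x_{i}^{T}$ with NumPy's covariance, before and after centering.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=1.0, size=(1000, 2))   # mean is far from 0
n = X.shape[0]

second_moment_raw = (X.T @ X) / n                    # formula applied to raw data
X_centered = X - X.mean(axis=0)
second_moment_centered = (X_centered.T @ X_centered) / n

# bias=True makes np.cov divide by n, matching the 1/n in the notes
covariance = np.cov(X, rowvar=False, bias=True)

print(np.allclose(second_moment_centered, covariance))   # centered data matches
print(np.allclose(second_moment_raw, covariance))        # raw data does not
```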

Thus the best representative line $w$ will be the one that captures the maximum variance in the dataset.

Residual Analysis

The residue left after determining a representation for $x_{i}$ need not be error. It can still hold some information crucial for a full reconstruction of the original dataset.

Some observations about the residues:

All residues are orthogonal to the first representation $w_{1}$.
Any line which minimizes the sum of errors w.r.t. the residues must also be orthogonal to $w_{1}$, because the residues themselves lie in the orthogonal complement of $w_{1}$.
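The first observation holds for any unit vector, which a short check makes concrete. In this sketch `residues` is a hypothetical helper and `w1` is an arbitrary unit vector standing in for the first representation.

```python
import numpy as np

def residues(X, w):
    """Rows x_i - (x_i^T w) w, for a unit vector w."""
    return X - np.outer(X @ w, w)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))               # synthetic datapoints
w1 = np.array([1.0, 2.0, 2.0]) / 3.0        # arbitrary unit vector playing the role of w_1

R = residues(X, w1)
print(np.allclose(R @ w1, 0.0))             # every residue has no component along w_1
```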

If $\{x_{1}, x_{2}, \dots, x_{n}\}$ are our original datapoints, the residues left after representing them using $w_{1}$ would be

$$
\{x_{1} - (x_{1}^{T} w_{1}) w_{1},\ x_{2} - (x_{2}^{T} w_{1}) w_{1},\ \dots,\ x_{n} - (x_{n}^{T} w_{1}) w_{1}\}
$$

We can then find another line $w_{2}$ in the orthogonal complement of $w_{1}$ which minimizes the sum of errors w.r.t. these residues and yields new residues. For $x_{1}$:

$$
\begin{aligned}
& \{x_{1} - (x_{1}^{T} w_{1}) w_{1} - ((x_{1} - (x_{1}^{T} w_{1}) w_{1})^{T} w_{2}) w_{2}\} \\
\Rightarrow\ & \{x_{1} - (x_{1}^{T} w_{1}) w_{1} - (x_{1}^{T} w_{2} - (x_{1}^{T} w_{1}) w_{1}^{T} w_{2}) w_{2}\} \\
\Rightarrow\ & \{x_{1} - (x_{1}^{T} w_{1}) w_{1} - (x_{1}^{T} w_{2}) w_{2}\} \quad (\because w_{1}^{T} w_{2} = 0)
\end{aligned}
$$
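The simplification relies only on $w_{1}$ and $w_{2}$ being orthonormal. As a rough check, the sketch below takes two orthonormal directions from a QR factorization of a random matrix (an arbitrary choice for illustration, not the actual PCA directions) and compares the nested form with the simplified one.

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=4)                          # a single synthetic datapoint

# Two orthonormal directions from the Q factor of a random 4x2 matrix
Q, _ = np.linalg.qr(rng.normal(size=(4, 2)))
w1, w2 = Q[:, 0], Q[:, 1]

r1 = x1 - (x1 @ w1) * w1                         # residue after the first round
r2_nested = r1 - (r1 @ w2) * w2                  # first line of the derivation
r2_simple = x1 - (x1 @ w1) * w1 - (x1 @ w2) * w2 # last line of the derivation

print(np.allclose(r2_nested, r2_simple))
```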

Thus the residue after $d$ rounds would be,

$$
x_{1} - \sum_{i=1}^{d} (x_{1}^{T} w_{i}) w_{i} = 0
$$

because after $d$ rounds, we would be left with $d$ orthonormal vectors, which form a basis spanning all of $R^{d}$.

Note: If the datapoints lie in a low-dimensional subspace of $R^{d}$, then the residues would become 0 in fewer than $d$ rounds.
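The round-by-round procedure and the note about low-dimensional data can be sketched as follows. This is an illustrative implementation under stated assumptions: it picks each direction as the top eigenvector of the residues' second-moment matrix (the result derived in the eigenvalue section of these notes), the function name `deflate` is made up, and the data is synthetic, living in a 2-dimensional subspace of $R^{5}$.

```python
import numpy as np

def deflate(X, rounds):
    """Repeatedly remove the direction that best fits the current residues.

    Returns the overall residue norm after each round and the directions used.
    """
    R = X.copy()
    norms, directions = [], []
    for _ in range(rounds):
        second_moment = (R.T @ R) / R.shape[0]
        _, eigvecs = np.linalg.eigh(second_moment)   # eigenvalues come back ascending
        w = eigvecs[:, -1]                            # eigenvector of the largest eigenvalue
        R = R - np.outer(R @ w, w)                    # subtract the projection onto w
        norms.append(np.linalg.norm(R))
        directions.append(w)
    return norms, np.array(directions)

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 5))   # rank-2 data in R^5

norms, W = deflate(X, rounds=5)
print(np.round(norms, 6))    # residue norm is numerically 0 from round 2 onward
```

Once the residues vanish, the remaining directions returned by `eigh` are arbitrary, so only the first two rows of `W` are meaningful here.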

For any $w \in R^{d}$ such that $\lVert w \rVert_{2}^{2} = 1$,

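$$
\begin{aligned}
\lVert x_{i} \rVert^{2} &= \lVert x_{i} - (x_{i}^{T} w) w \rVert^{2} + \lVert (x_{i}^{T} w) w \rVert^{2} \\
\Rightarrow \lVert x_{i} \rVert^{2} &= \underbrace{\lVert x_{i} - (x_{i}^{T} w) w \rVert^{2}}_{\text{Error}} + \underbrace{\lVert (x_{i}^{T} w) w \rVert^{2}}_{\text{Representation}}
\end{aligned}
$$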

We want this representation term to be as large as possible for a better fit.
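The decomposition is the Pythagorean theorem applied to the projection and the residue. As a short worked check, write $p_{i} = (x_{i}^{T} w) w$ and $r_{i} = x_{i} - p_{i}$ (shorthand introduced only for this step) and assume $w^{T} w = 1$:

$$
\begin{aligned}
r_{i}^{T} p_{i} &= (x_{i}^{T} w)(x_{i}^{T} w) - (x_{i}^{T} w)^{2} w^{T} w = 0 \\
\lVert x_{i} \rVert^{2} &= \lVert r_{i} + p_{i} \rVert^{2} = \lVert r_{i} \rVert^{2} + 2 r_{i}^{T} p_{i} + \lVert p_{i} \rVert^{2} = \lVert r_{i} \rVert^{2} + \lVert p_{i} \rVert^{2}
\end{aligned}
$$

Since $\lVert x_{i} \rVert^{2}$ does not depend on $w$, making the representation term larger makes the error term smaller by the same amount.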

Eigenvalues of the Covariance Matrix

The optimization problem discussed above can be solved using the min-max theorem (Courant-Fischer). This tells us that the unit vector $w_{1}$ maximizing $w^{T} Σ_{x} w$ is the eigenvector of $Σ_{x}$ corresponding to the largest eigenvalue of $Σ_{x}$.
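A numerical experiment can make this claim plausible, though it is not a proof. The sketch below uses synthetic data and compares the top eigenvector from `np.linalg.eigh` against many random unit vectors. None of them should achieve a larger value of $w^{T} Σ_{x} w$.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 3)) @ rng.normal(size=(3, 3))   # correlated synthetic data
X = X - X.mean(axis=0)
sigma_x = (X.T @ X) / X.shape[0]

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(sigma_x)
w1 = eigvecs[:, -1]
lambda_1 = eigvals[-1]

# Many random unit vectors as competing candidates
candidates = rng.normal(size=(10_000, 3))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
objective = np.einsum("ij,jk,ik->i", candidates, sigma_x, candidates)  # w^T Sigma_x w per row

print(objective.max() <= lambda_1 + 1e-9)    # no candidate beats the top eigenvector
```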

If $w_{1}$ is an eigenvector of $Σ_{x}$ with eigenvalue $λ_{1}$, we can say that

$$
\begin{aligned}
Σ_{x} w_{1} &= λ_{1} w_{1} \\
\Rightarrow w_{1}^{T} Σ_{x} w_{1} &= w_{1}^{T} λ_{1} w_{1} \\
\Rightarrow w_{1}^{T} Σ_{x} w_{1} &= λ_{1} \quad (\because w_{1}^{T} w_{1} = 1) \\
\Rightarrow λ_{1} &= w_{1}^{T} \left(\frac{1}{n} \sum_{i=1}^{n} x_{i} x_{i}^{T}\right) w_{1} \\
\Rightarrow λ_{1} &= \frac{1}{n} \sum_{i=1}^{n} (x_{i}^{T} w_{1})^{2}
\end{aligned}
$$

This is exactly the objective we maximized earlier when we rewrote the error as a variance. Thus we can say that the largest eigenvalue of the covariance matrix is exactly the variance of the data projected onto $w_{1}$.

Because the covariance matrix is symmetric, its eigenvectors can be chosen to be mutually orthogonal (eigenvectors belonging to distinct eigenvalues are automatically orthogonal). Thus the eigenvector $w_{2}$ corresponding to the second highest eigenvalue $λ_{2}$ lies in the orthogonal complement of $w_{1}$, and $λ_{2}$ is the variance of the data projected onto $w_{2}$.

In the same way, every eigenvalue of the covariance matrix is the variance of the data projected onto its corresponding eigenvector.
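Both statements, orthogonal eigenvectors and eigenvalues as projected variances, can be checked together. The following is a sketch on synthetic centered data, where the columns of `W` are the eigenvectors returned by `np.linalg.eigh`.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 4)) @ rng.normal(size=(4, 4))   # correlated synthetic data
X = X - X.mean(axis=0)
sigma_x = (X.T @ X) / X.shape[0]

eigvals, W = np.linalg.eigh(sigma_x)          # columns of W are the eigenvectors
projected = X @ W                              # column j: data projected onto w_j
projected_variances = projected.var(axis=0)    # variance per direction (divides by n)

print(np.allclose(W.T @ W, np.eye(4)))             # eigenvectors are orthonormal
print(np.allclose(projected_variances, eigvals))   # eigenvalue j = variance along w_j
```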

Rule of thumb for dimensions

The most common assumption is that any data point $x_{i}$ is made up of two components: the actual signal $s_{i}$ and the random noise $ϵ_{i}$. Because this noise is random in nature, it doesn't follow any particular pattern. It has no preferred direction, tends to be small in magnitude, and contributes a small and roughly equal amount of variance in every direction.

Thus by looking at the eigenvalues of our covariance matrix, we can say that the smaller eigenvalues may correspond to this random noise, and the eigenvectors corresponding to them are not needed for a good reconstruction of the original data. So instead of trying to capture all the variance in our original data, we can use only the top $k$ eigenvalues and their corresponding eigenvectors to capture some threshold fraction $t$ of the variance, typically 0.95.
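A toy simulation shows what this assumption looks like in the eigenvalues. Everything in the sketch is hypothetical: the signal has 2 latent factors embedded in $R^{5}$, and the noise is Gaussian with the same small standard deviation (0.1) in every direction.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, noise_std = 5000, 5, 0.1

# Hypothetical signal: 2 latent factors placed along orthonormal directions in R^5
basis, _ = np.linalg.qr(rng.normal(size=(d, 2)))           # d x 2, orthonormal columns
signal = (rng.normal(size=(n, 2)) * [3.0, 2.0]) @ basis.T
noise = noise_std * rng.normal(size=(n, d))                 # equal small variance everywhere

X = signal + noise
X = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh((X.T @ X) / n)[::-1]           # sorted in descending order

print(np.round(eigvals, 3))   # two large values, then three close to noise_std**2
```

In this setup the three trailing eigenvalues sit near the noise variance, so dropping their eigenvectors discards almost nothing but noise.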

We can identify our top $k$ directions by choosing the smallest $k$ such that

$$
\frac{\sum_{i=1}^{k} λ_{i}(Σ_{x})}{\sum_{i=1}^{d} λ_{i}(Σ_{x})} \geq 0.95
$$

Because all eigenvalues correspond to variance, which is non-negative, this ratio is non-decreasing in $k$.
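The rule of thumb translates into a few lines of NumPy. In this sketch the function name `choose_k` and the example eigenvalues are made up. It relies on the cumulative ratio being non-decreasing, which is what lets `np.searchsorted` find the first $k$ that reaches the threshold.

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.95):
    """Smallest k whose top-k eigenvalues reach `threshold` of the total variance."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # descending order
    ratios = np.cumsum(lam) / lam.sum()                          # non-decreasing in k
    k = int(np.searchsorted(ratios, threshold)) + 1              # first index with ratio >= threshold
    return min(k, len(lam))                                      # guard against rounding at the end

example_eigenvalues = [7.0, 2.0, 0.6, 0.3, 0.1]   # hypothetical spectrum, total variance 10
print(choose_k(example_eigenvalues))               # cumulative ratios: 0.70, 0.90, 0.96, ...
```

For this spectrum the first two directions explain 90% of the variance and the first three explain 96%, so three directions are kept at the 0.95 threshold.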

Let the green line represent $w_{1}$ and the blue line represent $w_{2}$. Notice how the variance of the data along $w_{1}$ is much larger than the variance of the data along $w_{2}$.
