Home > Misc $$\newcommand{\bm}{\mathbf} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov}$$

## Principal component analysis

I've dabbled in PCA from time to time, and I was familiar with the outline of the process: convert the data to z-scores, calculate the variance-covariance matrix, and then work with the first few eigenvectors of this matrix. But I'd never understood why it worked – it was totally mistifying that you'd create a matrix $$C_{ij}$$ with terms $$E[(X_i - E(X_i))(X_j - E(X_j))]$$, then solve for $$\lambda$$ and $$\bm{v}$$ in $$C \bm{v} = \lambda \bm{v}$$, and have those $$\bm{v}$$'s actually mean something.

Maybe this was all just staring at me in the Wikipedia article and it'd just never registered in my brain. In any case, the concepts behind PCA are not difficult (with one exception that I'll handwave over). But there are a lot of pieces to the full explanation, which probably explains why I'd never been totally satisfied reading other accounts – there was a little bit that I was having trouble with, which either authors skipped over or which I missed in amongst lots of basic things about eigenvalues that I didn't need to read carefully.

These notes are written largely for an imagined audience similar to a past me – comfortable with matrix algebra, but without solid stats training. I assume knowledge of variance and covariance; if you don't know what an eigenvector is, then you'll probably find my treatment a little quick (though you can safely assume that the computer can calculate them for you).

The main steps are shown as bullet points, with explanations that you can show by clicking or tapping the appropriate link.

Let $$X_i$$ be the $$N$$ random variables under consideration. Let $$C$$ be the variance-covariance matrix, i.e., $$C_{ij} = \Cov(X_i, X_j)$$. Let $$\bm{a}$$ be a vector of co-efficients, so that the linear combination of the random variables $$\sum_i a_i X_i$$ can be written as $$\bm{a}^T \bm{X}$$.

• The variance of that linear combination is $$\Var(\bm{a}^T \bm{X}) = \bm{a}^T C \bm{a}$$. Show.
• The covariance of a pair of linear combinations $$\bm{a}^T \bm{X}$$ and $$\bm{b}^T \bm{X}$$ is $$\Cov(\bm{a}^T \bm{X}, \bm{b}^T \bm{X}) = \bm{a}^T C \bm{b}$$. Show.
• The matrix $$C$$ is real and symmetric, so all of its eigenvalues are real. Show.
• The matrix $$C$$ is positive semi-definite, i.e., its eigenvalues are all non-negative. Show.
• Eigenvectors corresponding to distinct eigenvalues of a real symmetric matrix are orthogonal. Show.
• It is possible to construct a complete orthonormal basis from the eigenvectors of a real symmetric matrix. Show.
• If $$\bm{v}_1$$ and $$\bm{v}_2$$ are orthogonal eigenvectors of $$C$$, then the linear combinations $$\bm{v}_1^T \bm{X}$$ and $$\bm{v}_2^T \bm{X}$$ are uncorrelated. Show.

The key ingredients are now all present. The PCA procedure results in a set of (linearly) uncorrelated factors, or principal components. These factors are linear combinations of the original set of random variables, and there is a direct link between the eigenvectors of the variance-covariance matrix and these factors: the entries of each eigenvector are the coefficients in the linear combination for a factor, and the orthogonality of the eigenvectors ensures that the factors are uncorrelated.

The only thing left is to show that there's a usefulness to the particular set of factors generated by PCA, beyond merely being uncorrelated. The goal of PCA is usually to reduce the dimensionality of the data – instead of working with a large number of variables, it would be easier to have merely a few factors which capture most of the behaviour. My thought process, though, went from the opposite direction. Suppose that there is an exact linear dependence in the data, so that, for some non-zero vector $$\bm{a}$$, the equation $$\bm{a}^T \bm{X} = \text{const}$$ holds always. Then the output of PCA should make it clear that $$\bm{a}^T \bm{X}$$ is a useless factor to study.

The key step (which I had missed in my earlier PCA dabblings) is to take the variance of both sides of the equation $$\bm{a}^T \bm{X} = \text{const}$$. The variance of the constant is zero, and using one of the earlier results, it follows that $$\bm{a}^T C \bm{a} = 0$$.

• If $$\bm{a}$$ is non-zero and $$\bm{a}^T C \bm{a} = 0$$, then $$\bm{a}$$ is an eigenvector of $$C$$ with eigenvalue zero. Show.

So now it's clear that if there is an exact linear dependence in the original data, then it'll show up in the form of a zero eigenvalue (and corresponding eigenvector) of the variance-covariance matrix. The converse almost surely holds* as well: if $$C \bm{v} = 0$$, then $$0 = \bm{v}^T C \bm{v} = \Var(\bm{v}^T \bm{X})$$, so that a zero eigenvalue will be associated with a factor with zero variance, i.e. a linear combination of the original variables that is equal to a constant (with probability 1).

*I can't resist almost surely jokes, and I don't apologise.

A constant factor is pretty useless for studying the variation in a dataset – the other $$N - 1$$ factors are sufficient to re-constitute the original data. If there are no zero eigenvalues, but instead one very small eigenvalue, the the other $$N - 1$$ factors will only be able to approximately re-constitute the original data. The hope is (and very often it's true) that the approximation is good enough to do useful study on the reduced dataset. In fact, it's often the case that you can discard most factors – the ones with the lowest variance – leaving only a handful of high-variance factors.

• The factor with the most variance corresponds to the largest eigenvalue of $$C$$. Show.

That's the central bits of the theory. Since each factor $$F_i = \bm{v}_i^T \bm{X}$$, it is useful to define a matrix $$P$$ whose rows are the transposed eigenvectors $$\bm{v}_i^T$$. The conversion from original data to factors is then a matrix multiplication, hopefully easy in your favourite programming language: $$\bm{F} = P\bm{X}$$. By the orthonormality of the eigenvectors, the inverse of $$P$$ is its transpose, so the reverse operation goes $$\bm{X} = P^T \bm{F}$$.

Usually there's only a small number of factors $$M < N$$ that you need to keep – perhaps as many as needed to reproduce 95% of the variance (as measured by the sum of the eigenvalues included divided by the sum of all eigenvalues). In these cases, the conversion between the $$M$$ factors $$\bm{F}$$ and the $$N$$ variables $$\bm{X}$$ is only approximate. Letting $$\tilde{P}$$ be the matrix containing the first $$M$$ rows of $$P$$, the conversions are $$\bm{F} = \tilde{P}\bm{X}$$ and $$\bm{X} \approx \tilde{P}^T \bm{F}$$.

The variances that get calculated throughout PCA are dependent on the scale of each variable. If one of the variables is a height above sea level measured in metres, then converting to feet will multiply its variance by about 10. But it would be silly to have these carefully constructed uncorrelated factors be dependent on whether the data's measured in metric or Imperial (or metres instead of centimetres, or...). So it usually makes sense to convert the data to z-scores: subtract the mean for that variable and then divide by the standard deviation.

(The "mean-centring" step doesn't look strictly necessary to me, despite it being strongly recommended in almost every guide I've seen. Mean-centring won't change the covariance matrix, and therefore won't change the resulting eigenvalues and eigenvectors. There may be some interpretational difficulties without mean-centring, I don't know.

Part of the issue is that some stats programs don't actually calculate the variance-covariance matrix, but instead take the $$n \times N$$ matrix of data $$X$$ and calculate $$\frac{1}{n-1}X^T X$$, which is only equal to the variance-covariance matrix if the means are all zero. This is how R's prcomp() function works, and it would almost always be appropriate to subtract the means off (which prcomp() does by default.)

And that's the end of my notes on PCA. I often see scatterplots of unscaled data with vectors drawn on them in explanations of the subject; I haven't drawn any because they don't help me understand why PCA works, as opposed to what it does. Usually my intuition is helped a lot by a 2D picture, but that hasn't been the case for me here.

Posted 2015-11-16, updated 2015-11-17 (expandedcomment on mean centring).

Home > Misc