Principal component analysis

I've dabbled in PCA from time to time, and I was familiar with the outline of the process: convert the data to z-scores, calculate the variance-covariance matrix, and then work with the first few eigenvectors of this matrix. But I'd never understood why it worked – it was totally mistifying that you'd create a matrix \( C_{ij} \) with terms \( E[(X_i - E(X_i))(X_j - E(X_j))] \), then solve for \( \lambda \) and \( \bm{v} \) in \( C \bm{v} = \lambda \bm{v} \), and have those \( \bm{v} \)'s actually mean something.

Maybe this was all just staring at me in the Wikipedia article and it'd just never registered in my brain. In any case, the concepts behind PCA are not difficult (with one exception that I'll handwave over). But there are a lot of pieces to the full explanation, which probably explains why I'd never been totally satisfied reading other accounts – there was a little bit that I was having trouble with, which either authors skipped over or which I missed in amongst lots of basic things about eigenvalues that I didn't need to read carefully.

These notes are written largely for an imagined audience similar to a past me – comfortable with matrix algebra, but without solid stats training. I assume knowledge of variance and covariance; if you don't know what an eigenvector is, then you'll probably find my treatment a little quick (though you can safely assume that the computer can calculate them for you).

The main steps are shown as bullet points, with explanations that you can show by clicking or tapping the appropriate link.

Let \( X_i \) be the \( N \) random variables under consideration. Let \( C \) be the variance-covariance matrix, i.e., \( C_{ij} = \Cov(X_i, X_j) \). Let \( \bm{a} \) be a vector of co-efficients, so that the linear combination of the random variables \( \sum_i a_i X_i \) can be written as \( \bm{a}^T \bm{X} \).

The variance of that linear combination is \( \Var(\bm{a}^T \bm{X}) = \bm{a}^T C \bm{a} \). Show.

Starting with the basics: if \( X \) and \( Y \) are random variables, then

\begin{equation*} \Var(X + Y) = \Var(X) + \Var(Y) + 2\Cov(X, Y). \end{equation*}

Since \( \Cov(X, X) = \Var(X) \), an equivalent expression for the variance of the sum is

\begin{equation*} \Var(X + Y) = \Cov(X, X) + \Cov(X, Y) + \Cov(Y, X) + \Cov(Y, Y) \end{equation*}

More generally, if there are \( N \) random variables \( X_i \), then

\begin{equation*} \Var\left(\sum_i X_i\right) = \sum_{i=1}^N \sum_{j=1}^N \Cov(X_i, X_j). \end{equation*}

Now, the covariance is bi-linear, and in particular, \( \Cov(aX, bY) = ab\Cov(X, Y) \). So, the variance of the linear combination \( \bm{a}^T \bm{X} \) is

\begin{align*} \Var(\bm{a}^T \bm{X}) &= \Var\left(\sum_{i=1}^N a_i X_i\right) \\ &= \sum_{i=1}^N \sum_{j=1}^N \Cov(a_i X_i, a_j X_j) \\ &= \sum_{i=1}^N \sum_{j=1}^N a_i a_j \Cov(X_i, X_j) \\ &= \sum_{i=1}^N \sum_{j=1}^N a_i a_j C_{ij} \\ &= \bm{a}^T C \bm{a}. \end{align*}

The covariance of a pair of linear combinations \( \bm{a}^T \bm{X} \) and \( \bm{b}^T \bm{X} \) is \( \Cov(\bm{a}^T \bm{X}, \bm{b}^T \bm{X}) = \bm{a}^T C \bm{b} \). Show.

The matrix \( C \) is real and symmetric, so all of its eigenvalues are real. Show.

The data being fed into the covariances is all real, so all entries of \( C \) are real. Symmetry comes from the symmetry of the covariance: \( \Cov(X, Y) = \Cov(Y, X) \).

A non-zero vector \( \bm{x} \) is an eigenvector of a matrix \( M \) with (scalar) eigenvalue \( \lambda \) if \( M \bm{x} = \lambda \bm{x} \). In general, eigenvalues and eigenvectors may be complex, even if all entries of \( M \) are real. When working with complex vectors (and matrices), it's useful to use the complex-conjugate transpose (or "Hermitian conjugate), denoted with a dagger \( ^{\dagger} \), instead of the regular transpose. The reason for this is that \( \sqrt{\bm{a}^{\dagger} \bm{a}} \) gives a sensible norm for a complex vector \( \bm{a} \), whereas \( \sqrt{\bm{a}^T \bm{a}} \) will usually fail: the norm of a complex number \( z = a + bi \) is \( \sqrt{a^2 + b^2} \), which is equal to \( \sqrt{zz^*} \) but not equal to \( \sqrt{z^2} \).

An eigenvector can be multiplied by any non-zero constant and it will remain an eigenvector, and it is useful to scale all eigenvectors so that they have unit norm. Let \( v \) be a normalised eigenvector of \( C \) with eigenvalue \( \lambda \), i.e., \( \bm{v}^{\dagger} \bm{v} = 1 \) and \( C \bm{v} = \lambda \bm{v} \). Left-multiplying both sides by the Hermitian conjugate \( \bm{v}^{\dagger} \) then gives \( \bm{v}^{\dagger} C \bm{v} = \lambda \).

Taking the Hermitian conjugate of both sides gives

\begin{align*} \lambda^* &= (\bm{v}^{\dagger} C \bm{v})^{\dagger} \\ &= \bm{v}^{\dagger} C^{\dagger} \bm{v} \\ &= \bm{v}^{\dagger} C^T \bm{v}. \end{align*}

(If that's not clear, then work through the algebra element-wise. Since \( C \) is real, its Hermitian conjugate is the regular transpose.)

But since \( C \) is symmetric, it's equal to its transpose, and so \( \lambda^* = \bm{v}^{\dagger} C \bm{v} \). And since \( \bm{v}^{\dagger} C \bm{v} = \lambda \), it follows that \( \lambda = \lambda^* \), i.e., the eigenvalue \( \lambda \) is real. This argument applies to any eigenvalue of a real symmetric matrix, so all eigenvalues of \( C \) are real.

Since all eigenvalues and entries of \( C \) are real, all eigenvectors of \( C \) must also be real, and so I'll go back to using regular transposes from here on.

The matrix \( C \) is positive semi-definite, i.e., its eigenvalues are all non-negative. Show.

Eigenvectors corresponding to distinct eigenvalues of a real symmetric matrix are orthogonal. Show.

I'll use the matrix \( C \) here, but the argument applies more generally. Since \( C \) is symmetric, an eigenvector is also a left-eigenvector, i.e., if \( C \bm{v} = \lambda \bm{v} \), then \( \bm{v}^T C = \lambda \bm{v}^T \). Now, if \( \bm{v}_1 \) and \( \bm{v}_2 \) are eigenvectors of \( C \) with corresponding distinct eigenvalues \( \lambda_1 \) and \( \lambda_2 \), then \( \bm{v}_2^T C \bm{v}_1 = \bm{v}_2^T \lambda_1\bm{v}_1 \) and also \( \bm{v}_2^T C \bm{v}_1 = \bm{v}_2^T \lambda_2\bm{v}_1 \), so \( \lambda_1 \bm{v}_2^T \bm{v}_1 = \lambda_2 \bm{v}_2^T \bm{v}_1 \), or \( (\lambda_1 - \lambda_2) \bm{v}_2^T \bm{v}_1 = 0 \).

By assumption, the two eigenvalues are distinct, so \( \lambda_1 - \lambda_2 \neq 0 \), and so it must be that \( \bm{v}_2^T \bm{v}_1 = 0 \), i.e., the two eigenvectors are orthogonal.

It is possible to construct a complete orthonormal basis from the eigenvectors of a real symmetric matrix. Show.

Here comes the hand-wave. If all eigenvalues of the real symmetric matrix are distinct, then all eigenvectors are mutually orthogonal, and hence they form a complete basis. The only wrinkle comes from repeated eigenvalues, for which it's not a priori clear that eigenvectors will be linearly independent. Indeed, for non-symmetric matrices, they may not be; such matrices are called defective and it's quite cruel of algebra to make many innocuous-looking matrices defective in this way.

A suggestive argument, which I'm comfortable with because in quantum physics I used to do this often, is to add a small perturbation to the matrix to break the degeneracy while maintaining the symmetry. Then, instead of repeated eigenvalues, there will eigenvalues that are close in value but distinct, and the eigenvectors corresponding to these will be orthogonal. Take the limit as the size of the perturbation goes to zero, and they remain orthogonal.

(This argument doesn't apply to defective matrices, because there is no need for the eigenvectors corresponding to the distinct eigenvalues to be orthogonal – as the perturbation goes to zero, the offending eigenvectors get closer and closer to pointing in the same direction.)

It's a hand-wave, and not everything in linear algebra behaves nicely when you take a limit (e.g., the Jordan normal form). But it's OK, because these degeneracies come up in quantum Hamiltonians all the time, and it always works.

If \( \bm{v}_1 \) and \( \bm{v}_2 \) are orthogonal eigenvectors of \( C \), then the linear combinations \( \bm{v}_1^T \bm{X} \) and \( \bm{v}_2^T \bm{X} \) are uncorrelated. Show.

The key ingredients are now all present. The PCA procedure results in a set of (linearly) uncorrelated factors, or principal components. These factors are linear combinations of the original set of random variables, and there is a direct link between the eigenvectors of the variance-covariance matrix and these factors: the entries of each eigenvector are the coefficients in the linear combination for a factor, and the orthogonality of the eigenvectors ensures that the factors are uncorrelated.

The only thing left is to show that there's a usefulness to the particular set of factors generated by PCA, beyond merely being uncorrelated. The goal of PCA is usually to reduce the dimensionality of the data – instead of working with a large number of variables, it would be easier to have merely a few factors which capture most of the behaviour. My thought process, though, went from the opposite direction. Suppose that there is an exact linear dependence in the data, so that, for some non-zero vector \( \bm{a} \), the equation \( \bm{a}^T \bm{X} = \text{const} \) holds always. Then the output of PCA should make it clear that \( \bm{a}^T \bm{X} \) is a useless factor to study.

The key step (which I had missed in my earlier PCA dabblings) is to take the variance of both sides of the equation \( \bm{a}^T \bm{X} = \text{const} \). The variance of the constant is zero, and using one of the earlier results, it follows that \( \bm{a}^T C \bm{a} = 0 \).

If \( \bm{a} \) is non-zero and \( \bm{a}^T C \bm{a} = 0 \), then \( \bm{a} \) is an eigenvector of \( C \) with eigenvalue zero. Show.

Let \( \bm{v}_i \) be an orthonormal set of eigenvectors of \( C \) with corresponding eigenvalues \( \lambda_i \). Since the \( \bm{v}_i \) form a complete basis, the vector \( \bm{a} \) can be written as a linear combination of them. i.e., for some non-zero vector of coefficients \( \bm{b} \),

\begin{equation*} \bm{a} = \sum_{i=1}^N b_i \bm{v}_i. \end{equation*}

Substitute this expression into the variance and use the definition of the eigenvector:

\begin{align*} 0 &= \bm{a}^T C \bm{a} \\ &= \sum_{i=1}^N \sum_{j=1}^N b_i \bm{v}_i^T C b_j \bm{v}_j \\ &= \sum_{i=1}^N \sum_{j=1}^N b_i b_j \bm{v}_i^T C \bm{v}_j \\ &= \sum_{i=1}^N \sum_{j=1}^N b_i b_j \bm{v}_i^T \lambda_j \bm{v}_j. \end{align*}

Now, the eigenvectors are mutually orthogonal, so \( \bm{v}_i^T \bm{v}_j = \delta_{ij} \), the Kronocker-delta function (i.e., 1 if \( i = j \) and 0 otherwise). That reduces the double-sum to a single sum, since only \( i = j \) terms contribute to the total:

\begin{equation*} 0 = \sum_{i=1}^N b_i^2 \lambda_i. \end{equation*}

Now, the eigenvalues \( \lambda_i \) are all non-negative, and the \( b_i^2 \) are also non-negative. The only way for the sum to equal zero is if \( b_i = 0 \) for \( \lambda_i > 0 \), with \( b_i \neq 0 \) for at least one \( \lambda_i = 0 \). (There may only be one eigenvector with a zero eigenvalue, but there may be more.)

In other words, the vector \( \bm{a} \) is a linear combination of the eigenvectors with zero eigenvalues. If there is only one such eigenvector, then \( \bm{a} \) must be a scalar multiple of it, so \( \bm{a} \) is an eigenvector with eigenvalue zero. If there's more than one such eigenvector, then let the index \( i' \) range over them. Then

\begin{align*} C\bm{a} &= \sum_{i'} C b_{i'} \bm{v}_{i'} \\ &= \sum_{i'} b_{i'} \times 0 \times \bm{v}_{i'} \\ &= 0 \\ &= 0 \bm{a}, \end{align*}

so \( \bm{a} \) is an eigenvector of \( C \) with eigenvalue zero.

So now it's clear that if there is an exact linear dependence in the original data, then it'll show up in the form of a zero eigenvalue (and corresponding eigenvector) of the variance-covariance matrix. The converse almost surely holds^* as well: if \( C \bm{v} = 0 \), then \( 0 = \bm{v}^T C \bm{v} = \Var(\bm{v}^T \bm{X}) \), so that a zero eigenvalue will be associated with a factor with zero variance, i.e. a linear combination of the original variables that is equal to a constant (with probability 1).

^*I can't resist almost surely jokes, and I don't apologise.

A constant factor is pretty useless for studying the variation in a dataset – the other \( N - 1 \) factors are sufficient to re-constitute the original data. If there are no zero eigenvalues, but instead one very small eigenvalue, the the other \( N - 1 \) factors will only be able to approximately re-constitute the original data. The hope is (and very often it's true) that the approximation is good enough to do useful study on the reduced dataset. In fact, it's often the case that you can discard most factors – the ones with the lowest variance – leaving only a handful of high-variance factors.

The factor with the most variance corresponds to the largest eigenvalue of \( C \). Show.

Let \( \lambda_1 \) be the largest eigenvalue of \( C \). Following the same train of thought as in the previous explanation block, write a (normalised) vector \( \bm{a} \) in terms of the eigenvector basis \( \bm{v}_i \), so that \( \bm{a} = \sum_{i=1}^N b_i \bm{v}_i \) and \( \Var(\bm{a}^T \bm{X}) = \sum_{i=1}^N b_i^2 \lambda_i \).

Since \( \bm{a} \) has unit norm, none of the \( b_i \) can be greater than 1. The greatest variance for the factor \( \bm{a}^T \bm{X} \) must therefore come from giving full weight \( b_1 = 1 \) to the dominant eigenvector \( \bm{v}_1 \) and zero weight to all other eigenvectors. i.e., the factor with the greatest variance corresponds to the largest eigenvalue.

That's the central bits of the theory. Since each factor \( F_i = \bm{v}_i^T \bm{X} \), it is useful to define a matrix \( P \) whose rows are the transposed eigenvectors \( \bm{v}_i^T \). The conversion from original data to factors is then a matrix multiplication, hopefully easy in your favourite programming language: \( \bm{F} = P\bm{X} \). By the orthonormality of the eigenvectors, the inverse of \( P \) is its transpose, so the reverse operation goes \( \bm{X} = P^T \bm{F} \).

Usually there's only a small number of factors \( M < N \) that you need to keep – perhaps as many as needed to reproduce 95% of the variance (as measured by the sum of the eigenvalues included divided by the sum of all eigenvalues). In these cases, the conversion between the \( M \) factors \( \bm{F} \) and the \( N \) variables \( \bm{X} \) is only approximate. Letting \( \tilde{P} \) be the matrix containing the first \( M \) rows of \( P \), the conversions are \( \bm{F} = \tilde{P}\bm{X} \) and \( \bm{X} \approx \tilde{P}^T \bm{F} \).

The variances that get calculated throughout PCA are dependent on the scale of each variable. If one of the variables is a height above sea level measured in metres, then converting to feet will multiply its variance by about 10. But it would be silly to have these carefully constructed uncorrelated factors be dependent on whether the data's measured in metric or Imperial (or metres instead of centimetres, or...). So it usually makes sense to convert the data to z-scores: subtract the mean for that variable and then divide by the standard deviation.

(The "mean-centring" step doesn't look strictly necessary to me, despite it being strongly recommended in almost every guide I've seen. Mean-centring won't change the covariance matrix, and therefore won't change the resulting eigenvalues and eigenvectors. There may be some interpretational difficulties without mean-centring, I don't know.

Part of the issue is that some stats programs don't actually calculate the variance-covariance matrix, but instead take the \( n \times N \) matrix of data \( X \) and calculate \( \frac{1}{n-1}X^T X \), which is only equal to the variance-covariance matrix if the means are all zero. This is how R's prcomp() function works, and it would almost always be appropriate to subtract the means off (which prcomp() does by default.)

And that's the end of my notes on PCA. I often see scatterplots of unscaled data with vectors drawn on them in explanations of the subject; I haven't drawn any because they don't help me understand why PCA works, as opposed to what it does. Usually my intuition is helped a lot by a 2D picture, but that hasn't been the case for me here.

Posted 2015-11-16, updated 2015-11-17 (expandedcomment on mean centring).