Principal component analysis
I've dabbled in PCA from time to time, and I was familiar with the outline of the process: convert the data to z-scores, calculate the
variance-covariance matrix, and then work with the first few eigenvectors of this matrix. But I'd never understood why it worked – it was
totally mistifying that you'd create a matrix \( C_{ij} \) with terms \( E[(X_i - E(X_i))(X_j - E(X_j))] \), then solve for \( \lambda \) and
\( \bm{v} \) in \( C \bm{v} = \lambda \bm{v} \), and have those \( \bm{v} \)'s actually mean something.
Maybe this was all just staring at me in the Wikipedia article and it'd just never registered in my brain. In any case, the concepts behind PCA are
not difficult (with one exception that I'll handwave over). But there are a lot of pieces to the full explanation, which probably explains why I'd
never been totally satisfied reading other accounts – there was a little bit that I was having trouble with, which either authors skipped
over or which I missed in amongst lots of basic things about eigenvalues that I didn't need to read carefully.
These notes are written largely for an imagined audience similar to a past me – comfortable with matrix algebra, but without solid stats
training. I assume knowledge of variance and covariance; if you don't know what an eigenvector is, then you'll probably find my treatment a little
quick (though you can safely assume that the computer can calculate them for you).
The main steps are shown as bullet points, with explanations that you can show by clicking or tapping the appropriate link.
Let \( X_i \) be the \( N \) random variables under consideration. Let \( C \) be the variance-covariance matrix, i.e.,
\( C_{ij} = \Cov(X_i, X_j) \). Let \( \bm{a} \) be a vector of co-efficients, so that the linear combination of the random variables
\( \sum_i a_i X_i \) can be written as \( \bm{a}^T \bm{X} \).
- The variance of that linear combination is \( \Var(\bm{a}^T \bm{X}) = \bm{a}^T C \bm{a} \). Show.
Starting with the basics: if \( X \) and \( Y \) are random variables, then
\begin{equation*}
\Var(X + Y) = \Var(X) + \Var(Y) + 2\Cov(X, Y).
\end{equation*}
Since \( \Cov(X, X) = \Var(X) \), an equivalent expression for the variance of the sum is
\begin{equation*}
\Var(X + Y) = \Cov(X, X) + \Cov(X, Y) + \Cov(Y, X) + \Cov(Y, Y)
\end{equation*}
More generally, if there are \( N \) random variables \( X_i \), then
\begin{equation*}
\Var\left(\sum_i X_i\right) = \sum_{i=1}^N \sum_{j=1}^N \Cov(X_i, X_j).
\end{equation*}
Now, the covariance is bi-linear, and in particular, \( \Cov(aX, bY) = ab\Cov(X, Y) \). So, the variance of the linear combination
\( \bm{a}^T \bm{X} \) is
\begin{align*}
\Var(\bm{a}^T \bm{X}) &= \Var\left(\sum_{i=1}^N a_i X_i\right) \\
&= \sum_{i=1}^N \sum_{j=1}^N \Cov(a_i X_i, a_j X_j) \\
&= \sum_{i=1}^N \sum_{j=1}^N a_i a_j \Cov(X_i, X_j) \\
&= \sum_{i=1}^N \sum_{j=1}^N a_i a_j C_{ij} \\
&= \bm{a}^T C \bm{a}.
\end{align*}
The data being fed into the covariances is all real, so all entries of \( C \) are real. Symmetry comes from the symmetry of the covariance:
\( \Cov(X, Y) = \Cov(Y, X) \).
A non-zero vector \( \bm{x} \) is an eigenvector of a matrix \( M \) with (scalar) eigenvalue \( \lambda \) if \( M \bm{x} = \lambda \bm{x} \). In general,
eigenvalues and eigenvectors may be complex, even if all entries of \( M \) are real. When working with complex vectors (and matrices), it's useful to
use the complex-conjugate transpose (or "Hermitian conjugate), denoted with a dagger \( ^{\dagger} \), instead of the regular transpose. The reason for
this is that \( \sqrt{\bm{a}^{\dagger} \bm{a}} \) gives a sensible norm for a complex vector \( \bm{a} \), whereas \( \sqrt{\bm{a}^T \bm{a}} \) will usually fail: the norm
of a complex number \( z = a + bi \) is \( \sqrt{a^2 + b^2} \), which is equal to \( \sqrt{zz^*} \) but not equal to \( \sqrt{z^2} \).
An eigenvector can be multiplied by any non-zero constant and it will remain an eigenvector, and it is useful to scale all eigenvectors so that they
have unit norm. Let \( v \) be a normalised eigenvector of \( C \) with eigenvalue \( \lambda \), i.e., \( \bm{v}^{\dagger} \bm{v} = 1 \) and
\( C \bm{v} = \lambda \bm{v} \). Left-multiplying both sides by the Hermitian conjugate \( \bm{v}^{\dagger} \) then gives
\( \bm{v}^{\dagger} C \bm{v} = \lambda \).
Taking the Hermitian conjugate of both sides gives
\begin{align*}
\lambda^* &= (\bm{v}^{\dagger} C \bm{v})^{\dagger} \\
&= \bm{v}^{\dagger} C^{\dagger} \bm{v} \\
&= \bm{v}^{\dagger} C^T \bm{v}.
\end{align*}
(If that's not clear, then work through the algebra element-wise. Since \( C \) is real, its Hermitian conjugate is the regular transpose.)
But since \( C \) is symmetric, it's equal to its transpose, and so \( \lambda^* = \bm{v}^{\dagger} C \bm{v} \). And since
\( \bm{v}^{\dagger} C \bm{v} = \lambda \), it follows that \( \lambda = \lambda^* \), i.e., the eigenvalue \( \lambda \) is real. This argument
applies to any eigenvalue of a real symmetric matrix, so all eigenvalues of \( C \) are real.
Since all eigenvalues and entries of \( C \) are real, all eigenvectors of \( C \) must also be real, and so I'll go back to using regular
transposes from here on.
Here comes the hand-wave. If all eigenvalues of the real symmetric matrix are distinct, then all eigenvectors are mutually orthogonal, and
hence they form a complete basis. The only wrinkle comes from repeated eigenvalues, for which it's not a priori clear that eigenvectors will be
linearly independent. Indeed, for non-symmetric matrices, they may not be; such matrices are called
defective and it's quite cruel of algebra to make many innocuous-looking matrices
defective in this way.
A suggestive argument, which I'm comfortable with because in quantum physics I used to do this often, is to add a small perturbation to the matrix
to break the degeneracy while maintaining the symmetry. Then, instead of repeated eigenvalues, there will eigenvalues that are close in value but
distinct, and the eigenvectors corresponding to these will be orthogonal. Take the limit as the size of the perturbation goes to zero, and they
remain orthogonal.
(This argument doesn't apply to defective matrices, because there is no need for the eigenvectors corresponding to the distinct eigenvalues to be
orthogonal – as the perturbation goes to zero, the offending eigenvectors get closer and closer to pointing in the same direction.)
It's a hand-wave, and not everything in linear algebra behaves nicely when you take a limit (e.g., the
Jordan normal form). But it's OK, because these degeneracies come up in
quantum Hamiltonians all the time, and it always works.
The key ingredients are now all present. The PCA procedure results in a set of (linearly) uncorrelated factors, or principal components. These
factors are linear combinations of the original set of random variables, and there is a direct link between the eigenvectors of the variance-covariance
matrix and these factors: the entries of each eigenvector are the coefficients in the linear combination for a factor, and the orthogonality of
the eigenvectors ensures that the factors are uncorrelated.
The only thing left is to show that there's a usefulness to the particular set of factors generated by PCA, beyond merely being uncorrelated. The
goal of PCA is usually to reduce the dimensionality of the data – instead of working with a large number of variables, it would be easier to
have merely a few factors which capture most of the behaviour. My thought process, though, went from the opposite direction. Suppose that there is an
exact linear dependence in the data, so that, for some non-zero vector \( \bm{a} \), the equation \( \bm{a}^T \bm{X} = \text{const} \) holds always.
Then the output of PCA should make it clear that \( \bm{a}^T \bm{X} \) is a useless factor to study.
The key step (which I had missed in my earlier PCA dabblings) is to take the variance of both sides of the equation \( \bm{a}^T \bm{X} = \text{const} \).
The variance of the constant is zero, and using one of the earlier results, it follows that \( \bm{a}^T C \bm{a} = 0 \).
- If \( \bm{a} \) is non-zero and \( \bm{a}^T C \bm{a} = 0 \), then \( \bm{a} \) is an eigenvector of \( C \) with eigenvalue zero.
Show.
Let \( \bm{v}_i \) be an orthonormal set of eigenvectors of \( C \) with corresponding eigenvalues \( \lambda_i \). Since the \( \bm{v}_i \)
form a complete basis, the vector \( \bm{a} \) can be written as a linear combination of them. i.e., for some non-zero vector of coefficients
\( \bm{b} \),
\begin{equation*}
\bm{a} = \sum_{i=1}^N b_i \bm{v}_i.
\end{equation*}
Substitute this expression into the variance and use the definition of the eigenvector:
\begin{align*}
0 &= \bm{a}^T C \bm{a} \\
&= \sum_{i=1}^N \sum_{j=1}^N b_i \bm{v}_i^T C b_j \bm{v}_j \\
&= \sum_{i=1}^N \sum_{j=1}^N b_i b_j \bm{v}_i^T C \bm{v}_j \\
&= \sum_{i=1}^N \sum_{j=1}^N b_i b_j \bm{v}_i^T \lambda_j \bm{v}_j.
\end{align*}
Now, the eigenvectors are mutually orthogonal, so \( \bm{v}_i^T \bm{v}_j = \delta_{ij} \), the Kronocker-delta function (i.e., 1 if \( i = j \) and
0 otherwise). That reduces the double-sum to a single sum, since only \( i = j \) terms contribute to the total:
\begin{equation*}
0 = \sum_{i=1}^N b_i^2 \lambda_i.
\end{equation*}
Now, the eigenvalues \( \lambda_i \) are all non-negative, and the \( b_i^2 \) are also non-negative. The only way for the sum to equal zero is
if \( b_i = 0 \) for \( \lambda_i > 0 \), with \( b_i \neq 0 \) for at least one \( \lambda_i = 0 \). (There may only be one eigenvector with a zero
eigenvalue, but there may be more.)
In other words, the vector \( \bm{a} \) is a linear combination of the eigenvectors with zero eigenvalues. If there is only one such eigenvector,
then \( \bm{a} \) must be a scalar multiple of it, so \( \bm{a} \) is an eigenvector with eigenvalue zero. If there's more than one such eigenvector,
then let the index \( i' \) range over them. Then
\begin{align*}
C\bm{a} &= \sum_{i'} C b_{i'} \bm{v}_{i'} \\
&= \sum_{i'} b_{i'} \times 0 \times \bm{v}_{i'} \\
&= 0 \\
&= 0 \bm{a},
\end{align*}
so \( \bm{a} \) is an eigenvector of \( C \) with eigenvalue zero.
So now it's clear that if there is an exact linear dependence in the original data, then it'll show up in the form of a zero eigenvalue (and
corresponding eigenvector) of the variance-covariance matrix. The converse almost surely holds* as well: if \( C \bm{v} = 0 \), then
\( 0 = \bm{v}^T C \bm{v} = \Var(\bm{v}^T \bm{X}) \), so that a zero eigenvalue will be associated with a factor with zero variance, i.e. a linear
combination of the original variables that is equal to a constant (with probability 1).
*I can't resist almost surely jokes, and I don't apologise.
A constant factor is pretty useless for studying the variation in a dataset – the other \( N - 1 \) factors are sufficient to re-constitute the
original data. If there are no zero eigenvalues, but instead one very small eigenvalue, the the other \( N - 1 \) factors will only be able to
approximately re-constitute the original data. The hope is (and very often it's true) that the approximation is good enough to do useful study
on the reduced dataset. In fact, it's often the case that you can discard most factors – the ones with the lowest variance – leaving
only a handful of high-variance factors.
- The factor with the most variance corresponds to the largest eigenvalue of \( C \). Show.
That's the central bits of the theory. Since each factor \( F_i = \bm{v}_i^T \bm{X} \), it is useful to define a matrix \( P \) whose rows are the
transposed eigenvectors \( \bm{v}_i^T \). The conversion from original data to factors is then a matrix multiplication, hopefully easy in your
favourite programming language: \( \bm{F} = P\bm{X} \). By the orthonormality of the eigenvectors, the inverse of \( P \) is its transpose, so
the reverse operation goes \( \bm{X} = P^T \bm{F} \).
Usually there's only a small number of factors \( M < N \) that you need to keep – perhaps as many as needed to reproduce 95% of the variance
(as measured by the sum of the eigenvalues included divided by the sum of all eigenvalues). In these cases, the conversion between the \( M \) factors
\( \bm{F} \) and the \( N \) variables \( \bm{X} \) is only approximate. Letting \( \tilde{P} \) be the matrix containing the first \( M \) rows of \( P \),
the conversions are \( \bm{F} = \tilde{P}\bm{X} \) and \( \bm{X} \approx \tilde{P}^T \bm{F} \).
The variances that get calculated throughout PCA are dependent on the scale of each variable. If one of the variables is a height above sea level
measured in metres, then converting to feet will multiply its variance by about 10. But it would be silly to have these carefully constructed
uncorrelated factors be dependent on whether the data's measured in metric or Imperial (or metres instead of centimetres, or...). So it usually makes
sense to convert the data to z-scores: subtract the mean for that variable and then divide by the standard deviation.
(The "mean-centring" step doesn't look strictly necessary to me, despite it being strongly recommended in almost every guide I've seen. Mean-centring
won't change the covariance matrix, and therefore won't change the resulting eigenvalues and eigenvectors. There may be some interpretational
difficulties without mean-centring, I don't know.
Part of the issue is that some stats programs don't actually calculate the variance-covariance matrix, but instead take the \( n \times N \) matrix of
data \( X \) and calculate \( \frac{1}{n-1}X^T X \), which is only equal to the variance-covariance matrix if the means are all zero. This is how R's
prcomp()
function works, and it would almost always be appropriate to subtract the means off (which prcomp()
does by
default.)
And that's the end of my notes on PCA. I often see scatterplots of unscaled data with vectors drawn on them in explanations of the subject; I haven't
drawn any because they don't help me understand why PCA works, as opposed to what it does. Usually my intuition is helped a lot by
a 2D picture, but that hasn't been the case for me here.
Posted 2015-11-16, updated 2015-11-17 (expandedcomment on mean centring).