A scattered cloud of data points with two arrows through its centre showing the long and short axes of variance.
From the blog · by Ali Jabbary

Eigenvalues, SVD and PCA: the linear algebra ML actually uses

Ali Jabbary
M.Sc., P.Eng.

Article Summary

You call sklearn's PCA but can't quite say what it does. Here's the intuition — eigenvectors, SVD, and how PCA falls out of both — with a NumPy demo to prove it.

A student pasted me four lines last month and asked why they worked:

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

His 50-column dataset came out the other side with 2 columns, his model trained faster, and accuracy barely moved. "It clearly does something useful," he said, "but I genuinely could not tell you what." Which is a completely honest place to be, and a slightly dangerous one — because the day PCA does something surprising (and it will), you need a mental model to debug it, not a magic spell.

So this is the explanation I give in sessions. Almost no equations you can't do in your head; a lot of pictures-in-words. By the end you'll know what an eigenvector is, why PCA is just eigenvectors of a particular matrix, why SVD is the more honest way to compute it, and you'll watch a short block of NumPy reproduce sklearn exactly. The whole thing rests on one idea you might already have — the dot product as a measure of alignment — which I've written about separately as "the whole game". I'll lean on it here without re-deriving it.

An eigenvector is "the direction that doesn't get knocked off course"

Forget the word for a second. A matrix, applied to a vector, is a transformation — it stretches, squishes, and rotates space. Picture a rubber sheet with an arrow drawn on it. You grab the sheet and deform it. Most arrows come out pointing somewhere new — they got rotated.

But for many deformations, and for every symmetric matrix (the kind PCA uses), there are a few special directions where the arrow comes out pointing exactly the same way (or exactly backwards). It got longer or shorter, but it didn't turn. A pure rotation of the plane is the standard exception: it turns every arrow, so it has no such real direction. Those special directions are the eigenvectors. The factor by which each one stretched is its eigenvalue. That's the entire concept:

A @ v = λ * v

"Applying the matrix A to the special vector v is the same as just scaling v by the number λ." The transformation, along that direction, is pure stretch — no rotation. A big eigenvalue means "this direction gets stretched a lot"; a small one means "barely touched."

That single picture — the direction the transformation doesn't rotate, and how hard it pulls along it — is enough to understand everything that follows. Hold onto it.
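If you'd like to see that picture in numbers, here is a minimal sketch. The 2×2 matrix A and the test vector w are values I made up for illustration; np.linalg.eigh is NumPy's eigen-solver for symmetric matrices.

```python
import numpy as np

# A symmetric 2x2 matrix chosen for illustration (not from the article).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh handles symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Columns of `eigenvectors` are the special directions v.
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(lam, np.allclose(A @ v, lam * v))   # A @ v is just v scaled by lam

# A generic vector, by contrast, gets turned as well as stretched.
w = np.array([1.0, 0.0])
Aw = A @ w
cos_angle = (Aw @ w) / (np.linalg.norm(Aw) * np.linalg.norm(w))
print(cos_angle)   # below 1, so A @ w no longer points along w
```

Both eigenvector checks should come back True, while the cosine for w falls short of 1: that gap is the "knocked off course" part.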

PCA: which directions does the data spread out along?

Now the data problem. You've got a cloud of points in many dimensions, and most real datasets are not a featureless round blob. They're stretched — long and thin along some directions, flat along others. Two of your 50 columns might be height-in-cm and height-in-inches: technically two dimensions, but the cloud is a near-perfect diagonal line through that plane. There's really only one direction of interest there; the other is noise around it.

PCA asks a beautifully simple question: what are the directions along which my data spreads out the most? Find the long axis of the cloud. Then the next-longest axis at right angles to it. And so on. Those axes are the principal components. Keep the top few, throw away the rest, and you've compressed the data while losing as little of its spread as you can.

So how do we find the long axis? Spread is measured by variance, and the object that holds all the variance information for a multi-dimensional cloud is the covariance matrix.

The covariance matrix, and the punchline

Center your data first: subtract the mean of each column so the cloud sits at the origin. (This matters; PCA is about spread around the center, so the center has to be at zero. Forgetting this is the single most common PCA bug I see.) Call the centered data X, with n rows (one per sample). The sample covariance matrix is:

C = (1 / (n - 1)) * Xᵀ @ X

It's a square matrix, one row/column per feature. The diagonal entries are each feature's own variance — how much it spreads on its own. The off-diagonal entries are covariances — how much pairs of features move together. (Height-in-cm and height-in-inches would have a huge off-diagonal entry, because they move in lockstep.)
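Here is a small sketch of that matrix on synthetic data. The columns (two height measurements plus an unrelated noise column), the seed, and the variable names are all assumptions of mine, not data from the student's set.

```python
import numpy as np

rng = np.random.default_rng(0)                       # fixed seed; the data is synthetic
height_cm = rng.normal(170.0, 10.0, size=200)
height_in = height_cm / 2.54 + rng.normal(0.0, 0.1, size=200)   # same quantity, tiny noise
unrelated = rng.normal(0.0, 1.0, size=200)           # a column that tracks nothing

X_raw = np.column_stack([height_cm, height_in, unrelated])
Xc = X_raw - X_raw.mean(axis=0)                      # center each column
n = Xc.shape[0]

C = Xc.T @ Xc / (n - 1)                              # the covariance formula
print(np.allclose(C, np.cov(X_raw, rowvar=False)))   # np.cov centers internally

# Diagonal entries are the per-feature variances.
print(np.allclose(np.diag(C), X_raw.var(axis=0, ddof=1)))

# Off-diagonal entry for the two heights, rescaled to a correlation.
corr = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])
print(corr)                                          # very close to 1: lockstep
```

Dividing the off-diagonal entry by the two standard deviations turns it into a correlation, which makes "huge" easier to judge than the raw covariance.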

Here's the punchline, and it's worth saying slowly. The eigenvectors of the covariance matrix are the principal components, and each eigenvalue is the variance of the data along its eigenvector.

Read that again with the eigenvector picture in hand. The covariance matrix encodes the shape of the spread. Its eigenvectors are the directions that matrix doesn't rotate, which turn out to be exactly the axes the cloud is naturally aligned to. The eigenvector with the biggest eigenvalue points along the direction of greatest variance: the long axis. PCA is eigen-decomposition of the covariance matrix. That's the whole thing. Everything else is implementation.
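A quick way to convince yourself of the punchline is to build a cloud whose long axis you already know and check that the eigen-decomposition finds it. This is a sketch on synthetic data; the 30-degree tilt, the stretch factors, and the seed are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 2-D cloud: stretched along one axis, then rotated by 30 degrees.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_raw = (rng.normal(size=(500, 2)) * [3.0, 0.5]) @ R.T

Xc = X_raw - X_raw.mean(axis=0)
C = Xc.T @ Xc / (Xc.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(C)        # ascending order for symmetric C
order = np.argsort(eigvals)[::-1]           # re-sort biggest-first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Variance of the data projected onto each eigenvector equals its eigenvalue.
projected_var = (Xc @ eigvecs).var(axis=0, ddof=1)
print(np.allclose(projected_var, eigvals))

# The top eigenvector should sit near the known long axis (up to sign).
long_axis = R[:, 0]
alignment = abs(eigvecs[:, 0] @ long_axis)
print(alignment)                            # close to 1
```

Note the re-sort: np.linalg.eigh returns eigenvalues smallest-first, which is the opposite of the order PCA wants.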

SVD: the master decomposition that does it more honestly

If PCA is just eigenvectors of C, why does sklearn's docstring say it uses the Singular Value Decomposition instead? Because forming C from Xᵀ @ X and then eigen-decomposing it is the mathematically clean route and the numerically clumsy one. Building Xᵀ @ X squares the condition number of the problem (the ratio between the largest and smallest singular values of X), which is exactly the recipe for losing precision in floating point. SVD gets the same answer while working on X directly, and it's more stable.
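The squaring effect is easy to measure. The sketch below uses synthetic data whose columns sit on deliberately different scales (my assumption, to make the effect visible) and compares condition numbers with np.linalg.cond.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data with columns on very different scales (an assumption for the demo).
X_raw = rng.normal(size=(100, 3)) * [1e2, 1.0, 1e-2]
Xc = X_raw - X_raw.mean(axis=0)

cond_X = np.linalg.cond(Xc)              # largest / smallest singular value of X
cond_gram = np.linalg.cond(Xc.T @ Xc)    # the same ratio for X^T X

print(cond_X, cond_gram, cond_X ** 2)    # cond_gram is roughly cond_X squared
```

Double precision carries roughly 16 significant digits, so a condition number that jumps from about 1e4 to about 1e8 has used up four more of them before any eigen-solver even runs.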

SVD says: any matrix X whatsoever — rectangular, ugly, whatever — can be broken into three pieces:

X = U @ Σ @ Vᵀ

The geometric reading is the one to keep. Any linear transformation, no matter how complicated, is really just rotate, then scale along the axes, then rotate again (where a "rotation" here means any orthogonal matrix, so it may include a mirror flip):

  • Vᵀ is the first rotation — it lines the input up with a clean set of axes.
  • Σ (a diagonal matrix of singular values) is the pure scaling — stretch each lined-up axis by some amount. The singular values come out sorted biggest-first.
  • U is the final rotation into the output space.

That's a strong claim and it's true for every matrix in existence. Rotate, stretch, rotate. SVD is called the "master decomposition" because almost every other matrix factorization is a special case or a consequence of it.
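You don't have to take the rotate-stretch-rotate claim on faith. Here is a minimal check on an arbitrary rectangular matrix; the matrix M is random filler of mine, nothing more.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(6, 4))                        # an arbitrary rectangular matrix

U, S, Vt = np.linalg.svd(M, full_matrices=False)   # S is 1-D, sorted biggest-first

print(np.allclose(M, U @ np.diag(S) @ Vt))         # the three pieces rebuild M
print(np.allclose(U.T @ U, np.eye(4)))             # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(4)))           # rows of Vt are orthonormal
print(bool(np.all(np.diff(S) <= 0)))               # singular values never increase
```

The orthonormality checks are what "rotation" means in matrix terms: U and Vᵀ preserve lengths and angles, so all of the stretching lives in S.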

And the connection to PCA is direct. Substitute X = U @ Σ @ Vᵀ (with X centered) into Xᵀ @ X and the U's cancel, because Uᵀ @ U is the identity: what's left is V @ Σ² @ Vᵀ, which is an eigen-decomposition of Xᵀ @ X. So the columns of V (rows of Vᵀ) are the principal components, and each singular value σ gives the variance along its component as σ² / (n - 1). You never have to form the covariance matrix at all. SVD reads the principal directions straight off X. That's why sklearn and most serious numerical libraries compute PCA via SVD: same eigenvectors, better arithmetic.
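The two routes can be compared side by side. This sketch runs both on the same synthetic, correlated data (the mixing matrix and seed are arbitrary choices of mine) and checks that they agree on variances and on directions up to sign.

```python
import numpy as np

rng = np.random.default_rng(4)
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.2]])               # arbitrary mixing to correlate the columns
X_raw = rng.normal(size=(200, 3)) @ mix
Xc = X_raw - X_raw.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigen-decomposition of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (n - 1))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # biggest-first

# Route 2: SVD of the centered data, no covariance matrix formed.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(S ** 2 / (n - 1), eigvals))        # same variances

# Same directions up to sign: |v_i . e_i| should be 1 for each pair.
alignment = np.abs(np.sum(Vt * eigvecs.T, axis=1))
print(np.allclose(alignment, 1.0))
```

The absolute value in the last check is there because each route is free to return a direction or its negative.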

A note on what each piece "means" in ML terms

Piece | Shape (for X that's samples × features) | What it is | Where you meet it
------|------------------------------------------|------------|------------------
U | samples × k | Each sample's coordinates in the new basis | The transformed data, before scaling
Σ | k × k diagonal | Importance of each direction (singular values, descending) | Deciding how many components to keep
Vᵀ | k × features | The principal directions themselves | sklearn's components_

Note that NumPy hands you Σ as a 1-D array of singular values, not a square matrix — a small gotcha we'll handle in the demo.

Proving it: PCA-via-SVD equals sklearn, in NumPy

Talk is cheap. Let's show that the from-scratch SVD route lands on the same answer sklearn does. I'll use the classic iris dataset because it's tiny and ships with sklearn.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data            # 150 samples, 4 features
n = X.shape[0]

# --- PCA by hand, via SVD ---
X_centered = X - X.mean(axis=0)                       # step 1: center
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
# np.linalg.svd returns S as a 1-D array (singular values, descending)
# and Vt is ALREADY V-transpose: factorization is X_centered = U @ diag(S) @ Vt

components_manual = Vt                                # rows are principal directions
explained_var_manual = (S ** 2) / (n - 1)            # variance along each component

# Project the data onto the first 2 components
scores_manual = X_centered @ Vt[:2].T                # 150 x 2

# --- sklearn, for comparison ---
pca = PCA(n_components=2)
scores_sklearn = pca.fit_transform(X)

print("explained variance, mine   :", explained_var_manual[:2])
print("explained variance, sklearn:", pca.explained_variance_)

# As a fraction of total variance — the number people usually quote:
print("variance ratio, sklearn    :", pca.explained_variance_ratio_)

Run it and the two explained variance lines match to floating-point precision, because they're computing the identical thing two ways. The variance ratio line is the headline fact about this dataset: scikit-learn's own iris example reports a first-two-component split of roughly [0.92, 0.05], so the first principal component alone captures about 92% of the total spread, and the first two together capture about 98%. That's why you can replace the four original columns with two components and barely notice.
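The ratio itself is one line once you have the singular values. A small sketch, reusing the same iris data; the variable names here are mine.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data
Xc = X_iris - X_iris.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)          # singular values only

# Each component's share of the total variance; the (n - 1) factor cancels.
ratio_manual = S ** 2 / np.sum(S ** 2)
ratio_sklearn = PCA(n_components=2).fit(X_iris).explained_variance_ratio_

print(ratio_manual[:2], ratio_manual[:2].sum())
print(np.allclose(ratio_manual[:2], ratio_sklearn))
```

Note that the denominator sums over all four singular values, not just the two you keep, which is why the ratios for the kept components add up to less than 1.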

One honest wrinkle, because I promised no hand-waving: the sign of a component is arbitrary. Flip a principal direction to point the opposite way and it's still the same axis — the variance is unchanged. sklearn applies a deterministic rule to pick a consistent sign, and your raw SVD might land on the opposite one. So when you compare the projected scores, compare them up to a possible sign flip per column:

# Each column should match sklearn's either exactly or with its sign flipped:
for j in range(2):
    same      = np.allclose(scores_manual[:, j],  scores_sklearn[:, j])
    flipped   = np.allclose(scores_manual[:, j], -scores_sklearn[:, j])
    print(f"component {j}: matches sklearn (same or flipped) = {same or flipped}")
# component 0: matches sklearn (same or flipped) = True
# component 1: matches sklearn (same or flipped) = True

Both print True. The directions are identical; only the arrow's sign convention differed. If a student ever shows me "my PCA is wrong, the signs are flipped" — this is why, and it's not a bug. The geometry is the same.
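If you want your own from-scratch PCA to be repeatable, you can pick a sign convention yourself. The helper below, fix_signs, is a hypothetical one of mine: it flips each component so its largest-magnitude entry is positive. I'm not claiming this is the rule sklearn applies, only that any deterministic rule removes the ambiguity.

```python
import numpy as np

def fix_signs(components):
    """Flip each row so its largest-magnitude entry is positive.

    Illustrative convention only; not necessarily the rule sklearn uses.
    """
    components = np.asarray(components, dtype=float)
    idx = np.argmax(np.abs(components), axis=1)        # biggest entry in each row
    signs = np.sign(components[np.arange(components.shape[0]), idx])
    signs[signs == 0] = 1.0                            # leave an all-zero row unchanged
    return components * signs[:, None]

rng = np.random.default_rng(5)
Xc = rng.normal(size=(50, 3)) * [3.0, 1.0, 0.3]        # synthetic data
Xc = Xc - Xc.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Same axes with arbitrary sign flips, as another SVD routine might return them.
flipped = Vt * np.array([-1.0, 1.0, -1.0])[:, None]
print(np.allclose(fix_signs(Vt), fix_signs(flipped)))  # both land on one convention
```

If you flip a component, project with the flipped version (Xc @ fixed.T) so the scores change sign along with it.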

That short block is the entire mystery dissolved. fit_transform centers the data, runs an SVD, reads the components off Vᵀ, and projects. No magic — just the master decomposition.

Where each of these actually shows up in ML

This isn't an exam topic you forget after the test. The same three ideas keep surfacing:

  • Dimensionality reduction and compression. The headline use. 50 noisy, correlated features become 5 that carry almost all the variance — faster models, less overfitting, and plots you can actually see. Image compression is the same trick: keep the top singular values of an image matrix, discard the tiny ones, and you've stored a near-identical picture in a fraction of the space.
  • Embeddings and latent structure. When you reduce a big sparse matrix of, say, users-by-items or words-by-contexts down to a dense low-dimensional space, you're doing a truncated SVD. The "latent factors" in classic recommender systems and the older word-embedding methods are singular vectors. The geometry you just learned is the geometry of machine learning representation, not a side quest.
  • Least squares, done stably. Fitting a linear model means solving an over-determined system, and the numerically robust way to do it runs through the SVD (the pseudo-inverse is built from U, Σ, Vᵀ). When a library quietly gives you a sane answer for a near-singular regression instead of exploding, SVD is usually why.
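The compression bullet above can be sketched in a few lines. The "image" here is a synthetic stand-in I built from three smooth patterns plus a little noise, so keeping k = 3 singular values is a choice that suits this toy data rather than a general rule.

```python
import numpy as np

rng = np.random.default_rng(6)
# Stand-in for a grayscale image: three smooth rank-1 patterns plus noise (synthetic).
rows = np.linspace(0.0, 1.0, 64)[:, None]
cols = np.linspace(0.0, 1.0, 48)[None, :]
image = np.sin(4 * rows) * np.cos(3 * cols) + rows * cols + 0.5 * rows ** 2
image = image + 0.01 * rng.normal(size=image.shape)

U, S, Vt = np.linalg.svd(image, full_matrices=False)

k = 3                                                # singular values to keep
approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]          # rank-k reconstruction

stored = k * (image.shape[0] + image.shape[1] + 1)   # numbers needed for the k pieces
rel_error = np.linalg.norm(image - approx) / np.linalg.norm(image)
print(stored, image.size, rel_error)

# The Frobenius error is the root-sum-of-squares of the discarded singular values.
discarded = np.sqrt(np.sum(S[k:] ** 2))
print(np.isclose(np.linalg.norm(image - approx), discarded))
```

The last check is the useful one in practice: the discarded singular values tell you the reconstruction error before you reconstruct anything.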

Notice the through-line: every one of these is "find the directions that matter and drop the rest," which is exactly the eigenvector picture from the top of the page.
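For the least-squares case, here is a minimal sketch of the pseudo-inverse built from the SVD pieces, checked against NumPy's own solvers. The design matrix, the "true" coefficients, and the noise level are all made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(100, 3))                      # over-determined: 100 equations, 3 unknowns
true_w = np.array([1.5, -2.0, 0.5])                # made-up coefficients
y = A @ true_w + 0.01 * rng.normal(size=100)

# Pseudo-inverse from the SVD pieces: pinv(A) = V @ diag(1 / S) @ U.T
U, S, Vt = np.linalg.svd(A, full_matrices=False)
w_svd = Vt.T @ ((U.T @ y) / S)

w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)    # NumPy's least-squares solver
print(np.allclose(w_svd, w_lstsq))
print(np.allclose(w_svd, np.linalg.pinv(A) @ y))
print(w_svd)                                       # close to true_w
```

This sketch divides by every singular value, which is fine for a well-conditioned A. For a near-singular problem the robust move is to drop or damp the tiny singular values instead of inverting them, which is the "drop the rest" idea again.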

The recap

  • An eigenvector is the direction a transformation doesn't rotate; the eigenvalue is how hard it stretches along it. A @ v = λ * v. That one picture carries the whole topic.
  • PCA is eigen-decomposition of the covariance matrix. Its eigenvectors are the axes the data spreads along; the eigenvalues are the variances. Center your data first.
  • SVD (X = U Σ Vᵀ) is the master decomposition — every matrix is rotate-scale-rotate — and it reads PCA's components straight off Vᵀ without forming (and squaring) the covariance matrix, which is why real libraries use it.
  • In NumPy, np.linalg.svd gives you S as singular values and Vt already transposed. PCA-via-SVD reproduces sklearn exactly — up to an arbitrary per-component sign.
  • It shows up everywhere: compression, embeddings, recommender latent factors, stable least squares.

The point isn't to compute SVD by hand for the rest of your life — that's what the library is for. It's that when PCA does something unexpected, you can reason about why instead of re-rolling the dice.

If you're at the "I call it but can't quite picture it" stage with linear algebra and you'd like to get to the "oh, that's what it's doing" stage, that's a lot of what I do one-on-one — working through exactly these intuitions on your own datasets and problems. The first session is free if you'd like to bring your questions to linear algebra and build the picture properly. No pressure.


Written by Ali Jabbary

M.Sc., P.Eng. • Expert Data Scientist & ML Engineer with 10+ years of experience. 500+ students helped worldwide. Specializing in Python, AI/ML, and turning complex problems into simple solutions.
