A foggy hilly loss landscape with a path stepping downhill toward the lowest valley, an arrow showing the gradient pointing uphill and the optimizer stepping the opposite way
From the blog · by Ali Jabbary

Gradient descent, intuitively: the one optimization idea behind nearly all of ML

Ali Jabbary
M.Sc., P.Eng.
9 min read
#mathematics #machine-learning #gradient-descent #optimization #numpy #learning-rate

Article Summary

Loss surfaces, the gradient as steepest-uphill, learning rate too big vs too small, and a 20-line NumPy example — plus where SGD, momentum and Adam fit.

A student once asked me, halfway through a session, "Wait — so how does the model actually learn? Where's the part where it gets smarter?" Fair question. We'd been throwing around words like training and fitting for an hour as if they were self-explanatory. They're not.

Here's the honest answer: nearly every model you've heard of — linear regression, logistic regression, neural networks, the transformer behind your favourite chatbot — learns by repeatedly nudging its numbers in whatever direction makes its mistakes a little smaller. That nudging procedure has a name. It's gradient descent, and once it clicks, an enormous amount of machine learning stops being magic and starts being arithmetic.

Let me build the picture from the ground up. No prerequisites beyond "I remember roughly what a slope is."

The loss surface: a landscape made of mistakes

Start with the thing a model is trying to do: make good predictions. To "make good" you need a way to score how bad the current predictions are. That score is called the loss (or cost). Low loss = good model. High loss = bad model. The entire game is: make the loss small.

Now here's the key mental shift. Your model has knobs — call them parameters, or weights. A simple model might have two. A neural network might have billions. For every possible setting of those knobs, you get some loss value. If you could plot loss as a function of the knobs, you'd get a surface.

With one knob, it's a curve — a 2D plot, knob on the x-axis, loss on the y-axis. With two knobs, it's an actual landscape: a sheet of hills and valleys floating in 3D, where altitude is loss. With a billion knobs... well, nobody can picture that, and you don't need to. The two-knob landscape is the intuition that scales. Training a model is finding the lowest point in this landscape.
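To make the two-knob landscape concrete, here is a minimal sketch. It assumes a toy model y = w*x + b, four made-up data points, and mean squared error as the loss; none of those come from the text above, they are just convenient illustrations. It evaluates the loss at every knob setting on a grid, which is the surface described here, and then looks up the lowest cell.

```python
import numpy as np

# Toy data generated from y = 2x + 1 (an assumed example, not real data)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

def loss(w, b):
    """Mean squared error of the line y = w*x + b on the toy data."""
    return np.mean((w * x + b - y) ** 2)

# Evaluate the loss for every (w, b) pair on a grid: this is the loss surface
w_values = np.linspace(-1.0, 5.0, 61)   # knob 1, in steps of 0.1
b_values = np.linspace(-2.0, 4.0, 61)   # knob 2, in steps of 0.1
surface = np.array([[loss(w, b) for b in b_values] for w in w_values])

# The lowest cell of the grid is the best knob setting we sampled
i, j = np.unravel_index(np.argmin(surface), surface.shape)
best_w, best_b = w_values[i], b_values[j]
print(f"lowest sampled loss {surface[i, j]:.4f} at w = {best_w:.1f}, b = {best_b:.1f}")
```

A brute-force grid like this only works for a couple of knobs. With a billion of them you can't afford to look everywhere, which is why the rest of the article is about walking downhill instead.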

You can't see the whole landscape, though. That's the catch. You're standing somewhere on it, in fog, and all you can feel is the ground right under your feet — the slope. The question becomes: which way is downhill, and how big a step do I take?

The gradient: a compass that points straight uphill

This is where the one idea lives.

For a curve with a single knob, the slope (the derivative) tells you two things at once: which way the curve tilts, and how steeply. A positive slope means "going right makes loss go up." A negative slope means "going right makes loss go down." So to decrease loss, you move in the direction opposite the slope. Slope positive → step left. Slope negative → step right. Always against the slope.

The gradient is just the multi-knob version of the slope. With two or a billion knobs, the gradient is a list — one slope per knob — bundled into a vector. And that vector has a beautiful property: it points in the direction of steepest ascent. Steepest uphill. The single fastest way to increase the loss.

Which sounds backwards until you realise: we wanted downhill, and the exact opposite of "steepest uphill" is "steepest downhill." So the recipe writes itself.
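If you'd like to check the "steepest uphill" claim rather than take my word for it, here is a small sketch. The two-knob function f and the helper numerical_gradient are assumptions made up for this illustration. The helper estimates one slope per knob by nudging each knob slightly, then the script compares the loss after a small step along the gradient with the loss after a small step against it.

```python
import numpy as np

def f(w):
    """An assumed two-knob loss: a bowl that is steeper along the second knob."""
    return w[0] ** 2 + 3.0 * w[1] ** 2

def numerical_gradient(func, w, eps=1e-6):
    """Estimate one slope per knob with central differences."""
    grad = np.zeros_like(w)
    for k in range(len(w)):
        nudge = np.zeros_like(w)
        nudge[k] = eps
        grad[k] = (func(w + nudge) - func(w - nudge)) / (2 * eps)
    return grad

w = np.array([1.0, 2.0])
g = numerical_gradient(f, w)      # should be close to the exact slopes [2*w1, 6*w2]

step = 0.01
uphill = f(w + step * g)          # small move along the gradient
downhill = f(w - step * g)        # small move against the gradient
print("gradient:", g)
print("loss here:", f(w), "| along gradient:", uphill, "| against gradient:", downhill)
```

The loss should rise when you follow the gradient and fall when you go against it, which is the whole recipe in miniature.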

The whole algorithm, in one line: stand where you are, compute the gradient (steepest uphill), then take a step in the opposite direction. Repeat until the ground is flat.

In symbols, for a parameter w:

w = w - learning_rate * gradient

That's it. That single update rule, applied over and over, is what "training" means under the hood for an astonishing fraction of machine learning. The minus sign is the "go downhill" part. Everything else is bookkeeping.
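With more than one knob, the same rule is applied to the whole list of parameters at once. The sketch below assumes a toy linear model y = w[0]*x + w[1] fitted with mean squared error; the data, the function name loss_and_gradient, and the choice of 500 steps are all illustrative assumptions, not anything prescribed above.

```python
import numpy as np

# Assumed toy data from y = 2x + 1; the model is y_hat = w[0] * x + w[1]
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

def loss_and_gradient(w):
    """Mean squared error and its gradient with respect to both knobs."""
    error = w[0] * x + w[1] - y
    loss = np.mean(error ** 2)
    gradient = np.array([np.mean(2 * error * x),   # slope for the weight
                         np.mean(2 * error)])      # slope for the bias
    return loss, gradient

w = np.zeros(2)            # start with both knobs at zero
learning_rate = 0.1

for step in range(500):
    loss, gradient = loss_and_gradient(w)
    w = w - learning_rate * gradient   # the update rule, applied to every knob at once

print("learned knobs:", w, "final loss:", loss)
```

The knobs should settle close to the slope 2 and intercept 1 that generated the data. Nothing in the loop knows it is doing "regression"; it only ever follows the minus sign.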

The learning rate: the size of your stride

learning_rate (often written lr, or the Greek letter η, "eta") controls how big a step you take each iteration. It is the single most consequential dial you'll touch, and it's where most people's first models quietly fail.

Think of it as your stride length in the foggy landscape.

  • Too small, and you inch downhill grain by grain. You'll get there eventually, but "eventually" might mean ten thousand iterations when a hundred would've done. Slow, but at least it works.
  • Too big, and you overshoot the valley entirely — you stride straight past the bottom and land partway up the opposite slope, then overshoot back, bouncing around the valley forever, or worse, flinging yourself further out each time until the loss explodes to infinity. (When someone says their loss "went to NaN," a learning rate that's too large is suspect number one.)

The annoying truth is there's no universal right value. Common starting points live around 0.1, 0.01, or 0.001, and you tune from there by watching whether the loss goes down smoothly. If it diverges, shrink it. If it crawls, grow it. A loss curve that decreases steadily is the single best sign your learning rate is in the right ballpark.
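The "if it diverges, shrink it; if it crawls, grow it" habit can be written down as a rough check on a recorded loss history. This is only a sketch: the function name diagnose, the 1% threshold, and the example histories are all made up for illustration, and a real training run deserves a look at the full curve rather than just its endpoints.

```python
import math

def diagnose(losses, crawl_threshold=0.01):
    """Crude read of a loss history (assumes the first loss is positive).

    The 1% crawl threshold is an arbitrary illustrative choice.
    """
    first, last = losses[0], losses[-1]
    if not all(math.isfinite(v) for v in losses) or last > first:
        return "diverging: shrink the learning rate"
    if (first - last) / first < crawl_threshold:
        return "crawling: grow the learning rate"
    return "decreasing smoothly: keep it"

# Made-up loss histories, purely for illustration
print(diagnose([100.0, 64.0, 41.0, 26.2]))
print(diagnose([100.0, 99.9, 99.8, 99.7]))
print(diagnose([100.0, 144.0, 207.4, float("inf")]))
```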

Let's actually do it: minimizing a function from scratch

Enough words. Let's find the lowest point of the simplest possible loss surface, f(x) = x², using nothing but the update rule above. The minimum is obviously at x = 0 — which is exactly why it's a good test. We can check the answer.

The derivative of x² is 2x. (That's our gradient.) We'll start at x = 10, far from the bottom, and walk downhill.

import numpy as np

def f(x):
    return x ** 2          # the loss surface

def gradient(x):
    return 2 * x           # the slope at x

x = 10.0                   # starting point, deliberately far from 0
lr = 0.1                   # learning rate (our stride length)

for step in range(20):
    g = gradient(x)        # which way is uphill, and how steep
    x = x - lr * g         # step the opposite way
    print(f"step {step:2d}:  x = {x:8.5f}   loss = {f(x):10.7f}")

Let's trace the first few steps by hand, because seeing the numbers move is the whole point. With lr = 0.1 and gradient 2x, the update simplifies to x = x - 0.1 * 2x = x - 0.2x = 0.8x. Each step multiplies x by 0.8:

step  0:  x = 8.00000    loss = 64.0000000
step  1:  x = 6.40000    loss = 40.9600000
step  2:  x = 5.12000    loss = 26.2144000
step  3:  x = 4.09600    loss = 16.7772160
step  4:  x = 3.27680    loss = 10.7374182
...
step 19:  x = 0.11529    loss =  0.0132923

It marches toward zero, exactly as promised. The loss shrinks fast at first (the slope is steep far from the bottom, so the steps are big) and slows as it nears the minimum (the slope flattens, the steps shrink automatically). That self-slowing is a feature, not a coincidence — it's baked into the fact that the gradient is the slope.

Now go change one number and watch it break, because that's where the intuition really sets in:

  • Set lr = 1.0. The update becomes x = x - 1.0 * 2x = -x. You'll bounce between 10 and -10 forever, never improving. Perfect oscillation.
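  • Set lr = 1.1. The update becomes x = x - 1.1 * 2x = -1.2x, so each step flips the sign and grows x by 20%. After 20 steps x is already around 383, and if you keep going it overflows and the loss explodes. That's divergence, the usual road to NaN.
  • Set lr = 0.001. It works, but it'll crawl: each step multiplies x by 0.998, so after 20 steps x is still about 9.61. You've barely left 10.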

One changed number is all that separates "converges nicely," "bounces forever," and "blows up." That's the learning rate earning its reputation.
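You can predict all of these outcomes before running anything. For f(x) = x², every step multiplies x by (1 - 2*lr), so after n steps x = 10 * (1 - 2*lr)^n, and the loop converges exactly when that multiplier is smaller than 1 in size, which means 0 < lr < 1. Here is a short sketch that runs the same loop for each learning rate and compares the result with that prediction; the helper name run is just for this illustration.

```python
def run(lr, x=10.0, steps=20):
    """Run the same loop on f(x) = x**2 and return the final x."""
    for _ in range(steps):
        x = x - lr * 2 * x      # gradient of x**2 is 2x
    return x

for lr in (0.1, 1.0, 1.1, 0.001):
    multiplier = 1 - 2 * lr                 # each step scales x by this factor
    predicted = 10.0 * multiplier ** 20     # closed-form value after 20 steps
    print(f"lr = {lr:<6} multiplier = {multiplier:+.3f}  "
          f"final x = {run(lr):12.5f}  predicted = {predicted:12.5f}")
```

The multiplier tells the whole story: between 0 and 1 you glide in, between -1 and 0 you zigzag in, at exactly -1 you bounce forever, and below -1 you fly apart.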

Local minima, and other honest caveats

The bowl has exactly one low point, so with a sensible learning rate gradient descent always ends up in the right place. Real loss surfaces are lumpier. A function with several valleys has several local minima: bottoms that are low compared to their immediate surroundings but not the lowest point overall (the global minimum). Plain gradient descent is greedy and a little dim: it walks downhill from wherever it happens to start and stops at the first flat spot it reaches. Start in the wrong valley, get stuck in the wrong valley.
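Here is a small sketch of "start in the wrong valley, get stuck in the wrong valley." The function x⁴ - 3x² + x is an assumed example chosen because it has two valleys, a deeper one near x ≈ -1.30 and a shallower one near x ≈ 1.13. The loop is the same one as before; only the starting point changes.

```python
def f(x):
    return x ** 4 - 3 * x ** 2 + x      # assumed example with two valleys

def gradient(x):
    return 4 * x ** 3 - 6 * x + 1       # derivative of f

def descend(x, lr=0.01, steps=500):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x = x - lr * gradient(x)
    return x

left = descend(-2.0)     # starts on the left slope
right = descend(2.0)     # starts on the right slope
print(f"start -2.0 -> x = {left:.4f}, loss = {f(left):.4f}")
print(f"start +2.0 -> x = {right:.4f}, loss = {f(right):.4f}")
```

Both runs stop where the ground is flat, but only the one that started on the left finds the lower valley. The other has no way of knowing something better exists.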

For decades people assumed this made training big networks hopeless. It mostly didn't, for reasons that are still partly an active research story — in very high dimensions, truly bad local minima turn out to be rarer than the intuition suggests, and most flat-ish spots you hit are "good enough." But the concern is real, and it's why where you start (initialization) and how you step (the variants below) genuinely matter.

What this clean intuition deliberately leaves out — name them so you can look them up:

  • Stochastic gradient descent (SGD). Computing the gradient over your entire dataset every step is expensive. SGD estimates it from a small random batch instead: noisier, but far cheaper per step, and the noise can even help shake free of bad spots. In practice almost nobody trains large models with "full-batch" gradient descent; they run SGD or one of its variants.
  • Momentum. Instead of stepping purely on the current slope, you accumulate a running "velocity," like a ball rolling downhill. It powers through small bumps and dampens the oscillation we saw at lr = 1.0.
  • Adam. The optimizer most people actually reach for. It keeps a per-parameter, self-adjusting effective learning rate — big knobs and small knobs get different strides — combining momentum-like and rate-adapting tricks. When a tutorial says "we used Adam," this is what's happening: gradient descent, with smarter, automatic stride control.
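To show how little changes when you go stochastic, here is a minimal mini-batch SGD sketch. It assumes a made-up dataset of 200 noisy points from y = 2x + 1, a batch size of 16, and a fixed random seed; all of those are illustrative choices. The only difference from full-batch gradient descent is that each gradient is estimated from a random handful of points.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable

# Assumed toy dataset: y = 2x + 1 plus a little noise
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=200)

w = np.zeros(2)                         # w[0] is the slope, w[1] the intercept
lr = 0.1
batch_size = 16

for step in range(500):
    batch = rng.choice(len(x), size=batch_size, replace=False)  # random mini-batch
    xb, yb = x[batch], y[batch]
    error = w[0] * xb + w[1] - yb
    gradient = np.array([np.mean(2 * error * xb),   # estimated from 16 points,
                         np.mean(2 * error)])       # not all 200
    w = w - lr * gradient                           # same update rule as before

print("learned slope and intercept:", w)
```

The result should land near the true slope and intercept but keep jittering a little from step to step. That jitter is the noise mentioned above.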

Every one of those is a refinement of the exact same loop you just ran. None of them throws away the core idea. They all still compute a gradient and step against it. They just argue about how big the step should be and which direction counts as "down" once you account for momentum and noise.
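To see that it really is the same loop, here are the three step rules side by side on f(x) = x². This is a sketch, not a reference implementation: the momentum version is one common formulation among several, the Adam version follows the commonly published update with its usual default constants (0.9, 0.999, 1e-8), and the learning rate of 0.1 is simply carried over from the earlier example.

```python
import math

def gradient(x):
    return 2 * x                                   # slope of f(x) = x**2

def plain(steps, lr=0.1, x=10.0):
    for _ in range(steps):
        x = x - lr * gradient(x)                   # step against the current slope
    return x

def momentum(steps, lr=0.1, beta=0.9, x=10.0):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + gradient(x)   # decaying sum of recent slopes
        x = x - lr * velocity                      # step against the velocity
    return x

def adam(steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, x=10.0):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = gradient(x)
        m = beta1 * m + (1 - beta1) * g            # momentum-like average of slopes
        v = beta2 * v + (1 - beta2) * g * g        # average of squared slopes
        m_hat = m / (1 - beta1 ** t)               # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (math.sqrt(v_hat) + eps)   # stride rescaled by slope history
    return x

for name, optimizer in (("plain", plain), ("momentum", momentum), ("adam", adam)):
    print(f"{name:>8}: x after 500 steps = {optimizer(500):.6f}")
```

On a bowl this simple the plain version is hard to beat, so don't read the numbers as a ranking. The point is structural: every variant computes a gradient and steps against some processed version of it.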

The recap

  • A model's mistakes form a loss surface over its parameters; training means finding the low point.
  • The gradient points in the direction of steepest increase; step the opposite way to go down: w = w - lr * gradient.
  • The learning rate is your stride: too small crawls, too big overshoots or explodes. A smoothly decreasing loss curve is your green light.
  • Local minima are real but, in high dimensions, less ruinous than they sound.
  • SGD, momentum, and Adam are all the same idea with better step control — start with the plain version, graduate to Adam.

If you can run that twenty-line loop, change the learning rate, and predict whether it'll converge before you hit enter, you understand the engine inside nearly every model you'll ever train. Everything fancier is a variation on this theme.

This is one of those topics where five minutes of someone watching you tweak the learning rate live — "okay, now make it bigger, what do you think happens?" — beats five hours of reading. It's exactly the kind of thing I love working through one-on-one, because the moment it clicks, it stays clicked.

Written by Ali Jabbary

M.Sc., P.Eng. • Expert Data Scientist & ML Engineer with 10+ years of experience. 500+ students helped worldwide. Specializing in Python, AI/ML, and turning complex problems into simple solutions.
