The Algorithm Behind All Deep Learning
January 2026
Last time we learned:
z = w * x + b
y = 1 if z >= 0 else 0
Update rule: w = w + eta * (y - y_hat) * x
Converges on linearly separable data
Simple and elegant
But there is a problem...
No measure of "how wrong"
Perceptron only knows: right or wrong
Missing by 0.001? Wrong
Missing by 1000? Wrong
Same update either way
Imagine learning to throw darts blindfolded.
Someone only tells you "hit" or "miss" - never "a little to the left".
A way to measure how far we are from the target
Not just direction, but magnitude
A continuous measure of error
Adaptive Linear Neuron (Adaline)
Bernard Widrow and Ted Hoff, Stanford, 1960
Update weights based on the continuous output
Not the thresholded prediction
|  | Perceptron | Adaline |
|---|---|---|
| Computes | z = w * x + b | z = w * x + b |
| Updates on | step(z) | z directly |
| Error signal | Binary | Continuous |
Threshold only used for final prediction, not learning
Measuring "how wrong" mathematically
Mean Squared Error (MSE): square the errors, take the average
MSE = (1/n) * sum((y - output)^2)
Positive and negative errors do not cancel out
Larger errors are penalized more heavily
Differentiable everywhere
That last one is crucial...
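Because MSE is differentiable everywhere, its gradient has a simple closed form. A minimal NumPy sketch, assuming X, y, w, b are an already-defined feature matrix, target vector, weight vector, and bias (none are given on these slides):

import numpy as np

def mse_and_gradient(X, y, w, b):
    # Forward pass: continuous output and errors
    output = X.dot(w) + b
    errors = y - output
    loss = (errors ** 2).mean()
    # Closed-form partial derivatives of MSE w.r.t. w and b
    grad_w = -(2.0 / X.shape[0]) * X.T.dot(errors)
    grad_b = -2.0 * errors.mean()
    return loss, grad_w, grad_b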
Imagine a bowl-shaped surface
Height = loss (error)
Position = weight values
Goal: find the lowest point
MSE with linear model = convex surface
Only one minimum (global)
No local minima to get stuck in
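A one-weight illustration of the bowl, a sketch with toy data (everything here is made up for illustration):

import numpy as np
import matplotlib.pyplot as plt

X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # toy data where the true weight is 2
ws = np.linspace(-2, 6, 100)
losses = [((y - w * X) ** 2).mean() for w in ws]
plt.plot(ws, losses)            # convex curve with one global minimum at w = 2
plt.xlabel('w')
plt.ylabel('MSE')
plt.show()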
Gradient descent: the algorithm that powers all of deep learning
You are blindfolded on a hilly landscape.
To find the lowest point, feel which way is downhill and step that direction.
Repeat until you stop going down.
The gradient tells us which direction is "uphill"
It is the vector of partial derivatives
gradient = [dL/dw1, dL/dw2, ..., dL/db]
Move in the opposite direction of the gradient
Opposite because gradient points uphill
For MSE loss with linear activation:
Error times input, averaged over all samples
for epoch in range(n_epochs):                        # n_epochs: number of passes
    output = X.dot(w) + b                            # net input for all samples
    errors = y - output                              # continuous errors
    w += eta * (2.0 / X.shape[0]) * X.T.dot(errors)  # step opposite the gradient
    b += eta * 2.0 * errors.mean()                   # same step for the bias
The learning rate (eta): the most important hyperparameter
Step size in weight space
How much to move each update
Too high: overshoots the minimum
Bounces back and forth; may diverge to infinity
Loss explodes
Too low: tiny steps toward the minimum
Takes forever to converge; may stop before reaching the minimum
Training is too slow
Just right: steady descent toward the minimum
Converges in reasonable time
Usually found by trial and error
Start with 0.01 or 0.1
If loss explodes, reduce by 10x
If too slow, increase by 10x
Modern optimizers adapt this automatically (Adam, RMSprop)
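A quick sweep to find eta by trial and error, a sketch that assumes the AdalineGD class defined later in these slides and standardized data X_std, y:

for eta in (0.0001, 0.001, 0.01, 0.1):
    ada = AdalineGD(eta=eta, n_iter=20).fit(X_std, y)
    # Rising loss curve: eta too high; barely falling: eta too low
    print(eta, ada.losses_[0], ada.losses_[-1])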
Why preprocessing matters
Features on different scales cause problems
Age: 0-100
Income: 0-1,000,000
Gradient descent struggles with elongated loss surfaces
Standardization: centers data at 0, scales to unit variance
Each feature now has mean=0, std=1
Loss surface becomes more "spherical"
Gradient points more directly to minimum
Same learning rate works for all features
Converges much faster
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
Always standardize before gradient descent
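A quick sanity check of the transform, assuming X is a NumPy feature matrix:

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # approximately 0 for every feature
print(X_std.std(axis=0))   # approximately 1 for every feature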
Three flavors of gradient descent
Batch gradient descent: compute the gradient over all training samples
One update per epoch
Smooth convergence
Expensive for large datasets
Stochastic gradient descent (SGD): compute the gradient for one sample at a time
n updates per epoch
Noisy but fast
Can escape shallow local minima
Mini-batch gradient descent: best of both worlds
Compute the gradient over small batches (32, 64, 128), sketched below
This is what everyone uses in practice
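A minimal mini-batch loop, a sketch only (the batch size, shuffling scheme, and names like n_epochs are assumptions, not from these slides):

rng = np.random.RandomState(1)
batch_size = 32
for epoch in range(n_epochs):
    idx = rng.permutation(X.shape[0])              # reshuffle every epoch
    for start in range(0, X.shape[0], batch_size):
        batch = idx[start:start + batch_size]
        errors = y[batch] - (X[batch].dot(w) + b)  # errors on this batch only
        w += eta * (2.0 / len(batch)) * X[batch].T.dot(errors)
        b += eta * 2.0 * errors.mean()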
SGD enables learning from streaming data
Update model as new samples arrive
No need to store entire dataset
Adapts to changing distributions
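One way to sketch a streaming update, as a hypothetical partial_fit helper (the name and signature are assumptions):

def partial_fit(w, b, x_new, y_new, eta=0.01):
    # One SGD step on a single incoming sample; the dataset is never stored
    error = y_new - (np.dot(x_new, w) + b)
    w = w + eta * 2.0 * error * x_new
    b = b + eta * 2.0 * error
    return w, b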
AdalineGD implementation
import numpy as np

class AdalineGD:
    """Adaline trained with full-batch gradient descent."""
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta                    # learning rate
        self.n_iter = n_iter              # number of epochs
        self.random_state = random_state  # seed for weight initialization

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1])
        self.b_ = np.float64(0.)
        self.losses_ = []
        for i in range(self.n_iter):
            net_input = self.net_input(X)
            output = self.activation(net_input)
            errors = (y - output)
            # Full-batch MSE gradient step: move opposite the gradient
            self.w_ += self.eta * 2.0 * X.T.dot(errors) / X.shape[0]
            self.b_ += self.eta * 2.0 * errors.mean()
            loss = (errors ** 2).mean()
            self.losses_.append(loss)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_) + self.b_

    def activation(self, X):
        return X  # identity: Adaline learns on the continuous output

    def predict(self, X):
        # Threshold used only for final class labels, never for learning
        return np.where(self.activation(self.net_input(X)) >= 0.5, 1, 0)
# Standardize features (X, y: an assumed, already-loaded two-class dataset)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Train Adaline
ada = AdalineGD(eta=0.5, n_iter=20)
ada.fit(X_std, y)

# Check convergence: MSE should fall smoothly across epochs
import matplotlib.pyplot as plt
plt.plot(range(1, len(ada.losses_) + 1), ada.losses_)
plt.xlabel('Epochs')
plt.ylabel('MSE')
plt.show()
Compare Perceptron and Adaline side by side
Adaline is not just historical
Gradient descent is how we train all neural networks
GPT? Gradient descent.
Image recognition? Gradient descent.
Self-driving cars? Gradient descent.
| Adaline | Deep Networks |
|---|---|
| One layer | Many layers |
| Linear activation | Non-linear (ReLU, etc) |
| Direct gradient | Backpropagation |
| MSE loss | Cross-entropy, etc |
But the core algorithm is the same
1. Define a loss function that measures error
2. Compute the gradient of loss w.r.t. weights
3. Update weights in the opposite direction
4. Repeat until convergence
| Concept | What It Does |
|---|---|
| MSE Loss | Measures how wrong we are (continuously) |
| Gradient | Points uphill in weight space |
| Gradient Descent | Steps downhill to minimize loss |
| Learning Rate | Controls step size |
| Standardization | Makes optimization easier |
| SGD | Scales to large datasets |
Multi-Layer Perceptrons
Breaking the linear barrier with hidden layers