The Perceptron

Where Machine Learning Began

January 2026

What is Machine Learning?

Traditional programming:

Rules + Data → Output

Machine learning:

Data + Output → Rules

The Key Insight

Instead of writing rules by hand...

...let the computer learn them from examples

Example: Spam Filter

Traditional: write rules like "if contains Nigerian prince, mark spam"

ML: show thousands of labeled emails, let algorithm figure out the patterns

1943

McCulloch and Pitts

The first mathematical model of a neuron

A simple logic gate with binary outputs

1957

Frank Rosenblatt

Cornell Aeronautical Laboratory

Publishes the Perceptron learning rule

"The embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."

— The New York Times, 1958

The Hype Cycle

1957: Perceptron published
1969: Minsky and Papert critique
1970s: AI Winter
1986: Backpropagation revival

The Biological Inspiration

Neurons are nerve cells that process and transmit signals

How Biological Neurons Work

Multiple signals arrive at the dendrites

They are integrated in the cell body

If the accumulated signal exceeds a threshold...

An output signal is sent via the axon

The Artificial Neuron

[Diagram: inputs x₁, x₂, x₃ flow into a summation node Σ, which produces the output ŷ]

Inputs → Weighted Sum → Threshold → Output

The Math

Just arithmetic, nothing scary

Step 1: Weighted Sum (Net Input)

z = w₁x₁ + w₂x₂ + ... + wₘxₘ + b

Vector form: z = w · x + b

What Are These Terms?

x = input features (the data)

w = weights (importance of each feature)

b = bias (shifts the decision boundary)

z = net input (the weighted sum)

Step 2: Decision (Activation)

ŷ = 1 if z ≥ 0, else 0

This is called a unit step function

That Is the Whole Forward Pass

import numpy as np

def predict(x, weights, bias):
    z = np.dot(x, weights) + bias   # weighted sum plus bias
    return 1 if z >= 0 else 0       # unit step function
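Calling it with some hypothetical weights (illustrative values, not learned ones; this continues from the code above):

weights = np.array([0.5, -0.3])   # hypothetical weights
bias = -0.1                       # hypothetical bias

print(predict(np.array([1.0, 0.5]), weights, bias))  # z = 0.25 -> 1
print(predict(np.array([0.2, 1.0]), weights, bias))  # z = -0.3 -> 0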

Intuition: Dot Product as Similarity

The dot product w · x measures how similar the input is to the weight vector

High similarity → positive z → class 1

Low similarity → negative z → class 0
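A minimal sketch of that intuition, with vectors chosen arbitrarily for illustration:

import numpy as np

w = np.array([1.0, 2.0])
x_aligned = np.array([0.9, 1.8])    # points roughly the same way as w
x_opposed = np.array([-1.0, -2.0])  # points the opposite way

print(np.dot(w, x_aligned))  # 4.5  -> positive -> class 1
print(np.dot(w, x_opposed))  # -5.0 -> negative -> class 0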

The Geometry

What is the perceptron actually doing?

Finding a Decision Boundary

In 2D: the perceptron finds a line

In 3D: a plane

In nD: a hyperplane

The Decision Boundary Equation

w · x + b = 0

Points where z = 0

Classification by Position

w · x + b > 0 → class 1 (one side)

w · x + b < 0 → class 0 (other side)
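A quick check with illustrative numbers (this w and b define the line x₁ - x₂ - 0.5 = 0):

import numpy as np

w = np.array([1.0, -1.0])
b = -0.5

for x in [np.array([2.0, 0.0]), np.array([0.0, 2.0])]:
    z = np.dot(w, x) + b
    print(x, '-> class', 1 if z >= 0 else 0)
# [2. 0.] -> class 1   (z = 1.5, one side of the line)
# [0. 2.] -> class 0   (z = -2.5, the other side)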

What Controls the Boundary?

Weights control the orientation (which direction it points)

Bias controls the position (shifts it up/down)

The Learning Algorithm

How does it find the right weights?

The Perceptron Learning Rule

1. Initialize the weights and bias to 0 or small random numbers
2. For each training example, compute the predicted output
3. Update the weights and bias
4. Repeat until convergence or a maximum number of iterations

The Update Rule

w = w + η(y - ŷ)x
b = b + η(y - ŷ)

η (eta) = the learning rate, typically a small constant between 0.0 and 1.0

When Prediction Is Correct

True = 0, Predicted = 0: η(0-0)x = 0

True = 1, Predicted = 1: η(1-1)x = 0

No update needed

When Prediction Is Wrong

True = 1, Predicted = 0: η(1-0)x = ηx

Weights move toward the input

True = 0, Predicted = 1: η(0-1)x = -ηx

Weights move away from the input
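One wrong prediction worked through with made-up numbers (η = 0.1, weights starting at zero):

import numpy as np

eta = 0.1
w, b = np.array([0.0, 0.0]), 0.0

x, y = np.array([2.0, 1.0]), 0              # true class is 0
y_hat = 1 if np.dot(w, x) + b >= 0 else 0   # z = 0, so it predicts 1: wrong

update = eta * (y - y_hat)                  # 0.1 * (0 - 1) = -0.1
w = w + update * x                          # [-0.2, -0.1]: pushed away from x
b = b + update                              # -0.1
print(w, b)                                 # [-0.2 -0.1] -0.1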

Geometric Interpretation

Each update tilts the decision boundary

Trying to get the misclassified point on the correct side

The Code

Complete implementation in Python

The Perceptron Class

import numpy as np

class Perceptron:
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta                    # learning rate (between 0.0 and 1.0)
        self.n_iter = n_iter              # passes over the training set
        self.random_state = random_state  # seed for reproducible weight init

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        # small random weights, one per feature; bias starts at zero
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1])
        self.b_ = np.float64(0.)
        self.errors_ = []                 # misclassifications per epoch

        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                # the update is zero whenever the prediction is correct
                update = self.eta * (target - self.predict(xi))
                self.w_ += update * xi
                self.b_ += update
                errors += int(update != 0.0)
            self.errors_.append(errors)
        return self

Prediction Methods

def net_input(self, X):
    # weighted sum plus bias: z = w · x + b
    return np.dot(X, self.w_) + self.b_

def predict(self, X):
    # unit step: class 1 when the net input is non-negative, else class 0
    return np.where(self.net_input(X) >= 0.0, 1, 0)

Why Random Initialization?

If all weights start at zero...

Learning rate only affects scale, not direction

Small random values avoid this problem
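A rough way to check the claim above: run the update rule by hand from all-zero weights with two different learning rates. This is only a sketch; the tiny dataset and the train_from_zero helper are made up for illustration.

import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])

def train_from_zero(eta, n_iter=10):
    w, b = np.zeros(2), 0.0
    for _ in range(n_iter):
        for xi, target in zip(X, y):
            update = eta * (target - (1 if np.dot(w, xi) + b >= 0 else 0))
            w += update * xi
            b += update
    return w, b

print(train_from_zero(0.1))  # (array([0.1, 0.2]), -0.1)
print(train_from_zero(1.0))  # (array([1., 2.]), -1.0): same direction, only the scale differs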

Training on Real Data

The Iris Dataset

The Classic ML Dataset

150 flower samples, 3 species

4 features: sepal length, sepal width, petal length, petal width

We will use 2 species, 2 features for visualization

Loading and Preparing Data

import numpy as np
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(url, header=None, encoding='utf-8')

# Extract setosa and versicolor (first 100 samples)
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', 0, 1)

# Extract sepal length and petal length
X = df.iloc[0:100, [0, 2]].values

Training

ppn = Perceptron(eta=0.1, n_iter=10)
ppn.fit(X, y)

# Check convergence
print(ppn.errors_)
# [2, 2, 3, 2, 1, 0, 0, 0, 0, 0]

Converges after 6 epochs with zero errors
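A quick way to visualize the convergence (a sketch; assumes matplotlib is installed and reuses ppn from above):

import matplotlib.pyplot as plt

plt.plot(range(1, len(ppn.errors_) + 1), ppn.errors_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Number of updates')
plt.show()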

Live Demo

Open Interactive Demo

Click to add points, watch the perceptron learn

The Limitation

What the perceptron cannot do

The XOR Problem

XOR      x₂ = 0   x₂ = 1
x₁ = 0     0        1
x₁ = 1     1        0

Try drawing a single line to separate the 0s from the 1s

You cannot

Linear Separability

Convergence is only guaranteed if classes are linearly separable

If not separable, weights never stop updating

(unless you set a maximum number of epochs)
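A sketch of that failure, reusing the Perceptron class from earlier on the four XOR points:

import numpy as np

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

ppn_xor = Perceptron(eta=0.1, n_iter=20)
ppn_xor.fit(X_xor, y_xor)
print(ppn_xor.errors_)  # at least one update in every epoch; never reaches 0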

The Critique That Killed AI

1969: Minsky and Papert publish their analysis

Proved that a single-layer perceptron cannot solve XOR

Funding dried up, research stopped

The AI Winter began

The Solution

Add more layers

Multi-Layer Perceptron (MLP)

Neural Networks

Deep Learning

What We Learned

Concept               Perceptron    Modern Neural Nets
Weighted sum          Yes           Yes
Activation function   Yes (step)    Yes (ReLU, etc.)
Bias term             Yes           Yes
Iterative learning    Yes           Yes
Multiple layers       No            Yes
Backpropagation       No            Yes

The Takeaway

The perceptron is just:

if (dot_product + bias >= 0) return 1

But it introduced every key idea we still use today

Every layer of every deep network does the same thing:

weighted sum → activation → repeat

Resources

Next Time

Adaline and Gradient Descent

How to optimize with calculus

Questions?

i33ym.cc