Supervised Image Classification

From satellite pixels to land cover maps using training data

2026-02-27

Global Forest Watch, maintained by the World Resources Institute, tracks deforestation in near-real time across every forested hectare on Earth using Landsat and Sentinel imagery. Every time a patch of forest is cleared — in the Amazon, the Congo Basin, the forests of Southeast Asia — the spectral signature of that pixel changes: less near-infrared, more red and shortwave infrared. Algorithms running on Google Earth Engine classify millions of pixels per day, distinguishing forest from agriculture, urban, bare soil, and water. The foundation of all of it is supervised image classification: the process of teaching a computer to recognise land cover types from their spectral signatures, then applying that knowledge across an entire image.

Supervised classification works because different land cover types reflect electromagnetic radiation differently. Water absorbs near-infrared. Healthy vegetation reflects it strongly. Urban surfaces reflect broadly across the visible. Soil reflectance varies with moisture and mineralogy. These spectral fingerprints are distinctive enough in multidimensional band space that a statistical classifier can learn them from a modest number of training pixels and generalise to the rest of the image. The maximum likelihood classifier — the workhorse of remote sensing for half a century — assumes each class follows a multivariate normal distribution and assigns each pixel to the class whose probability density is highest at that pixel’s spectral location. This lesson derives that classifier from first principles, introduces k-nearest neighbour as a non-parametric alternative, and shows how the confusion matrix measures classification accuracy.

1. The Question

How do you create a land cover map from a Landsat satellite image?

Image classification assigns every pixel to a category (class):

Land cover classes:

- Water
- Forest
- Urban/Built-up
- Agriculture (crops)
- Grassland/Pasture
- Barren land

Supervised classification workflow:

1. Select training samples: Identify pixels of known class
2. Extract signatures: Calculate spectral characteristics per class
3. Train classifier: Learn decision boundaries in feature space
4. Classify image: Assign every pixel to most likely class
5. Assess accuracy: Compare results to ground truth

The mathematical question: Given training data (pixels with known labels) in multi-dimensional spectral space, how do we optimally classify unknown pixels?


2. The Conceptual Model

Feature Space

Each pixel is a point in n-dimensional space where n = number of bands.

Landsat 8 example (6 bands):

- Band 2 (Blue): 0.45-0.51 μm
- Band 3 (Green): 0.53-0.59 μm
- Band 4 (Red): 0.64-0.67 μm
- Band 5 (NIR): 0.85-0.88 μm
- Band 6 (SWIR1): 1.57-1.65 μm
- Band 7 (SWIR2): 2.11-2.29 μm

Pixel = 6D point: (B_2, B_3, B_4, B_5, B_6, B_7)

Classes form clusters in feature space:

- Water: Low NIR (absorbed), moderate visible
- Vegetation: Low red (chlorophyll absorption), high NIR (leaf scattering)
- Urban: Moderate across all bands
- Bare soil: Red-NIR similar, high SWIR

Training Data

Training samples: Pixels of known class (ground truth).

Collection methods:

- Field surveys (GPS + visual ID)
- High-resolution imagery interpretation
- Existing maps
- Expert knowledge

Requirements:

- Representative of class variability
- Spatially distributed across scene
- Sufficient quantity (50-100 pixels per class minimum)

Classification Algorithms

1. Maximum Likelihood (Parametric)

Assumes class signatures follow multivariate normal distribution.

Classifies to class with highest probability given pixel values.

2. K-Nearest Neighbors (Non-parametric)

Finds k closest training pixels in feature space.

Assigns majority class among neighbors.

3. Decision Trees / Random Forest

Learns hierarchical rules (if-then statements).

Ensembles of trees for robustness.

4. Support Vector Machines

Finds optimal hyperplanes separating classes.

Effective in high dimensions.


3. Building the Mathematical Model

Multivariate Normal Distribution

Class i signature: Mean vector \boldsymbol{\mu}_i and covariance matrix \boldsymbol{\Sigma}_i

For pixel with values \mathbf{x} = (x_1, x_2, \ldots, x_n):

Probability density:

p(\mathbf{x} | C_i) = \frac{1}{(2\pi)^{n/2} |\boldsymbol{\Sigma}_i|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right)

Where:

- n = number of bands
- |\boldsymbol{\Sigma}_i| = determinant of covariance matrix
- \boldsymbol{\Sigma}_i^{-1} = inverse of covariance matrix

Mahalanobis distance:

D_i^2 = (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)

Measures distance in units of standard deviations (accounts for correlation).
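
Both quantities translate directly into NumPy. The sketch below (function names are mine, not from any particular package) evaluates the squared Mahalanobis distance and the multivariate normal density for a single pixel:

```python
import numpy as np

def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance D^2 = (x - mu)^T Sigma^-1 (x - mu)."""
    d = np.asarray(x) - np.asarray(mu)
    return float(d @ np.linalg.inv(cov) @ d)

def gaussian_density(x, mu, cov):
    """Multivariate normal density p(x | C_i) for an n-band pixel."""
    n = len(mu)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * mahalanobis_sq(x, mu, cov)) / norm
```

With an identity covariance the Mahalanobis distance reduces to squared Euclidean distance; off-diagonal covariance terms are what let it account for band-to-band correlation.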

Maximum Likelihood Classification

Bayes’ theorem:

P(C_i | \mathbf{x}) = \frac{p(\mathbf{x} | C_i) P(C_i)}{p(\mathbf{x})}

Where:

- P(C_i | \mathbf{x}) = posterior probability (class given pixel)
- p(\mathbf{x} | C_i) = likelihood (pixel given class)
- P(C_i) = prior probability (class frequency)
- p(\mathbf{x}) = evidence (constant for all classes)

Classification rule:

\hat{C} = \arg\max_i P(C_i | \mathbf{x})

Equivalent (taking log):

\hat{C} = \arg\max_i \left[\ln P(C_i) - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| - \frac{1}{2}D_i^2\right]

If equal priors and equal covariances: Simplifies to minimum Mahalanobis distance.
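
A minimal sketch of this log-discriminant rule, assuming per-class statistics are supplied as (mean, covariance) pairs:

```python
import numpy as np

def ml_classify(x, class_stats, priors=None):
    """Pick argmax_i [ln P(C_i) - 0.5 ln|Sigma_i| - 0.5 D_i^2].

    class_stats: list of (mean_vector, covariance_matrix), one per class.
    """
    k = len(class_stats)
    priors = priors or [1.0 / k] * k
    scores = []
    for (mu, cov), p in zip(class_stats, priors):
        d = np.asarray(x) - np.asarray(mu)
        d2 = d @ np.linalg.inv(cov) @ d              # squared Mahalanobis distance
        scores.append(np.log(p) - 0.5 * np.log(np.linalg.det(cov)) - 0.5 * d2)
    return int(np.argmax(scores))
```

With equal priors and equal covariances both non-distance terms cancel across classes, and the rule reduces to picking the class with the smallest Mahalanobis distance, as stated above.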

K-Nearest Neighbors (k-NN)

Algorithm:

For pixel x:
    1. Calculate distance to all training pixels
    2. Find k nearest neighbors
    3. Count class frequencies among neighbors
    4. Assign majority class

Distance metric (Euclidean in feature space):

d(\mathbf{x}, \mathbf{x}_j) = \sqrt{\sum_{b=1}^{n} (x_b - x_{jb})^2}

Typical k values: 3, 5, 7 (odd values avoid ties in two-class problems)

Pros: Simple, no distribution assumptions
Cons: Computationally expensive, sensitive to k choice
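
The four steps above fit in a few lines of NumPy; array shapes and names here are assumptions for illustration:

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=5):
    """Assign the majority class among the k nearest training pixels.

    train_X: (m, n) array of m training pixels with n bands.
    train_y: (m,) array of class labels.
    """
    dists = np.linalg.norm(train_X - x, axis=1)   # Euclidean distance to every training pixel
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbours
    return Counter(train_y[nearest]).most_common(1)[0][0]
```

Note the cost: every pixel to classify requires distances to all m training pixels, which is why k-NN is rarely used on full scenes without spatial indexing.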

Accuracy Assessment

Confusion matrix: Cross-tabulation of predicted vs. actual classes.

Example:

              Water (pred)   Forest (pred)   Urban (pred)
Water (true)        85              2              3
Forest (true)        1             92              7
Urban (true)         5              6             89

Metrics:

Overall accuracy:

OA = \frac{\sum \text{diagonal}}{\text{total}} = \frac{85 + 92 + 89}{290} = \frac{266}{290} \approx 91.7\%

(The reference totals are 90 + 100 + 100 = 290 pixels.)

Producer’s accuracy (recall):

PA_i = \frac{\text{correct}_i}{\text{reference total}_i}

User’s accuracy (precision):

UA_i = \frac{\text{correct}_i}{\text{classified total}_i}

Kappa coefficient: Agreement beyond chance

\kappa = \frac{OA - P_e}{1 - P_e}

Where P_e = agreement expected by chance, computed from the row and column totals of the confusion matrix.
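
All four metrics follow mechanically from the matrix. A NumPy sketch using the example table above (whose entries sum to 290):

```python
import numpy as np

# Rows = reference (true) class, columns = predicted class, as in the table above.
cm = np.array([[85, 2, 3],
               [1, 92, 7],
               [5, 6, 89]])

total = cm.sum()                                 # 290 reference pixels
oa = np.trace(cm) / total                        # overall accuracy ~ 0.917
producers = np.diag(cm) / cm.sum(axis=1)         # producer's accuracy (recall) per class
users = np.diag(cm) / cm.sum(axis=0)             # user's accuracy (precision) per class

# Chance agreement P_e from the row and column marginals, then kappa
pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total**2
kappa = (oa - pe) / (1 - pe)
```

Producer's accuracy divides by row (reference) totals and user's accuracy by column (classified) totals; confusing the two is a common error in accuracy reporting.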


4. Worked Example by Hand

Problem: Classify a pixel using maximum likelihood.

Pixel values: Red = 0.10, NIR = 0.45

Training statistics (2 bands for simplicity):

Class 1 (Vegetation):

- Mean: \boldsymbol{\mu}_1 = (0.08, 0.50) (low red, high NIR)
- Covariance: \boldsymbol{\Sigma}_1 = \begin{pmatrix} 0.01 & 0 \\ 0 & 0.02 \end{pmatrix}

Class 2 (Soil):

- Mean: \boldsymbol{\mu}_2 = (0.25, 0.30) (moderate both)
- Covariance: \boldsymbol{\Sigma}_2 = \begin{pmatrix} 0.02 & 0 \\ 0 & 0.02 \end{pmatrix}

Assume equal priors: P(C_1) = P(C_2) = 0.5

Solution

Step 1: Compute Mahalanobis distance to each class

Class 1 (Vegetation):

\mathbf{x} - \boldsymbol{\mu}_1 = (0.10 - 0.08, 0.45 - 0.50) = (0.02, -0.05)

\boldsymbol{\Sigma}_1^{-1} = \begin{pmatrix} 100 & 0 \\ 0 & 50 \end{pmatrix}

D_1^2 = (0.02, -0.05) \begin{pmatrix} 100 & 0 \\ 0 & 50 \end{pmatrix} \begin{pmatrix} 0.02 \\ -0.05 \end{pmatrix}

= (2, -2.5) \begin{pmatrix} 0.02 \\ -0.05 \end{pmatrix}

= 0.04 + 0.125 = 0.165

Class 2 (Soil):

\mathbf{x} - \boldsymbol{\mu}_2 = (0.10 - 0.25, 0.45 - 0.30) = (-0.15, 0.15)

\boldsymbol{\Sigma}_2^{-1} = \begin{pmatrix} 50 & 0 \\ 0 & 50 \end{pmatrix}

D_2^2 = (-0.15, 0.15) \begin{pmatrix} 50 & 0 \\ 0 & 50 \end{pmatrix} \begin{pmatrix} -0.15 \\ 0.15 \end{pmatrix}

= (-7.5, 7.5) \begin{pmatrix} -0.15 \\ 0.15 \end{pmatrix}

= 1.125 + 1.125 = 2.25

Step 2: Compute discriminant function

For simplicity, drop the \ln|\boldsymbol{\Sigma}_i| term (strictly, \boldsymbol{\Sigma}_1 \neq \boldsymbol{\Sigma}_2 here, but including the term does not change the outcome).

Class 1: g_1 = \ln(0.5) - 0.5(0.165) = -0.693 - 0.083 = -0.776

Class 2: g_2 = \ln(0.5) - 0.5(2.25) = -0.693 - 1.125 = -1.818

Step 3: Classify

\max(g_1, g_2) = \max(-0.776, -1.818) = -0.776

Classification: Vegetation (Class 1)

Interpretation: Pixel is closer to vegetation signature (low red, high NIR).
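
The hand calculation can be checked numerically; a short NumPy sketch reproducing each step:

```python
import numpy as np

x = np.array([0.10, 0.45])                        # pixel: Red, NIR
mu1, cov1 = np.array([0.08, 0.50]), np.diag([0.01, 0.02])   # vegetation
mu2, cov2 = np.array([0.25, 0.30]), np.diag([0.02, 0.02])   # soil

def d2(x, mu, cov):
    d = x - mu
    return d @ np.linalg.inv(cov) @ d

D1, D2 = d2(x, mu1, cov1), d2(x, mu2, cov2)       # 0.165 and 2.25
g1 = np.log(0.5) - 0.5 * D1                       # ~ -0.776
g2 = np.log(0.5) - 0.5 * D2                       # ~ -1.818
print("class:", "Vegetation" if g1 > g2 else "Soil")
```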


5. Computational Implementation

Below is an interactive image classification demo.

<label>
  Classification method:
  <select id="class-method">
    <option value="maximum-likelihood" selected>Maximum Likelihood</option>
    <option value="knn">K-Nearest Neighbors (k=5)</option>
    <option value="minimum-distance">Minimum Distance</option>
  </select>
</label>
<label>
  Show feature space:
  <input type="checkbox" id="show-feature-space" checked>
</label>
<label>
  Show training samples:
  <input type="checkbox" id="show-training" checked>
</label>
<div class="classify-info">
  <p><strong>Classes:</strong> Water (blue), Vegetation (green), Urban (gray), Bare soil (brown)</p>
  <p><strong>Overall accuracy:</strong> <span id="overall-accuracy">--</span>%</p>
</div>
<canvas id="classify-canvas" width="700" height="350" style="border: 1px solid #ddd;"></canvas>

Try this:

- Maximum likelihood: Statistical classification based on class distributions
- K-nearest neighbors: Non-parametric, uses nearest training samples
- Minimum distance: Simplest method, distance to class means
- Feature space plot: Shows class separation in Red-NIR space
- Training samples: White boxes show training regions
- Color legend: Blue (water), Green (vegetation), Gray (urban), Brown (soil)
- Notice: Classes cluster in feature space; high NIR separates vegetation!

Key insight: Spectral signatures enable automatic land cover mapping from satellite imagery!


6. Interpretation

Global Land Cover Mapping

Products:

- MODIS Land Cover (500m, annual, 2001-present)
- ESA CCI Land Cover (300m, annual, 1992-2020)
- FROM-GLC (30m, Landsat-based)

Uses:

- Climate modelling (surface albedo, roughness)
- Carbon cycle (vegetation biomass)
- Agriculture monitoring
- Urban growth tracking

Crop Type Mapping

USDA Cropland Data Layer (30m, annual, US):

- Corn vs. soybeans vs. wheat
- Training: Field surveys + farmer reports
- Accuracy: >90% for major crops

Applications:

- Yield prediction
- Irrigation demand
- Crop insurance
- Policy planning

Forest Mapping

Forest/non-forest:

- Binary classification (easier, higher accuracy)
- Global Forest Change product (Hansen et al.)
- 30m resolution, 2000-2023

Forest type:

- Deciduous vs. evergreen vs. mixed
- Requires phenology (time series)


7. What Could Go Wrong?

Insufficient Training Data

Small sample size → Poor class statistics → Misclassification

Solution:

- Collect 50-100 pixels per class minimum
- Ensure spatial distribution across scene
- Include class variability (e.g., young vs. mature forest)

Class Confusion

Spectrally similar classes:

- Dry grass vs. bare soil
- Urban vs. bare rock
- Shallow water vs. wet sand

Separability analysis:

D_{ij} = \frac{||\boldsymbol{\mu}_i - \boldsymbol{\mu}_j||^2}{\sigma_i^2 + \sigma_j^2}

Low separability (< 1.0): Consider merging classes or adding bands.
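
A one-function sketch of this index, treating \sigma_i^2 and \sigma_j^2 as pooled scalar variances per class (a simplification; operational tools typically use transformed divergence or Jeffries-Matusita distance instead):

```python
import numpy as np

def separability(mu_i, mu_j, var_i, var_j):
    """Squared distance between class means, normalised by summed pooled variances."""
    diff = np.asarray(mu_i) - np.asarray(mu_j)
    return float(diff @ diff / (var_i + var_j))
```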

Shadow and Topography

Terrain shadows reduce brightness → misclassified as water.

Solution:

- Topographic correction (normalize for solar angle + slope)
- Use indices less sensitive to illumination (e.g., NDVI)

Mixed Pixels

30m Landsat pixel may contain multiple cover types:

- Edge between forest and field
- Urban with scattered vegetation

Sub-pixel classification:

- Estimate fractional cover (30% forest, 70% grass)
- Spectral unmixing algorithms
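
Linear spectral unmixing treats each mixed pixel as a least-squares problem; a sketch with hypothetical two-band endmember spectra (all numbers here are invented for illustration):

```python
import numpy as np

# Hypothetical endmember spectra: columns are pure forest and pure grass,
# rows are the Red and NIR bands.
E = np.array([[0.05, 0.12],
              [0.50, 0.35]])

x = np.array([0.099, 0.395])                 # observed mixed-pixel reflectance

f, *_ = np.linalg.lstsq(E, x, rcond=None)    # unconstrained least-squares fractions
# f ~ [0.3, 0.7]: 30% forest, 70% grass
```

A production unmixing algorithm would additionally constrain the fractions to be non-negative and sum to one; this sketch omits those constraints.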


8. Extension: Random Forest Classification

Ensemble of decision trees.

Algorithm:

For each tree:
    1. Bootstrap sample of training data
    2. At each node:
       - Select random subset of features
       - Find best split on these features
    3. Grow tree to maximum depth
    4. Store tree

Classification:
    - Each tree votes
    - Assign majority class
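
In practice this ensemble is rarely hand-rolled; a sketch using scikit-learn's RandomForestClassifier on synthetic two-band training data (the class means and spreads below are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic 2-band (Red, NIR) training pixels: water (low NIR) vs vegetation (high NIR)
water = rng.normal([0.05, 0.03], 0.01, size=(100, 2))
veg = rng.normal([0.06, 0.45], 0.02, size=(100, 2))
X = np.vstack([water, veg])
y = np.array([0] * 100 + [1] * 100)          # 0 = water, 1 = vegetation

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
label = clf.predict([[0.05, 0.40]])[0]       # high NIR, so vegetation
```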

Advantages:

- Handles non-linear relationships
- Robust to outliers
- Feature importance ranking
- High accuracy

Disadvantages:

- Black box (hard to interpret)
- Computationally intensive

Widely used in operational land cover mapping.


9. Math Refresher: Multivariate Normal Distribution

Univariate Normal

1D Gaussian:

p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Parameters: Mean \mu, variance \sigma^2

68-95-99.7 rule: 68% within 1σ, 95% within 2σ, 99.7% within 3σ

Multivariate Normal

n-dimensional:

p(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{n/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)

Parameters:

- Mean vector \boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_n)
- Covariance matrix \boldsymbol{\Sigma} (n×n, symmetric, positive definite)

Covariance:

\Sigma_{ij} = \text{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)]

Diagonal elements: Variances \sigma_i^2
Off-diagonal: Covariances (measure linear relationship)

Mahalanobis distance:

D^2 = (\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})

Generalizes Euclidean distance, accounting for variance and correlation.


Summary