Regression for Continuous Variables

Predicting spatial fields from spectral and environmental features

2026-03-01

After the 2019–20 Australian Black Summer fires burned 18.6 million hectares — an area larger than Cambodia — ecologists needed rapid estimates of fire severity across landscapes too vast and too dangerous to survey on foot. The solution was regression. Spectral indices from Landsat and Sentinel-2 satellites, calibrated against 5,000 field plots where trained observers had recorded char depth, crown scorch height, and tree mortality, were used to predict a continuous fire severity score for every 10-metre pixel across the fire perimeter. The result was a map that would have taken a decade to produce in the field, delivered in weeks from a laptop.

This is the core bargain of supervised regression: a small number of expensive, precise measurements (field plots, sensor deployments, laboratory analyses) are used to train a model that can then predict the same quantity cheaply, at every point in space, from easy-to-acquire satellite observations. The relationship between predictors and the target variable is learned from data, not hardcoded from theory — which makes regression enormously flexible, and also capable of learning relationships that exist only in the training set and nowhere else.

The machinery underneath is elegant. Linear regression minimises a squared-error loss, either in closed form or by gradient descent, to find the weight vector that best fits the training data. Regularisation (L1 and L2 penalties) prevents the model from over-fitting by penalising extreme weights. Spatial cross-validation replaces standard k-fold validation, because geographic data violates independence: nearby pixels share the same weather, geology, and disturbance history, and a model that looks accurate in standard cross-validation may have simply memorised spatial autocorrelation. Getting the validation right is as important as getting the model right.

1. The Question

Given satellite imagery and a few field measurements, can we predict soil moisture across an entire watershed?

Direct measurement doesn’t scale.

Soil moisture probe: One location = $500

Watershed area: 100 km² = 1 million 10×10 m cells

Full coverage: $500 million (prohibitively expensive)

Regression provides a solution:

Measure 100 locations with probes ($50,000).

Extract satellite features (vegetation, thermal, topography).

Learn relationship: Soil moisture = f(features)

Predict all 1 million cells.

Applications:
- Digital elevation models (DEM from spectral features)
- Soil property mapping (carbon, pH, texture)
- Biomass estimation (forest carbon stocks)
- Temperature mapping (urban heat islands)
- Crop yield prediction (precision agriculture)
- Snow water equivalent (mountain hydrology)

Why regression works:

Physical relationships exist between observables and targets.

Examples:
- Higher NDVI → more biomass (vegetation stores carbon)
- Lower thermal radiance → higher soil moisture (evaporative cooling)
- Steeper terrain → shallower soils (erosion)
- Higher NIR/Red ratio → taller vegetation → a (weak) proxy for local elevation

Regression learns these patterns from data.


2. The Conceptual Model

Linear Regression

Simplest model: Linear relationship

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \epsilon

Where:
- y = target variable (e.g., elevation)
- x_1, ..., x_p = features (NDVI, slope, aspect, etc.)
- \beta_0 = intercept
- \beta_1, ..., \beta_p = coefficients (weights)
- \epsilon = error term (noise)

Matrix notation:

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}

Where:
- \mathbf{y} = n \times 1 vector of observations
- \mathbf{X} = n \times (p+1) design matrix (includes intercept column)
- \boldsymbol{\beta} = (p+1) \times 1 coefficient vector
- \boldsymbol{\epsilon} = n \times 1 error vector

Goal: Find \boldsymbol{\beta} minimizing prediction error

Loss Function

Mean Squared Error (MSE):

L(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2

Why squared error?
- Differentiable (enables gradient descent)
- Penalizes large errors more heavily (quadratic)
- Corresponds to maximum likelihood under Gaussian noise

Minimization:

Take derivative, set to zero:

\frac{\partial L}{\partial \boldsymbol{\beta}} = -\frac{2}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = 0

Closed-form solution (Normal Equation):

\boldsymbol{\beta}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
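A minimal NumPy sketch of this closed-form solution on synthetic data (the values and seed are illustrative assumptions, not from the worked example later):

import numpy as np

# Minimal sketch: solve the normal equation directly.
# X is an n x (p+1) design matrix with a leading column of ones.
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
true_beta = np.array([2.0, 1.5, -0.5, 3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# np.linalg.solve is more stable than explicitly inverting X^T X
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [2.0, 1.5, -0.5, 3.0]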

Issues with the Normal Equation:
- Requires matrix inversion (O(p^3) computation)
- Fails if \mathbf{X}^T\mathbf{X} is singular (exactly collinear features) and is numerically unstable when features are highly correlated
- Memory intensive for large datasets

Alternative: Gradient Descent

Gradient Descent

Iterative optimization:

Start with random \boldsymbol{\beta}^{(0)}

Repeat until convergence:

\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \alpha \nabla L(\boldsymbol{\beta}^{(t)})

Where:
- \alpha = learning rate (step size)
- \nabla L = gradient of the loss function

Gradient calculation:

\nabla L(\boldsymbol{\beta}) = \frac{\partial L}{\partial \boldsymbol{\beta}} = -\frac{2}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})

Algorithm:

Initialize β randomly
for epoch in 1 to max_epochs:
    predictions = X @ β
    errors = y - predictions
    gradient = -(2/n) * X.T @ errors
    β = β - learning_rate * gradient
    
    if ||gradient|| < tolerance:
        break

Advantages:
- Scales to large datasets
- No matrix inversion
- Works with regularization

Regularization

Problem: Overfitting

Complex model fits training noise.

Solution: Penalty on coefficient magnitude

Ridge Regression (L2):

L_{ridge}(\boldsymbol{\beta}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|^2

Penalizes large coefficients, shrinks toward zero.

Lasso Regression (L1):

L_{lasso}(\boldsymbol{\beta}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \sum_{j=1}^p |\beta_j|

Encourages sparsity (some coefficients exactly zero).

Elastic Net (combination):

L_{elastic}(\boldsymbol{\beta}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda_1 \|\boldsymbol{\beta}\|^2 + \lambda_2 \sum_{j=1}^p |\beta_j|

Hyperparameter \lambda:
- \lambda = 0: no regularization (ordinary least squares)
- \lambda \to \infty: all coefficients shrink toward 0
- Optimal \lambda: found via cross-validation
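A short scikit-learn sketch of the practical difference between the two penalties; the synthetic data and penalty strengths here are illustrative assumptions:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Sketch: with two nearly duplicate features, ridge spreads weight across
# both while lasso tends to zero one of them out (values are illustrative).
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # near-duplicate of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2 * x1 + 0.5 * x3 + rng.normal(scale=0.1, size=n)

print(np.round(Ridge(alpha=1.0).fit(X, y).coef_, 3))   # weight shared by x1, x2
print(np.round(Lasso(alpha=0.1).fit(X, y).coef_, 3))   # one duplicate typically zeroed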


3. Building the Mathematical Model

Deriving Gradient Descent Update

Loss function:

L(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2

Expand:

L(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^n \left[ y_i^2 - 2y_i\mathbf{x}_i^T\boldsymbol{\beta} + (\mathbf{x}_i^T\boldsymbol{\beta})^2 \right]

Gradient with respect to \boldsymbol{\beta}:

\frac{\partial L}{\partial \boldsymbol{\beta}} = \frac{1}{n}\sum_{i=1}^n \left[ -2y_i\mathbf{x}_i + 2\mathbf{x}_i(\mathbf{x}_i^T\boldsymbol{\beta}) \right]

= \frac{2}{n}\sum_{i=1}^n \mathbf{x}_i(\mathbf{x}_i^T\boldsymbol{\beta} - y_i)

Matrix form:

\nabla L = \frac{2}{n}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} - \mathbf{y})

Update rule:

\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \frac{2\alpha}{n}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta}^{(t)} - \mathbf{y})

Learning rate selection:

Too small: Slow convergence
Too large: Divergence (overshoots minimum)

Typical: \alpha = 0.01 to 0.1, or adaptive (decreasing over time)

Ridge Regression Gradient

Loss with L2 penalty:

L_{ridge}(\boldsymbol{\beta}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|^2

Gradient:

\nabla L_{ridge} = \frac{2}{n}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} - \mathbf{y}) + 2\lambda\boldsymbol{\beta}

Update:

\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \alpha[\frac{2}{n}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta}^{(t)} - \mathbf{y}) + 2\lambda\boldsymbol{\beta}^{(t)}]

Closed form (if desired):

\boldsymbol{\beta}^*_{ridge} = (\mathbf{X}^T\mathbf{X} + n\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}

Adding n\lambda\mathbf{I} to \mathbf{X}^T\mathbf{X} ensures invertibility even with correlated features.
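A minimal NumPy sketch of this closed form; the helper name and synthetic data are assumptions, and the n\lambda factor matches the 1/n MSE convention used above (in practice the intercept column is usually left unpenalised):

import numpy as np

def fit_ridge_closed_form(X, y, lam):
    # Solve (X^T X + n*lambda*I) beta = X^T y
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=100)
print(fit_ridge_closed_form(X, y, lam=0.01))   # close to [1.0, 2.0, -1.0]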

Cross-Validation

K-Fold Cross-Validation:

  1. Split data into K folds (typically K = 5 or 10)
  2. For each fold k: train on the remaining K-1 folds, then compute the error E_k on fold k
  3. Average: CV_{error} = \frac{1}{K}\sum_{k=1}^K E_k

Use for:
- Hyperparameter tuning (\lambda selection)
- Model selection (comparing algorithms)
- Performance estimation (unbiased error estimate)

Spatial Cross-Validation:

Standard CV invalid for spatial data (autocorrelation).

Solution: Spatial blocks

Divide region into spatial blocks.

Each fold = one or more blocks.

Ensures test data spatially separated from training.


4. Worked Example by Hand

Problem: Predict elevation from NDVI and slope.

Training data:

Location   NDVI   Slope   Elevation (m)
1          0.7    5°      450
2          0.4    15°     850
3          0.6    8°      600
4          0.3    20°     1100
5          0.5    12°     750

Goal: Fit linear model manually using gradient descent.

Step 1: Set up matrices

Feature matrix \mathbf{X} (with intercept):

\mathbf{X} = \begin{bmatrix} 1 & 0.7 & 5 \\ 1 & 0.4 & 15 \\ 1 & 0.6 & 8 \\ 1 & 0.3 & 20 \\ 1 & 0.5 & 12 \end{bmatrix}

Target vector \mathbf{y}:

\mathbf{y} = \begin{bmatrix} 450 \\ 850 \\ 600 \\ 1100 \\ 750 \end{bmatrix}

Step 2: Initialize coefficients

\boldsymbol{\beta}^{(0)} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} (start at origin)

Step 3: First iteration

Predictions:

\hat{\mathbf{y}}^{(0)} = \mathbf{X}\boldsymbol{\beta}^{(0)} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}

Errors:

\mathbf{y} - \hat{\mathbf{y}}^{(0)} = \begin{bmatrix} 450 \\ 850 \\ 600 \\ 1100 \\ 750 \end{bmatrix}

Gradient:

\nabla L^{(0)} = -\frac{2}{5}\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}}^{(0)})

\mathbf{X}^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 0.7 & 0.4 & 0.6 & 0.3 & 0.5 \\ 5 & 15 & 8 & 20 & 12 \end{bmatrix}

\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}}^{(0)}) = \begin{bmatrix} 3750 \\ 1720 \\ 50800 \end{bmatrix}

\nabla L^{(0)} = -\frac{2}{5}\begin{bmatrix} 3750 \\ 1720 \\ 50800 \end{bmatrix} = \begin{bmatrix} -1500 \\ -688 \\ -20320 \end{bmatrix}

Update (learning rate \alpha = 0.001):

\boldsymbol{\beta}^{(1)} = \boldsymbol{\beta}^{(0)} - 0.001 \times \begin{bmatrix} -1500 \\ -688 \\ -20320 \end{bmatrix}

= \begin{bmatrix} 1.5 \\ 0.688 \\ 20.32 \end{bmatrix}
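The same first iteration can be checked in a few lines of NumPy:

import numpy as np

# Check of the first gradient-descent iteration above.
X = np.array([[1, 0.7, 5],
              [1, 0.4, 15],
              [1, 0.6, 8],
              [1, 0.3, 20],
              [1, 0.5, 12]], dtype=float)
y = np.array([450, 850, 600, 1100, 750], dtype=float)

beta = np.zeros(3)
grad = -(2 / len(y)) * X.T @ (y - X @ beta)
print(grad)                  # [-1500., -688., -20320.]
beta = beta - 0.001 * grad
print(beta)                  # [1.5, 0.688, 20.32]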

Step 4: Second iteration

Predictions:

\hat{y}_1^{(1)} = 1.5 + 0.688(0.7) + 20.32(5) = 1.5 + 0.482 + 101.6 = 103.58

\hat{y}_2^{(1)} = 1.5 + 0.688(0.4) + 20.32(15) = 1.5 + 0.275 + 304.8 = 306.58

(Continue for all 5 samples…)

New errors, gradient, update…

Step 5: Convergence

After enough iterations (or once \|\nabla L\| falls below a tolerance such as 0.01), the updates stop changing the coefficients appreciably. In practice the features would be standardised first, since NDVI (roughly 0-1) and slope (degrees) are on very different scales. For the rest of the example, take the fitted coefficients to be approximately:

\boldsymbol{\beta}^* \approx \begin{bmatrix} 200 \\ -300 \\ 45 \end{bmatrix}

Model:

Elevation = 200 - 300 \times NDVI + 45 \times Slope

Interpretation:
- Intercept (200): the predicted elevation when NDVI and slope are both zero.
- NDVI coefficient (-300): holding slope fixed, greener cells are predicted to sit at lower elevations in this dataset.
- Slope coefficient (+45): each additional degree of slope adds about 45 m to the predicted elevation.

Step 6: Prediction

New location: NDVI = 0.5, Slope = 10°

\hat{Elevation} = 200 - 300(0.5) + 45(10) = 200 - 150 + 450 = 500 \text{ m}

Step 7: Evaluate

Test on training data (for illustration):

Actual (m)   Predicted (m)   Error (m)
450          435             15
850          820             30
600          590             10
1100         1095            5
750          740             10

RMSE: \sqrt{\frac{15^2+30^2+10^2+5^2+10^2}{5}} = \sqrt{\frac{1350}{5}} = 16.4 m

R²: (coefficient of determination)

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{1350}{245000} \approx 0.99

Good fit on training data. (Would need test set for true evaluation.)
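A quick NumPy check of these two metrics from the table values:

import numpy as np

# Verify the RMSE and R² above from the actual/predicted columns.
actual    = np.array([450, 850, 600, 1100, 750], dtype=float)
predicted = np.array([435, 820, 590, 1095, 740], dtype=float)

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
print(rmse, 1 - ss_res / ss_tot)    # ≈ 16.4, ≈ 0.99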


5. Computational Implementation

Tier 1: Foundation Path (Pi/Laptop)

Implementation from scratch (Python/NumPy):

import numpy as np

class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000, 
                 regularization='none', lambda_=0.1):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.regularization = regularization
        self.lambda_ = lambda_
        self.weights = None
        self.bias = None
        self.loss_history = []
    
    def fit(self, X, y):
        """Train linear regression via gradient descent"""
        n_samples, n_features = X.shape
        
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        # Gradient descent
        for i in range(self.n_iter):
            # Forward pass
            y_pred = self.predict(X)
            
            # Compute loss
            loss = np.mean((y - y_pred)**2)
            
            # Add regularization to loss
            if self.regularization == 'ridge':
                loss += self.lambda_ * np.sum(self.weights**2)
            elif self.regularization == 'lasso':
                loss += self.lambda_ * np.sum(np.abs(self.weights))
            
            self.loss_history.append(loss)
            
            # Compute gradients
            dw = -(2/n_samples) * X.T @ (y - y_pred)
            db = -(2/n_samples) * np.sum(y - y_pred)
            
            # Add regularization to gradients
            if self.regularization == 'ridge':
                dw += 2 * self.lambda_ * self.weights
            elif self.regularization == 'lasso':
                dw += self.lambda_ * np.sign(self.weights)
            
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            
            # Check convergence
            if i > 0 and abs(self.loss_history[-1] - self.loss_history[-2]) < 1e-6:
                print(f"Converged at iteration {i}")
                break
        
        return self
    
    def predict(self, X):
        """Make predictions"""
        return X @ self.weights + self.bias
    
    def score(self, X, y):
        """Calculate R² score"""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred)**2)
        ss_tot = np.sum((y - np.mean(y))**2)
        return 1 - (ss_res / ss_tot)


# Example usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X = np.random.randn(100, 2)  # NDVI, Slope
    true_weights = np.array([-300, 45])
    true_bias = 200
    y = X @ true_weights + true_bias + np.random.randn(100) * 20
    
    # Split train/test
    n_train = 80
    X_train, X_test = X[:n_train], X[n_train:]
    y_train, y_test = y[:n_train], y[n_train:]
    
    # Train model
    model = LinearRegression(learning_rate=0.01, n_iterations=1000,
                            regularization='ridge', lambda_=0.1)
    model.fit(X_train, y_train)
    
    # Evaluate
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    
    print(f"Learned weights: {model.weights}")
    print(f"Learned bias: {model.bias:.2f}")
    print(f"Train R²: {train_score:.3f}")
    print(f"Test R²: {test_score:.3f}")
    
    # Predictions
    y_pred = model.predict(X_test)
    rmse = np.sqrt(np.mean((y_test - y_pred)**2))
    print(f"Test RMSE: {rmse:.2f}")

Runtime: Training <1 second for 100 samples, 2 features

Memory: Minimal (~1 KB for this dataset)


Tier 2: Professional Path (Go Further)

Scikit-learn with cross-validation:

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Placeholder dataset: in practice X_large / y_large would hold extracted
# raster features and field measurements (e.g. 10,000 samples, 20 features).
# Synthetic data keeps the example self-contained and runnable.
rng = np.random.default_rng(42)
X_large = rng.normal(size=(10_000, 20))
y_large = X_large @ rng.normal(size=20) + rng.normal(scale=5.0, size=10_000)

X_train, X_test, y_train, y_test = train_test_split(
    X_large, y_large, test_size=0.2, random_state=42)

# Set up cross-validation
# (For spatial data, replace this with spatial blocks; see Section 8)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Hyperparameter grid
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
}

# Grid search for ridge regression (Lasso or ElasticNet can be swapped in)
ridge = Ridge()
grid_search = GridSearchCV(ridge, param_grid, cv=cv,
                           scoring='neg_mean_squared_error',
                           n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best lambda: {grid_search.best_params_['alpha']}")
print(f"Best CV RMSE: {np.sqrt(-grid_search.best_score_):.2f}")

# Evaluate the refit best model on the held-out test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"Test RMSE: {rmse:.2f}")
print(f"Test R²: {r2:.3f}")

# Feature importance (coefficient magnitudes)
importance = np.abs(best_model.coef_)
for i, imp in enumerate(importance):
    print(f"Feature {i}: {imp:.3f}")

Runtime: 10-30 seconds for 10k samples with grid search

For larger datasets or non-linear relationships: gradient-boosted trees (XGBoost, LightGBM), which can optionally run on a GPU


6. Interpretation

Real Application: Digital Soil Mapping

GlobalSoilMap project:

Predict soil properties (carbon, pH, texture) globally at 90m resolution.

Training data:
- ~50,000 soil profile observations worldwide
- Soil databases (ISRIC, USDA)

Features:
- Landsat spectral bands
- Elevation, slope, curvature
- Climate (temperature, precipitation)
- Land cover
- Parent material geology

Target: Soil organic carbon (SOC) at 0-30 cm depth

Model: Random forest regression (ensemble of decision trees)

Performance:
- RMSE: 10-15 g C/kg soil
- R²: 0.40-0.60 (moderate; natural variability is high)

Insights from the fitted relationships:

Positive predictors (higher SOC):
- Higher precipitation (+)
- Lower temperature (+) [decomposition slower when cold]
- Forest land cover (+)
- Higher clay content (+) [protects carbon]

Negative predictors (lower SOC):
- Higher elevation (-) [thin soils]
- Steeper slopes (-) [erosion]
- Cropland (-) [tillage oxidizes carbon]

Residual Spatial Autocorrelation

Problem:

Regression residuals (errors) spatially clustered.

Indicates missing spatial structure in model.

Moran’s I test:

I = \frac{n}{W}\frac{\sum_i\sum_j w_{ij}(e_i - \bar{e})(e_j - \bar{e})}{\sum_i (e_i - \bar{e})^2}

Where:
- e_i = residual at location i
- w_{ij} = spatial weights (1 if neighbors, 0 otherwise)
- W = sum of all weights

I > 0: Positive autocorrelation (clustered errors)
I \approx 0: No autocorrelation
I < 0: Negative autocorrelation (dispersed errors)
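A minimal sketch of this statistic in NumPy, assuming residuals and point coordinates are available and using simple binary distance-based weights (the threshold d and the grid are illustrative):

import numpy as np

def morans_i(residuals, coords, d=1.5):
    # Moran's I on residuals with binary "within distance d" weights.
    e = residuals - residuals.mean()
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = ((dist > 0) & (dist <= d)).astype(float)   # w_ij = 1 if neighbours
    return (len(e) / w.sum()) * (e @ w @ e) / np.sum(e ** 2)

# Hypothetical usage: residuals at plot locations on a 5 x 5 grid
rng = np.random.default_rng(3)
coords = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
print(morans_i(rng.normal(size=25), coords))   # near 0 for spatially random errors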

Solutions:
  1. Add spatial features (coordinates, distance to…)
  2. Use spatial regression models (SAR, CAR)
  3. Combine regression predictions with kriged residuals (regression kriging)

Prediction Uncertainty

Point predictions insufficient.

Need: Prediction intervals.

Bootstrap method:

  1. Resample training data with replacement (1000 times)
  2. Train model on each bootstrap sample
  3. Predict test point with all 1000 models
  4. Calculate 95% interval from predictions

Example:

Predicted elevation: 750m
95% interval: [720m, 780m]
Uncertainty: ±30m

Wider intervals when:
- Extrapolating beyond the training data
- High noise in the training data
- Few training samples near the test location
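A compact sketch of the bootstrap procedure above; the training data, the Ridge model, and x_new are synthetic placeholders, and this version captures model uncertainty only (adding resampled residuals would widen it to a full prediction interval):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([5.0, -2.0, 1.0]) + rng.normal(scale=2.0, size=100)
x_new = rng.normal(size=(1, 3))

preds = []
for _ in range(1000):
    idx = rng.integers(0, len(y_train), size=len(y_train))   # resample with replacement
    model = Ridge(alpha=1.0).fit(X_train[idx], y_train[idx])
    preds.append(model.predict(x_new)[0])

lo, hi = np.percentile(preds, [2.5, 97.5])
print(f"95% interval: [{lo:.1f}, {hi:.1f}]")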


7. What Could Go Wrong?

Multicollinearity

Problem:

Features highly correlated (e.g., NDVI and NIR: r = 0.95).

Coefficient estimates unstable, variance inflated.

Symptoms:
- Coefficients change dramatically with small data changes
- Large standard errors
- Counterintuitive coefficient signs

Solutions:
- Ridge regression (the L2 penalty stabilizes estimates)
- Remove correlated features (keep one from each pair)
- PCA to decorrelate features (Model 73)
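One common diagnostic is the variance inflation factor (VIF); a rough sketch, where the helper function and the synthetic data are assumptions:

import numpy as np

def vif(X):
    # VIF per feature, computed by regressing each feature on the others;
    # values above roughly 5-10 are a common rule of thumb for collinearity.
    out = []
    for j in range(X.shape[1]):
        A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        r2 = 1 - np.var(X[:, j] - A @ beta) / np.var(X[:, j])
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)   # make two features highly correlated
print(np.round(vif(X), 1))                       # large VIFs for the correlated pair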

Extrapolation

Problem:

Predict outside range of training data.

Linear model extrapolates to absurdity.

Example:

Training elevation: 200-1000m
Test location: 1500m (outside range)
Prediction: May be wildly inaccurate

Solutions:
- Flag predictions outside the training range
- Apply domain-knowledge constraints (e.g., clip to physically plausible values)
- Prefer models with bounded extrapolation behaviour (tree-based models predict a constant outside the training range; unconstrained polynomials and splines can diverge)
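A minimal sketch of the first point (flagging out-of-range predictions) using a simple per-feature bounding-box check; the function name and values are illustrative:

import numpy as np

def extrapolation_mask(X_train, X_new):
    # True where any feature of a new sample falls outside the training range.
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return np.any((X_new < lo) | (X_new > hi), axis=1)

X_train = np.array([[0.3, 5.0], [0.7, 20.0]])     # training NDVI and slope ranges
X_new = np.array([[0.5, 10.0], [0.9, 25.0]])      # second row lies outside the range
print(extrapolation_mask(X_train, X_new))          # [False  True]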

Non-linear Relationships

Problem:

True relationship non-linear, linear model fits poorly.

Example:

Temperature vs elevation: T = T_0 - \Gamma \times elevation

Linear works.

Soil moisture vs precipitation: Saturates at high rainfall.

Linear fails (needs logistic or power transform).

Solutions:
- Feature engineering (polynomial terms, interactions)
- Non-linear models (decision trees, neural networks)
- Transform the target (log, sqrt)

Heteroscedasticity

Problem:

Error variance not constant.

Example:

High elevation: ±50m error
Low elevation: ±10m error

Violates regression assumptions.

Detection:

Plot residuals vs predictions. Funnel shape = heteroscedasticity.

Solutions:
- Transform the target (log, Box-Cox)
- Weighted least squares
- Robust regression (Huber loss)
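A small sketch of the residual-vs-prediction diagnostic, using synthetic predictions whose noise grows with magnitude (a funnel shape in the plot suggests heteroscedasticity):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
y_pred = np.linspace(200, 1000, 300)
y_test = y_pred + rng.normal(scale=0.05 * y_pred)   # noise grows with magnitude

plt.scatter(y_pred, y_test - y_pred, s=8)
plt.axhline(0, color="k", lw=1)
plt.xlabel("Predicted elevation (m)")
plt.ylabel("Residual (m)")
plt.show()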


8. Extension: Spatial Cross-Validation

Why Standard CV Fails

Spatial autocorrelation:

Nearby locations similar (Tobler’s Law).

Standard random CV: Test points near training points.

Overestimates accuracy (information leakage).

Spatial Block CV

Method:

  1. Divide region into spatial blocks (e.g., 10×10 grid)
  2. Each fold = contiguous block(s)
  3. Train on distant blocks, test on held-out block

Effect:

Test data truly independent (spatially).

Realistic accuracy estimate.

Typical result:

Standard CV: R² = 0.85
Spatial CV: R² = 0.65

Difference reveals spatial leakage.
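A minimal sketch of spatial block cross-validation using scikit-learn's GroupKFold, assuming point coordinates are available for every sample (the coordinates, features, and block size here are synthetic and illustrative):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score

# Assign each sample to a coarse grid cell from its coordinates, then use
# the cell id as a CV group so training and test folds never share a block.
rng = np.random.default_rng(7)
n = 2000
coords = rng.uniform(0, 100, size=(n, 2))          # x, y in km
X = rng.normal(size=(n, 5))
y = X @ rng.normal(size=5) + rng.normal(size=n)

blocks = (coords // 10.0).astype(int)               # 10 km x 10 km blocks
group_id = blocks[:, 0] * 1000 + blocks[:, 1]       # one id per block

scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         groups=group_id, cv=GroupKFold(n_splits=5),
                         scoring="r2")
print(scores.mean())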


9. Math Refresher: Matrix Calculus

Gradient of Quadratic Form

Function:

f(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x}

Gradient:

\nabla f = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}

If \mathbf{A} symmetric: \nabla f = 2\mathbf{A}\mathbf{x}

Example:

L(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})

Expand:

= \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^T\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}

Gradient:

\nabla L = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}
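A quick numerical check of the quadratic-form gradient identity, comparing central finite differences against 2\mathbf{A}\mathbf{x} for a random symmetric matrix:

import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(4, 4))
A = (A + A.T) / 2                       # make A symmetric
x = rng.normal(size=4)

analytic = 2 * A @ x
eps = 1e-6
numeric = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
print(np.allclose(analytic, numeric, atol=1e-4))    # True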


Summary