modelling Level 3

Spatial Joins and Proximity Queries

Which census tract contains each address? What's the nearest fire station to each school? Spatial joins link datasets by location rather than shared IDs. This model derives spatial join algorithms, introduces spatial indexing with R-trees and k-d trees, and implements nearest-neighbor queries efficiently.

Prerequisites: spatial indexing, nearest neighbor, k d trees, distance queries

27 Feb 2026 Updated 12 Mar 2026 15 min read

1. The Question

Which hospital is closest to each emergency call location?

Traditional database joins use attribute matching:

SELECT * FROM customers
JOIN orders ON customers.id = orders.customer_id

Spatial joins use location:

SELECT * FROM addresses
JOIN census_tracts ON ST_Contains(census_tracts.geom, addresses.point)

Key operations:

Contains: Which polygon contains each point?
Intersects: Which features overlap?
Within distance: Which features are near each other?
Nearest neighbor: What’s the closest feature?

The mathematical question: How do we efficiently find spatial relationships between thousands or millions of features?

The challenge: Naive approach tests every pair → $O(n \times m)$ complexity → too slow!

2. The Conceptual Model

Types of Spatial Joins

1. Point-in-Polygon Join

Assign each point to its containing polygon.

Example: Geocode addresses to census tracts

Input A: 100,000 address points
Input B: 5,000 census tract polygons
Output: Each address tagged with its tract ID

2. Intersection Join

Find all pairs of features that overlap.

Example: Parcels affected by flood zones

Input A: 50,000 property parcels
Input B: 200 flood zone polygons
Output: Parcels that intersect any flood zone

3. Distance Join

Find all pairs within distance threshold.

Example: Homes within 1km of schools

Input A: 200,000 homes
Input B: 500 schools
Output: Home-school pairs where distance ≤ 1000m

4. Nearest Neighbor Join

For each feature in A, find closest feature in B.

Example: Nearest hospital to each emergency call

Input A: 10,000 emergency calls
Input B: 50 hospitals
Output: Each call assigned to nearest hospital

3. Building the Mathematical Model

Naive Algorithm (Too Slow)

Point-in-polygon join:

for each point in A:
    for each polygon in B:
        if pointInPolygon(point, polygon):
            assign point to polygon
            break

Complexity: $O(n \times m)$ where $n =

$, $m =

Example: 100,000 points × 5,000 polygons = 500 million tests!

At 1 microsecond per test: 500 seconds = 8.3 minutes

Problem: Unacceptably slow for large datasets.

Spatial Indexing: R-tree

Idea: Organize spatial data into hierarchical bounding boxes.

R-tree structure:

Each node: Minimum Bounding Rectangle (MBR) containing child nodes
Leaf nodes: Actual geometries
Internal nodes: MBRs that bound groups of features

Minimum Bounding Rectangle (MBR):

For geometry with vertices $(x_i, y_i)$:

\[\text{MBR} = [\min(x_i), \min(y_i), \max(x_i), \max(y_i)]\]

Or: $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$

MBR intersection test (fast):

Two MBRs $R_1 = (x_{\min}^1, y_{\min}^1, x_{\max}^1, y_{\max}^1)$ and $R_2 = (x_{\min}^2, y_{\min}^2, x_{\max}^2, y_{\max}^2)$ intersect if:

\[x_{\min}^1 \leq x_{\max}^2 \text{ AND } x_{\max}^1 \geq x_{\min}^2\] \[y_{\min}^1 \leq y_{\max}^2 \text{ AND } y_{\max}^1 \geq y_{\min}^2\]

Four comparisons vs. expensive polygon-polygon intersection.

Join algorithm with R-tree:

Build R-tree index for dataset B
for each feature a in A:
    query R-tree with a's MBR
    for each candidate b returned:
        if actualIntersection(a, b):
            output (a, b)

Complexity: $O(n \log m)$ average case (much better!)

Speedup: 100,000 × log₂(5000) ≈ 1.2 million tests vs. 500 million → 400× faster

K-d Tree for Nearest Neighbor

Problem: Find nearest point in set B to query point $q$.

K-d tree: Binary tree that partitions k-dimensional space.

For 2D (k=2):

Root level: split by x-coordinate (median)
Level 1: split by y-coordinate
Level 2: split by x-coordinate
Alternates…

Construction:

function buildKDTree(points, depth):
    if points is empty: return null
    
    axis = depth mod 2  # 0 = x, 1 = y
    median = selectMedian(points, axis)
    
    node.point = median
    node.left = buildKDTree(points < median, depth+1)
    node.right = buildKDTree(points > median, depth+1)
    
    return node

Nearest neighbor search:

function nearestNeighbor(tree, query):
    best = null
    bestDist = infinity
    
    function search(node, depth):
        if node is null: return
        
        dist = distance(query, node.point)
        if dist < bestDist:
            best = node.point
            bestDist = dist
        
        axis = depth mod 2
        diff = query[axis] - node.point[axis]
        
        # Search near side first
        if diff < 0:
            search(node.left, depth+1)
            if abs(diff) < bestDist:
                search(node.right, depth+1)
        else:
            search(node.right, depth+1)
            if abs(diff) < bestDist:
                search(node.left, depth+1)
    
    search(tree, 0)
    return best

Complexity: $O(\log n)$ average case → huge speedup for large datasets!

Example: Finding nearest hospital among 1 million hospitals:

Linear scan: 1 million distance calculations
K-d tree: ~20 distance calculations (log₂(1,000,000) ≈ 20)
50,000× faster!

4. Worked Example by Hand

Problem: Find which census tract contains each address.

Addresses (points):

A1: (2, 3)
A2: (7, 8)
A3: (5, 2)

Census tracts (polygons):

T1: Square from (0,0) to (5,5)
T2: Square from (5,0) to (10,5)
T3: Square from (0,5) to (5,10)
T4: Square from (5,5) to (10,10)

Solution

Step 1: Build spatial index (manually)

MBRs for tracts:

T1: [0, 0, 5, 5]
T2: [5, 0, 10, 5]
T3: [0, 5, 5, 10]
T4: [5, 5, 10, 10]

Step 2: Test each address

A1 = (2, 3):

Candidates (MBR test):

T1: 0 ≤ 2 ≤ 5 and 0 ≤ 3 ≤ 5 → Yes
T2: 5 ≤ 2 ≤ 10? No (fails x test)
T3: 0 ≤ 2 ≤ 5 and 5 ≤ 3 ≤ 10? No (fails y test)
T4: No

Point-in-polygon test for T1:

(2,3) inside square [0,0,5,5]? Yes

Result: A1 → T1

A2 = (7, 8):

Candidates:

T1: No (x > 5)
T2: No (y > 5)
T3: No (x > 5)
T4: 5 ≤ 7 ≤ 10 and 5 ≤ 8 ≤ 10 → Yes

Point-in-polygon for T4: Yes

Result: A2 → T4

A3 = (5, 2):

Candidates:

T1: 0 ≤ 5 ≤ 5 and 0 ≤ 2 ≤ 5 → Yes
T2: 5 ≤ 5 ≤ 10 and 0 ≤ 2 ≤ 5 → Yes

Edge case! Point is exactly on boundary between T1 and T2.

Point-in-polygon tests:

T1: (5,2) on right edge → convention: Not inside (use < not ≤)
T2: (5,2) on left edge → convention: Inside (use ≤)

Result: A3 → T2

Final join results:

A1 (2,3) → T1
A2 (7,8) → T4
A3 (5,2) → T2

5. Computational Implementation

Below is an interactive spatial join demonstration.

Join type: Distance threshold (for distance join): 80 px Show spatial index:

Query features: --

Target features: --

Join results: --

Tests performed: --

Try this:

Point-in-polygon: Green polygons contain red points
Nearest neighbor: Blue lines connect each point to nearest polygon
Within distance: Purple circles show search radius, lines show matches
Show spatial index: See MBR bounding boxes (dashed)
Adjust distance: Watch connections appear/disappear
Notice: Test count stays low because spatial index filters candidates

Key insight: Spatial indexing makes spatial joins practical—without it, large datasets would be impossibly slow.

6. Interpretation

Real-World Performance

Without spatial index:

1 million addresses × 50,000 census tracts = 50 billion tests
At 1 μs per test: 50,000 seconds = 13.9 hours

With R-tree index:

1 million addresses × log₂(50,000) ≈ 16 million tests
At 1 μs per test: 16 seconds

Speedup: 3,000×

This is why GIS software creates spatial indexes automatically.

Geocoding

Address geocoding = spatial join:

Standardize address: "123 Main St" → structured fields
Fuzzy match to street centerline database
Interpolate position along street segment
Join to census geography to get tract/block

Spatial join: Point (geocoded address) → Polygon (census unit)

Service Area Analysis

Problem: Assign customers to service territories.

Solution: Nearest facility join

Customers = points
Facilities = points (stores, fire stations, hospitals)
Join type: Nearest neighbor

Result: Voronoi diagram (each customer assigned to nearest facility)

Demographic Enrichment

Add demographic data to customer records:

customers (lat, lon, name, ...)
JOIN census_tracts (polygon, population, income, ...)
WHERE ST_Contains(census_tracts.polygon, customers.point)

Result: customers enriched with (population, income, ...)

Business intelligence: Targeted marketing based on neighborhood demographics.

7. What Could Go Wrong?

Boundary Effects

Point exactly on polygon boundary:

May be assigned to either polygon (undefined).

Solution: Convention (use < or ≤ consistently) or assign to nearest centroid.

Multiple polygons contain point:

Overlapping polygons (e.g., political vs. statistical boundaries).

Solution:

Return all matches (one-to-many join)
Or prioritize by attribute (e.g., smallest polygon)

Performance Degradation

Clustered data:

If all features are in one spatial region, R-tree becomes unbalanced.

Example: All points in one city, but index covers entire country.

Solution: Adjust R-tree parameters (min/max entries per node).

High join selectivity:

If every point matches many polygons (within-distance with large threshold).

Result: Index provides little filtering → back to $O(n \times m)$ performance.

Coordinate System Mismatch

Features in different projections:

Points in WGS84 (lat/lon)
Polygons in UTM Zone 10 (meters)

Spatial operations fail (comparing apples to oranges).

Solution: Reproject to common coordinate system before joining.

Topology Errors

Invalid polygons:

Self-intersecting boundaries
Holes not properly encoded
Duplicate vertices

Result: Point-in-polygon tests give wrong results.

Solution: Validate and repair geometries before join.

8. Extension: Spatial Databases

PostGIS extends PostgreSQL with spatial types and operations:

-- Create spatial index
CREATE INDEX parcels_geom_idx 
ON parcels USING GIST (geom);

-- Point-in-polygon join
SELECT p.address, c.name
FROM parcels p
JOIN census_tracts c ON ST_Contains(c.geom, p.geom);

-- Within distance join
SELECT h.id, s.name
FROM homes h
JOIN schools s ON ST_DWithin(h.geom, s.geom, 1000);

-- Nearest neighbor (PostgreSQL 9.5+)
SELECT DISTINCT ON (h.id)
    h.id, s.name, ST_Distance(h.geom, s.geom) as dist
FROM homes h
CROSS JOIN LATERAL (
    SELECT * FROM schools s
    ORDER BY h.geom <-> s.geom
    LIMIT 1
) s;

GIST index = R-tree implementation in PostgreSQL.

Performance: Handles millions of features efficiently.

9. Math Refresher: Spatial Indexing

Bounding Box Intersection

Two rectangles intersect if they overlap in both dimensions.

Rectangle 1: $[x_1^{\min}, x_1^{\max}] \times [y_1^{\min}, y_1^{\max}]$
Rectangle 2: $[x_2^{\min}, x_2^{\max}] \times [y_2^{\min}, y_2^{\max}]$

X-overlap:

\[x_1^{\max} \geq x_2^{\min} \text{ AND } x_1^{\min} \leq x_2^{\max}\]

Y-overlap:

\[y_1^{\max} \geq y_2^{\min} \text{ AND } y_1^{\min} \leq y_2^{\max}\]

Intersect: Both overlap

Disjoint: Either doesn’t overlap

Binary Search Trees

Property: For each node with value $v$:

Left subtree: all values < $v$
Right subtree: all values ≥ $v$

Search complexity: $O(\log n)$ for balanced tree

K-d tree extends this to k dimensions:

Each level splits on one dimension
Alternates dimensions: x, y, x, y, …

Median selection ensures balance.

Voronoi Diagrams

Definition: Partition of space into regions, each containing all points closest to one site.

Voronoi cell for site $s_i$:

\[V(s_i) = \{p : d(p, s_i) \leq d(p, s_j) \text{ for all } j \neq i\}\]

Construction: Nearest neighbor join creates Voronoi assignment.

Applications:

Service territories
Catchment areas
Thiessen polygons (rainfall interpolation)

Summary

Spatial joins link datasets by location, not by attribute matching
Types: Point-in-polygon, intersection, within distance, nearest neighbor
Naive algorithm ($O(n \times m)$) too slow for large datasets
R-tree spatial index filters candidates using MBR tests → $O(n \log m)$
K-d tree enables efficient nearest neighbor search → $O(\log n)$
Applications: Geocoding, service area analysis, demographic enrichment
Challenges: Boundary effects, coordinate system mismatch, topology errors
Spatial databases (PostGIS) handle millions of features efficiently
Bounding box tests much faster than full geometry operations