Linking datasets by location — the foundation of spatial analysis
2026-02-27
When epidemiologists investigated the outbreak of cholera in London’s Soho district in 1854, the key analytical step was linking two datasets that had no shared identifier: the locations of cholera deaths and the locations of water pumps. John Snow’s famous dot map accomplished this visually — placing dots where people died and observing their spatial clustering around the Broad Street pump. A modern GIS would do it computationally, joining each death location to the nearest pump (a nearest-neighbour spatial join) and computing summary statistics by pump service area. Snow’s map is celebrated as the birth of epidemiology; the underlying operation is a spatial join.
A spatial join links records from two datasets based on their geographic relationship rather than a shared attribute. Traditional database joins match on a key — customer ID, parcel number, census tract code. Spatial joins match on location: this point is within that polygon, this address is within 500 metres of that facility, this census tract shares a boundary with those neighbouring tracts. The spatial predicates (contains, intersects, within, nearest) map to different geometric tests, each with different computational cost. Naive implementations test every pair of features — O(n²) for n features on each side — which becomes unworkable for large datasets. Spatial indexing structures, particularly R-trees and k-d trees, reduce this to O(n log n) by allowing fast elimination of features that cannot possibly match. This model derives the geometric tests, builds the most common spatial join types, and introduces spatial indexing.
Which hospital is closest to each emergency call location?
Traditional database joins use attribute matching:
SELECT * FROM customers
JOIN orders ON customers.id = orders.customer_idSpatial joins use location:
SELECT * FROM addresses
JOIN census_tracts ON ST_Contains(census_tracts.geom, addresses.point)Key operations: - Contains: Which polygon contains each point? - Intersects: Which features overlap? - Within distance: Which features are near each other? - Nearest neighbor: What’s the closest feature?
The mathematical question: How do we efficiently find spatial relationships between thousands or millions of features?
The challenge: Naive approach tests every pair → O(n \times m) complexity → too slow!
1. Point-in-Polygon Join
Assign each point to its containing polygon.
Example: Geocode addresses to census tracts - Input A: 100,000 address points - Input B: 5,000 census tract polygons - Output: Each address tagged with its tract ID
2. Intersection Join
Find all pairs of features that overlap.
Example: Parcels affected by flood zones - Input A: 50,000 property parcels - Input B: 200 flood zone polygons - Output: Parcels that intersect any flood zone
3. Distance Join
Find all pairs within distance threshold.
Example: Homes within 1km of schools - Input A: 200,000 homes - Input B: 500 schools - Output: Home-school pairs where distance ≤ 1000m
4. Nearest Neighbor Join
For each feature in A, find closest feature in B.
Example: Nearest hospital to each emergency call - Input A: 10,000 emergency calls - Input B: 50 hospitals - Output: Each call assigned to nearest hospital
Point-in-polygon join:
for each point in A:
for each polygon in B:
if pointInPolygon(point, polygon):
assign point to polygon
break
Complexity: O(n \times m) where n = |A|, m = |B|
Example: 100,000 points × 5,000 polygons = 500 million tests!
At 1 microsecond per test: 500 seconds = 8.3 minutes
Problem: Unacceptably slow for large datasets.
Idea: Organize spatial data into hierarchical bounding boxes.
R-tree structure: - Each node: Minimum Bounding Rectangle (MBR) containing child nodes - Leaf nodes: Actual geometries - Internal nodes: MBRs that bound groups of features
Minimum Bounding Rectangle (MBR):
For geometry with vertices (x_i, y_i):
\text{MBR} = [\min(x_i), \min(y_i), \max(x_i), \max(y_i)]
Or: (x_{\min}, y_{\min}, x_{\max}, y_{\max})
MBR intersection test (fast):
Two MBRs R_1 = (x_{\min}^1, y_{\min}^1, x_{\max}^1, y_{\max}^1) and R_2 = (x_{\min}^2, y_{\min}^2, x_{\max}^2, y_{\max}^2) intersect if:
x_{\min}^1 \leq x_{\max}^2 \text{ AND } x_{\max}^1 \geq x_{\min}^2
y_{\min}^1 \leq y_{\max}^2 \text{ AND } y_{\max}^1 \geq y_{\min}^2
Four comparisons vs. expensive polygon-polygon intersection.
Join algorithm with R-tree:
1. Build R-tree index for dataset B
2. for each feature a in A:
3. query R-tree with a's MBR
4. for each candidate b returned:
5. if actualIntersection(a, b):
6. output (a, b)
Complexity: O(n \log m) average case (much better!)
Speedup: 100,000 × log₂(5000) ≈ 1.2 million tests vs. 500 million → 400× faster
Problem: Find nearest point in set B to query point q.
K-d tree: Binary tree that partitions k-dimensional space.
For 2D (k=2): - Root level: split by x-coordinate (median) - Level 1: split by y-coordinate - Level 2: split by x-coordinate - Alternates…
Construction:
function buildKDTree(points, depth):
if points is empty: return null
axis = depth mod 2 # 0 = x, 1 = y
median = selectMedian(points, axis)
node.point = median
node.left = buildKDTree(points < median, depth+1)
node.right = buildKDTree(points > median, depth+1)
return node
Nearest neighbor search:
function nearestNeighbor(tree, query):
best = null
bestDist = infinity
function search(node, depth):
if node is null: return
dist = distance(query, node.point)
if dist < bestDist:
best = node.point
bestDist = dist
axis = depth mod 2
diff = query[axis] - node.point[axis]
# Search near side first
if diff < 0:
search(node.left, depth+1)
if abs(diff) < bestDist:
search(node.right, depth+1)
else:
search(node.right, depth+1)
if abs(diff) < bestDist:
search(node.left, depth+1)
search(tree, 0)
return best
Complexity: O(\log n) average case → huge speedup for large datasets!
Example: Finding nearest hospital among 1 million hospitals: - Linear scan: 1 million distance calculations - K-d tree: ~20 distance calculations (log₂(1,000,000) ≈ 20) - 50,000× faster!
Problem: Find which census tract contains each address.
Addresses (points): - A1: (2, 3) - A2: (7, 8) - A3: (5, 2)
Census tracts (polygons): - T1: Square from (0,0) to (5,5) - T2: Square from (5,0) to (10,5) - T3: Square from (0,5) to (5,10) - T4: Square from (5,5) to (10,10)
Step 1: Build spatial index (manually)
MBRs for tracts: - T1: [0, 0, 5, 5] - T2: [5, 0, 10, 5] - T3: [0, 5, 5, 10] - T4: [5, 5, 10, 10]
Step 2: Test each address
A1 = (2, 3):
Candidates (MBR test): - T1: 0 ≤ 2 ≤ 5 and 0 ≤ 3 ≤ 5 → Yes - T2: 5 ≤ 2 ≤ 10? No (fails x test) - T3: 0 ≤ 2 ≤ 5 and 5 ≤ 3 ≤ 10? No (fails y test) - T4: No
Point-in-polygon test for T1: - (2,3) inside square [0,0,5,5]? Yes
Result: A1 → T1
A2 = (7, 8):
Candidates: - T1: No (x > 5) - T2: No (y > 5) - T3: No (x > 5) - T4: 5 ≤ 7 ≤ 10 and 5 ≤ 8 ≤ 10 → Yes
Point-in-polygon for T4: Yes
Result: A2 → T4
A3 = (5, 2):
Candidates: - T1: 0 ≤ 5 ≤ 5 and 0 ≤ 2 ≤ 5 → Yes - T2: 5 ≤ 5 ≤ 10 and 0 ≤ 2 ≤ 5 → Yes
Edge case! Point is exactly on boundary between T1 and T2.
Point-in-polygon tests: - T1: (5,2) on right edge → convention: Not inside (use < not ≤) - T2: (5,2) on left edge → convention: Inside (use ≤)
Result: A3 → T2
Final join results: - A1 (2,3) → T1 - A2 (7,8) → T4 - A3 (5,2) → T2
Below is an interactive spatial join demonstration.
<label>
Join type:
<select id="join-type">
<option value="point-in-polygon" selected>Point-in-Polygon</option>
<option value="nearest-neighbor">Nearest Neighbor</option>
<option value="within-distance">Within Distance</option>
</select>
</label>
<label>
Distance threshold (for distance join):
<input type="range" id="dist-threshold" min="20" max="150" step="10" value="80">
<span id="dist-value">80</span> px
</label>
<label>
Show spatial index:
<input type="checkbox" id="show-index">
</label>
<div class="join-info">
<p><strong>Query features:</strong> <span id="query-count">--</span></p>
<p><strong>Target features:</strong> <span id="target-count">--</span></p>
<p><strong>Join results:</strong> <span id="result-count">--</span></p>
<p><strong>Tests performed:</strong> <span id="test-count">--</span></p>
</div>
<canvas id="join-canvas" width="600" height="500" style="border: 1px solid #ddd;"></canvas>
Try this: - Point-in-polygon: Green polygons contain red points - Nearest neighbor: Blue lines connect each point to nearest polygon - Within distance: Purple circles show search radius, lines show matches - Show spatial index: See MBR bounding boxes (dashed) - Adjust distance: Watch connections appear/disappear - Notice: Test count stays low because spatial index filters candidates
Key insight: Spatial indexing makes spatial joins practical—without it, large datasets would be impossibly slow.
Without spatial index: - 1 million addresses × 50,000 census tracts = 50 billion tests - At 1 μs per test: 50,000 seconds = 13.9 hours
With R-tree index: - 1 million addresses × log₂(50,000) ≈ 16 million tests - At 1 μs per test: 16 seconds
Speedup: 3,000×
This is why GIS software creates spatial indexes automatically.
Address geocoding = spatial join:
1. Standardize address: "123 Main St" → structured fields
2. Fuzzy match to street centerline database
3. Interpolate position along street segment
4. Join to census geography to get tract/block
Spatial join: Point (geocoded address) → Polygon (census unit)
Problem: Assign customers to service territories.
Solution: Nearest facility join - Customers = points - Facilities = points (stores, fire stations, hospitals) - Join type: Nearest neighbor
Result: Voronoi diagram (each customer assigned to nearest facility)
Add demographic data to customer records:
customers (lat, lon, name, ...)
JOIN census_tracts (polygon, population, income, ...)
WHERE ST_Contains(census_tracts.polygon, customers.point)
Result: customers enriched with (population, income, ...)
Business intelligence: Targeted marketing based on neighborhood demographics.
Point exactly on polygon boundary:
May be assigned to either polygon (undefined).
Solution: Convention (use < or ≤ consistently) or assign to nearest centroid.
Multiple polygons contain point:
Overlapping polygons (e.g., political vs. statistical boundaries).
Solution: - Return all matches (one-to-many join) - Or prioritize by attribute (e.g., smallest polygon)
Clustered data:
If all features are in one spatial region, R-tree becomes unbalanced.
Example: All points in one city, but index covers entire country.
Solution: Adjust R-tree parameters (min/max entries per node).
High join selectivity:
If every point matches many polygons (within-distance with large threshold).
Result: Index provides little filtering → back to O(n \times m) performance.
Features in different projections:
Points in WGS84 (lat/lon)
Polygons in UTM Zone 10 (meters)
Spatial operations fail (comparing apples to oranges).
Solution: Reproject to common coordinate system before joining.
Invalid polygons: - Self-intersecting boundaries - Holes not properly encoded - Duplicate vertices
Result: Point-in-polygon tests give wrong results.
Solution: Validate and repair geometries before join.
PostGIS extends PostgreSQL with spatial types and operations:
-- Create spatial index
CREATE INDEX parcels_geom_idx
ON parcels USING GIST (geom);
-- Point-in-polygon join
SELECT p.address, c.name
FROM parcels p
JOIN census_tracts c ON ST_Contains(c.geom, p.geom);
-- Within distance join
SELECT h.id, s.name
FROM homes h
JOIN schools s ON ST_DWithin(h.geom, s.geom, 1000);
-- Nearest neighbor (PostgreSQL 9.5+)
SELECT DISTINCT ON (h.id)
h.id, s.name, ST_Distance(h.geom, s.geom) as dist
FROM homes h
CROSS JOIN LATERAL (
SELECT * FROM schools s
ORDER BY h.geom <-> s.geom
LIMIT 1
) s;GIST index = R-tree implementation in PostgreSQL.
Performance: Handles millions of features efficiently.
Two rectangles intersect if they overlap in both dimensions.
Rectangle 1: [x_1^{\min}, x_1^{\max}] \times [y_1^{\min}, y_1^{\max}]
Rectangle 2: [x_2^{\min}, x_2^{\max}] \times [y_2^{\min}, y_2^{\max}]
X-overlap:
x_1^{\max} \geq x_2^{\min} \text{ AND } x_1^{\min} \leq x_2^{\max}
Y-overlap:
y_1^{\max} \geq y_2^{\min} \text{ AND } y_1^{\min} \leq y_2^{\max}
Intersect: Both overlap
Disjoint: Either doesn’t overlap
Property: For each node with value v: - Left subtree: all values < v - Right subtree: all values ≥ v
Search complexity: O(\log n) for balanced tree
K-d tree extends this to k dimensions: - Each level splits on one dimension - Alternates dimensions: x, y, x, y, …
Median selection ensures balance.
Definition: Partition of space into regions, each containing all points closest to one site.
Voronoi cell for site s_i:
V(s_i) = \{p : d(p, s_i) \leq d(p, s_j) \text{ for all } j \neq i\}
Construction: Nearest neighbor join creates Voronoi assignment.
Applications: - Service territories - Catchment areas - Thiessen polygons (rainfall interpolation)