What cell size should I use for an urban grid?

500 m cells work well for dense urban datasets where k ≥ 5 is routinely achieved. For suburban or rural data where density is low, increase to 1 000–2 000 m or switch to adaptive quadtree subdivision so sparse zones do not force suppression of entire regions.

Does grid aggregation satisfy GDPR Article 5(1)(c) data minimisation?

Yes, when configured correctly. Replacing precise coordinates with a cell centroid removes the exact-location attribute. Pair this with a suppression rule (k ≥ 5) and a documented audit trail to satisfy the accountability principle under Article 5(2).

Should I use square or hexagonal cells?

Square grids are simpler to generate and align with raster workflows; hexagonal grids reduce edge effects and equalize neighbour distances, which improves spatial autocorrelation metrics. For hotspot analysis or mobility heatmaps, hexagons typically outperform squares on utility metrics.

How do I handle points exactly on cell boundaries?

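Use a half-open assignment rule: each cell owns its western and southern edges but not its eastern and northern ones, so every boundary point belongs to exactly one cell. A strict containment predicate (ST_Contains or ST_Within) on its own drops points that lie exactly on an edge, while ST_Intersects on its own double-counts them and inflates density estimates.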

Grid Aggregation & Spatial Binning Strategies

Grid aggregation partitions continuous coordinate space into a uniform tessellation of discrete cells, then publishes each cell’s centroid and record count instead of individual point coordinates — eliminating precise location while preserving the spatial distribution needed for analysis.

When to Use Grid Aggregation vs. Alternatives

Not every privacy requirement calls for grid aggregation. The diagram below maps the key decision factors.

Use grid aggregation when density counts are sufficient; fall back to coordinate jittering and noise injection, Laplace/Gaussian DP noise, or spatial fuzzing when individual records or formal guarantees are required.

Algorithmic Specification

Cell Assignment Rule

Given a point $p = (x, y)$ in a projected metric CRS and a cell side length $c$ (metres), the integer grid indices $(i, j)$ are:

$$i = \left\lfloor \frac{x - x_{\min}}{c} \right\rfloor, \quad j = \left\lfloor \frac{y - y_{\min}}{c} \right\rfloor$$

The cell is the half-open rectangle $[i \cdot c + x_{\min},\ (i+1) \cdot c + x_{\min}) \times [j \cdot c + y_{\min},\ (j+1) \cdot c + y_{\min})$. The released centroid is:

$$\hat{p} = \left( \left(i + \tfrac{1}{2}\right) c + x_{\min},\ \left(j + \tfrac{1}{2}\right) c + y_{\min} \right)$$

Maximum centroid displacement from the true point is bounded by half the cell diagonal:

$$d_{\max} = \frac{c\sqrt{2}}{2}$$
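The assignment rule, centroid, and displacement bound above can be sketched in plain Python. The helper name `assign_cell` and the sample coordinates are illustrative assumptions, not part of any library; coordinates are assumed to be in a projected metric CRS.

```python
import math

def assign_cell(x: float, y: float, x_min: float, y_min: float, c: float):
    """Return ((i, j), centroid) for a point under the half-open floor rule.

    Hypothetical helper for illustration; c is the cell side length in metres.
    """
    i = math.floor((x - x_min) / c)
    j = math.floor((y - y_min) / c)
    centroid = ((i + 0.5) * c + x_min, (j + 0.5) * c + y_min)
    return (i, j), centroid

c = 500.0
d_max = c * math.sqrt(2) / 2            # half the cell diagonal, about 353.6 m

# A point in the interior of cell (2, 1); its centroid is (1250.0, 750.0)
idx, centroid = assign_cell(1_230.0, 740.0, 0.0, 0.0, c)
displacement = math.dist((1_230.0, 740.0), centroid)

# A point exactly on the shared edge x = 1000 belongs to the cell on its east side
edge_idx, _ = assign_cell(1_000.0, 10.0, 0.0, 0.0, c)

print(idx, centroid, round(displacement, 2), edge_idx)
```

The displacement of any point never exceeds `d_max`, and it reaches that bound only at a cell corner.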

k-Anonymity Suppression

A cell is released only when its record count $n_{\text{cell}} \geq k$. Suppressed cells reveal no count, no centroid, and no existence indicator. The overall k-anonymity guarantee for the published dataset is:

$$k_{\text{eff}} = \min_{j \in \text{released cells}} n_j$$
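As a minimal sketch of the suppression rule, assuming each record has already been assigned a cell index, the released set and the effective $k$ can be computed as follows. The function name `release_cells` and the sample records are illustrative.

```python
from collections import Counter

def release_cells(cell_ids, k: int) -> dict:
    """Count records per cell and keep only cells with at least k records.

    Suppressed cells are omitted entirely: no zero rows and no flags.
    """
    counts = Counter(cell_ids)
    return {cell: n for cell, n in counts.items() if n >= k}

# Illustrative input: one (i, j) cell index per record
records = [(0, 0)] * 7 + [(0, 1)] * 5 + [(1, 0)] * 2 + [(3, 3)] * 1

released = release_cells(records, k=5)
k_eff = min(released.values()) if released else None   # smallest released count

print(released, k_eff)
```

Here cells `(1, 0)` and `(3, 3)` fall below the floor and vanish from the output, so `k_eff` equals the smallest surviving count.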

Parameter Reference

Parameter	Typical range	Privacy effect	Utility effect
Cell size `c`	100 m – 5 000 m	Larger → stronger anonymity	Larger → lower spatial resolution
k-anonymity floor `k`	3 – 10	Higher → fewer released cells	Higher → larger data gaps
Grid origin $(x_{\min}, y_{\min})$	Fixed at dataset extent	Must be versioned	Determines cell alignment
CRS	Equal-area or UTM	Distorted cells give uneven protection across the extent	Affects distance accuracy

Prerequisites & Data Requirements

Before running a grid aggregation pipeline, confirm all of the following:

Projected metric CRS: Input geometries must be in an equal-area or locally optimized metric CRS, such as a region-specific UTM zone. EPSG:3857 is metric and convenient for web mapping, but it is not equal-area: the ground length of a cell side shrinks by roughly the cosine of the latitude, so a nominal 500 m cell spans about 250 m on the ground at 60° N. Latitude/longitude (EPSG:4326) produces cells whose ground area varies with latitude and must be reprojected before binning.
Geometry validity: Run shapely.validation.make_valid() (Python) or ST_MakeValid() (PostGIS) to repair self-intersections and other topology errors, and drop null or empty geometries explicitly; both cause silent pipeline failures if left in place.
Baseline privacy thresholds: Define $k$ (minimum records per cell) and an internal suppression indicator (boolean, not a count, and never published) before processing begins. Compliance frameworks typically require $k \geq 5$ for public releases; health and demographic data commonly require $k \geq 10$.
Minimum dataset size: Grids over fewer than $10 \times k$ total records will suppress the majority of cells, collapsing utility. Assess density before committing to a cell size.
Python dependencies: geopandas ≥ 0.14, shapely ≥ 2.0, numpy, pyproj, and scipy for the linkage test. For database execution: PostGIS 3.1+ with ST_SquareGrid / ST_HexagonGrid.
Column schema: Input must include a unique record identifier, a geometry column in a projected CRS, and any attribute columns needed for post-aggregation analysis. Sensitive quasi-identifiers (age, gender, diagnosis) must be dropped or generalized before the point layer reaches the binning pipeline.
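One way to assess density before committing to a cell size is a quick sweep over candidate sizes. The sketch below uses synthetic points as a stand-in for a real layer; `release_stats`, the cluster parameters, and the candidate sizes are all assumptions for illustration.

```python
import math
import random
from collections import Counter

def release_stats(points, cell_size: float, k: int):
    """Return (share of occupied cells released, share of records retained)."""
    counts = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in points
    )
    released = [n for n in counts.values() if n >= k]
    return len(released) / len(counts), sum(released) / len(points)

# Synthetic stand-in for a real point layer: a dense cluster plus sparse background
rng = random.Random(42)
dense = [(rng.gauss(5_000, 400), rng.gauss(5_000, 400)) for _ in range(2_000)]
sparse = [(rng.uniform(0, 10_000), rng.uniform(0, 10_000)) for _ in range(200)]
pts = dense + sparse

for size in (250, 500, 1_000, 2_000):
    cell_share, record_share = release_stats(pts, size, k=5)
    print(f"{size:>5} m  cells released: {cell_share:.0%}  records kept: {record_share:.0%}")
```

Because each candidate size here is a multiple of the previous one and the origin is fixed, cells nest, so the share of records kept cannot fall as the cell size grows. The useful signal is how quickly it rises.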

Step-by-Step Implementation

Step 1 — Validate and Reproject Input

import geopandas as gpd
import numpy as np
from shapely.geometry import box
from shapely.validation import make_valid

# Load sensitive point layer — already stripped of direct identifiers
points: gpd.GeoDataFrame = gpd.read_file("sensitive_locations.geojson")

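# 1a. Drop null/empty geometries, then repair any remaining invalid ones
points = points[points.geometry.notna() & ~points.geometry.is_empty].copy()
points["geometry"] = points.geometry.apply(
    lambda g: g if g.is_valid else make_valid(g)
)

# 1b. Reproject to a metric CRS. A layer without a CRS cannot be reprojected, so fail loudly.
# EPSG:3857 keeps the example generic; prefer a UTM zone or equal-area CRS for local work.
if points.crs is None:
    raise ValueError("Input layer has no CRS; set it explicitly before binning")
if points.crs.to_epsg() != 3857:
    points = points.to_crs(epsg=3857)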

Reprojection is a privacy step, not just a geometry step: lat/lon grids produce cells of unequal area, so privacy guarantees vary across the dataset extent.
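To make the effect concrete, the following sketch estimates how much ground a nominal cell covers at different latitudes. It relies on a spherical approximation (about 111.32 km per degree at the equator, and a Web Mercator scale factor of 1/cos(latitude)); both helper names are hypothetical.

```python
import math

METRES_PER_DEGREE = 111_320.0   # approximate length of one degree at the equator

def degree_cell_width_m(lat_deg: float, cell_deg: float) -> float:
    """East-west ground width of a cell defined in degrees of longitude."""
    return cell_deg * METRES_PER_DEGREE * math.cos(math.radians(lat_deg))

def web_mercator_ground_side_m(lat_deg: float, cell_m: float) -> float:
    """Approximate ground length of a cell side given in EPSG:3857 metres."""
    return cell_m * math.cos(math.radians(lat_deg))

for lat in (0, 30, 45, 60):
    print(
        f"lat {lat:>2}: 0.01 degree cell is {degree_cell_width_m(lat, 0.01):7.1f} m wide, "
        f"500 m Web Mercator cell is {web_mercator_ground_side_m(lat, 500):5.1f} m on the ground"
    )
```

In both cases the ground footprint, and with it the displacement bound, shrinks as latitude increases, which is why a UTM or equal-area CRS is the safer choice away from the equator.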

Step 2 — Generate the Grid Tessellation

# Cell side length in metres. Privacy/utility trade-off determined here.
CELL_SIZE: int = 500   # 500 m → d_max ≈ 354 m; adjust per density analysis
K_FLOOR: int = 5       # Minimum records required to release a cell

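bounds = points.total_bounds                       # (x_min, y_min, x_max, y_max)
x_min, y_min, x_max, y_max = bounds

# Integer cell counts avoid float-step drift; the +1 keeps x_max / y_max inside the grid
n_cols = int(np.floor((x_max - x_min) / CELL_SIZE)) + 1
n_rows = int(np.floor((y_max - y_min) / CELL_SIZE)) + 1

# Build grid column by column: deterministic, index-stable (cell_id = i * n_rows + j)
grid_geoms = [
    box(x_min + i * CELL_SIZE, y_min + j * CELL_SIZE,
        x_min + (i + 1) * CELL_SIZE, y_min + (j + 1) * CELL_SIZE)
    for i in range(n_cols)
    for j in range(n_rows)
]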
grid_gdf = gpd.GeoDataFrame(
    {"cell_id": range(len(grid_geoms))},
    geometry=grid_geoms,
    crs=points.crs,
)

Cache grid_gdf to disk (GeoPackage) so that all temporal snapshots of the dataset use the identical cell definitions — essential for longitudinal analysis.

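Step 3 — Spatial Join and Count

# Assign each point to exactly one cell.
# predicate="within" silently drops points that lie exactly on a cell edge, and
# "intersects" alone matches every touching cell, so join on "intersects" and
# keep a single match per point. Requires a unique index on `points`.
joined = gpd.sjoin(points, grid_gdf, how="inner", predicate="intersects")
joined = joined.sort_values("cell_id")
joined = joined[~joined.index.duplicated(keep="last")]   # highest cell_id wins

count_by_cell = (
    joined.groupby("cell_id")
    .size()
    .reset_index(name="record_count")
)

Joining on predicate="intersects" keeps boundary points, and the de-duplication step removes the resulting double-counts. Because cell_id increases with x and then y in Step 2, keeping the highest-numbered touching cell reproduces the half-open rule from the Cell Assignment Rule: a point on a shared edge goes to the cell on its eastern or northern side. Points on the outer boundary of the grid touch only one cell and stay in it.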
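The difference between boundary rules is easy to see in one dimension. This sketch is independent of geopandas; the two-cell layout and the rule names are illustrative assumptions.

```python
# Two adjacent 500 m cells along one axis: cell 0 spans 0-500, cell 1 spans 500-1000
cells = [(0.0, 500.0), (500.0, 1000.0)]

def matching_cells(x: float, rule: str) -> list:
    """Return the indices of cells that claim coordinate x under the given rule."""
    if rule == "strict":       # interior only, comparable to within/contains
        return [n for n, (lo, hi) in enumerate(cells) if lo < x < hi]
    if rule == "closed":       # both edges included, comparable to intersects
        return [n for n, (lo, hi) in enumerate(cells) if lo <= x <= hi]
    if rule == "half_open":    # lower edge included, upper edge excluded
        return [n for n, (lo, hi) in enumerate(cells) if lo <= x < hi]
    raise ValueError(f"unknown rule: {rule}")

boundary_x = 500.0
dropped = matching_cells(boundary_x, "strict")            # no cell claims the point
double_counted = matching_cells(boundary_x, "closed")     # both cells claim it
assigned_once = matching_cells(boundary_x, "half_open")   # exactly one cell claims it

print(dropped, double_counted, assigned_once)
```

Only the half-open rule gives every boundary coordinate exactly one owner, which is the property the count step depends on.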

Step 4 — Apply k-Anonymity Suppression

# Privacy filter: only release cells that meet the k floor.
# Do NOT release suppressed cell indices or zero counts — existence leakage.
released = count_by_cell[count_by_cell["record_count"] >= K_FLOOR].copy()

released["bin_centroid"] = grid_gdf.loc[
    released["cell_id"], "geometry"
].centroid.values

# Build output as a point layer of released centroids
output = gpd.GeoDataFrame(
    released[["cell_id", "record_count"]],
    geometry=released["bin_centroid"],
    crs=points.crs,
)

# Reproject centroids back to WGS 84 for downstream GeoJSON/tile publishing
output_wgs84 = output.to_crs(epsg=4326)
output_wgs84.to_file("anonymized_centroids.geojson", driver="GeoJSON")

The suppression step is the re-identification risk control. Never publish a list of suppressed cells, even with zero counts — their presence in the output is itself a disclosure.

Step 5 — PostGIS Native Execution (Production Scale)

For datasets above 10 million records, move the pipeline into PostGIS. ST_SquareGrid (PostGIS 3.1+) generates the tessellation server-side, leveraging spatial indexes for the join.

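-- Assumes sensitive_points.geom is already stored in EPSG:3857.
-- ST_SquareGrid(size, bounds) returns rows of (geom, i, j).
WITH grid AS (
  SELECT (ST_SquareGrid(
    500,
    ST_SetSRID(ST_Extent(geom)::geometry, 3857)   -- extent is already in 3857; only tag the SRID
  )).geom AS cell
  FROM sensitive_points
),
counts AS (
  SELECT
    g.cell,
    COUNT(p.id)  AS record_count
  FROM grid g
  JOIN sensitive_points p
    ON  p.geom && g.cell                  -- bounding-box test, can use the spatial index
    AND ST_X(p.geom) >= ST_XMin(g.cell)   -- half-open cell: west and south edges included,
    AND ST_X(p.geom) <  ST_XMax(g.cell)   -- east and north edges excluded, so a boundary
    AND ST_Y(p.geom) >= ST_YMin(g.cell)   -- point is counted in exactly one cell
    AND ST_Y(p.geom) <  ST_YMax(g.cell)
  GROUP BY g.cell
)
SELECT
  ST_Transform(ST_Centroid(cell), 4326) AS bin_centroid_wgs84,
  record_count
FROM counts
WHERE record_count >= 5;               -- k-anonymity floor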

For hexagonal tessellations that reduce edge effects in hotspot analysis, see the dedicated guide on implementing hexagonal grid aggregation in PostGIS.

Validation & Re-identification Testing

Entropy Check on Published Centroids

Before publishing, verify that the output centroid layer does not allow auxiliary-join re-identification. For each released centroid, compute the distance to the nearest points in a plausible auxiliary dataset (e.g., a public POI list). If exactly one auxiliary record lies within $d_{\max} = c\sqrt{2}/2$ of a centroid, a spatial linkage attack can tie that cell's records to the auxiliary location.

from scipy.spatial import KDTree
import numpy as np

# Simulate linkage attack: can an attacker match centroids to auxiliary POIs?
# aux_gdf is assumed to be a point GeoDataFrame loaded by the caller (at least two rows).
# d_max is in metres, so both layers must be compared in the projected CRS, not WGS 84.
aux_proj = aux_gdf.to_crs(output.crs)
aux_coords = np.column_stack([aux_proj.geometry.x, aux_proj.geometry.y])
pub_coords = np.column_stack([output.geometry.x, output.geometry.y])

tree = KDTree(aux_coords)
d_max = (CELL_SIZE * np.sqrt(2)) / 2   # theoretical maximum displacement (metres)

# Unique match: the nearest POI lies within d_max and the second-nearest does not
dists, idxs = tree.query(pub_coords, k=2)
unique_matches = np.sum((dists[:, 0] <= d_max) & (dists[:, 1] > d_max))
print(f"Potential linkage matches: {unique_matches} / {len(pub_coords)}")
# Target: 0. Increase cell size or k until no unique matches remain.

Neighbour-Count Audit

For each released cell, count how many neighbouring cells (Moore neighbourhood, 8 adjacent) are also released. Isolated released cells with no released neighbours warrant additional scrutiny: the absence of surrounding releases may itself indicate a sparse-but-identifiable population.
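A minimal sketch of this audit, assuming released cells are identified by their integer `(i, j)` grid indices; the function name and sample cells are illustrative.

```python
def released_neighbour_counts(released) -> dict:
    """Map each released (i, j) cell to its number of released Moore neighbours."""
    offsets = [
        (di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)
    ]
    cells = set(released)
    return {
        (i, j): sum((i + di, j + dj) in cells for di, dj in offsets)
        for (i, j) in cells
    }

# Illustrative release: a small cluster plus one cell far from everything else
released = {(0, 0), (0, 1), (1, 1), (5, 5)}
counts = released_neighbour_counts(released)
isolated = sorted(cell for cell, n in counts.items() if n == 0)

print(counts, isolated)
```

Cells that end up in `isolated` are the ones to review by hand before release.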

Moran’s I Utility Check

Spatial autocorrelation (Moran’s I) measures whether the published density surface preserves the clustering structure of the raw data. Compute Moran’s I on both the raw counts (at fine resolution) and the aggregated counts; the ratio should remain above 0.80 for acceptable utility.
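The ratio can be sketched without external libraries. The version below computes global Moran's I with binary queen-contiguity weights over cells keyed by `(i, j)`; the weighting scheme, the function name, and the synthetic gradient surface are assumptions, and a real pipeline might prefer a dedicated spatial statistics package.

```python
def morans_i(values: dict) -> float:
    """Global Moran's I for {(i, j): value} with binary queen-contiguity weights."""
    offsets = [
        (di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)
    ]
    n = len(values)
    mean = sum(values.values()) / n
    dev = {cell: v - mean for cell, v in values.items()}

    cross = 0.0      # sum of w_ab * dev_a * dev_b over ordered neighbour pairs
    w_sum = 0        # total weight (number of ordered neighbour pairs)
    for (i, j), d in dev.items():
        for di, dj in offsets:
            nb = (i + di, j + dj)
            if nb in dev:
                cross += d * dev[nb]
                w_sum += 1

    denom = sum(d * d for d in dev.values())
    if w_sum == 0 or denom == 0:
        raise ValueError("Moran's I is undefined: no neighbours or no variance")
    return (n / w_sum) * (cross / denom)

# Synthetic fine-resolution surface: counts rise from west to east on an 8 x 8 grid
fine = {(i, j): float(i) for i in range(8) for j in range(8)}

# Aggregate 2 x 2 blocks of fine cells into a 4 x 4 coarse grid by summing counts
coarse = {}
for (i, j), v in fine.items():
    key = (i // 2, j // 2)
    coarse[key] = coarse.get(key, 0.0) + v

ratio = morans_i(coarse) / morans_i(fine)
print(f"Moran's I ratio (aggregated / fine): {ratio:.2f}")
```

Suppressed cells are simply missing from the dictionary, so they contribute neither values nor neighbour links.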

Common Failure Modes & Gotchas

Projection mismatch causing unequal cell areas. If input data is in EPSG:4326 and you generate the grid before reprojecting, cells are defined in degrees: a cell at high latitude covers far less ground than the same cell at the equator, because meridians converge toward the poles. The displacement bound $d_{\max}$ then varies across the extent, violating the equal-protection assumption of k-anonymity. Always reproject first.

Temporal drift breaking grid alignment. If bounding boxes are recomputed per time slice, each snapshot uses a slightly different origin $(x_{\min}, y_{\min})$, shifting cell boundaries and making longitudinal comparison invalid. Fix: precompute a canonical grid extent covering the full observation period and reuse it across all snapshots.

Sparse-region utility collapse. Fixed cell sizes over low-density rural areas will suppress most cells, leaving blank regions that implicitly reveal population absence. Address this by hybridising with coordinate jittering and noise injection in sparse zones: add Gaussian noise to centroids of cells that barely meet $k$, breaking exact centroid linkage while preserving count integrity.
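A minimal sketch of that hybrid step, assuming released cells are held as dictionaries with a projected centroid and a count. The `margin` and `sigma_m` parameters are hypothetical tuning knobs, not values recommended by this guide.

```python
import random

def jitter_marginal_cells(cells, k: int, margin: int, sigma_m: float, rng):
    """Add Gaussian noise to centroids of cells with k <= count < k + margin.

    Counts are never modified; cells well above the floor keep exact centroids.
    """
    out = []
    for cell in cells:
        x, y = cell["centroid"]
        if k <= cell["record_count"] < k + margin:
            x += rng.gauss(0.0, sigma_m)
            y += rng.gauss(0.0, sigma_m)
        out.append({"centroid": (x, y), "record_count": cell["record_count"]})
    return out

released = [
    {"centroid": (250.0, 250.0), "record_count": 42},   # comfortably above k
    {"centroid": (750.0, 250.0), "record_count": 5},    # barely meets k
]

# Fixed seed only to make the sketch repeatable; a real release should draw from
# an unpredictable source such as random.SystemRandom().
rng = random.Random(7)
jittered = jitter_marginal_cells(released, k=5, margin=3, sigma_m=50.0, rng=rng)
print(jittered)
```

Record the chosen noise scale alongside the other release parameters, since it changes the effective displacement for the affected cells.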

Existence leakage from suppression indicators. Publishing a “suppressed” flag alongside released cells tells an attacker exactly where sparse populations live. Omit suppressed cells from the output entirely — their absence is the privacy control.

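Floating-point precision in grid indices. np.arange with a non-integer step is sensitive to rounding error (the NumPy documentation recommends numpy.linspace for that case), and cell edges built by repeated addition drift. Use integer arithmetic for grid index computation (int((x - x_min) // cell_size)) to guarantee deterministic cell assignment.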
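The drift is easiest to see with a fractional step. This sketch compares edges built by repeated addition with edges computed from an integer index; it illustrates the general floating-point effect rather than the internals of any particular library.

```python
x_min, cell_size = 0.0, 0.1     # a fractional step makes the drift visible

# Edges built by repeated addition accumulate one rounding error per step
acc_edges = [x_min]
for _ in range(10):
    acc_edges.append(acc_edges[-1] + cell_size)

# Edges computed from an integer index involve a single multiplication each
idx_edges = [x_min + i * cell_size for i in range(11)]

print(acc_edges[10] == 1.0)   # False: the accumulated edge has drifted below 1.0
print(idx_edges[10] == 1.0)   # True
```

A point at exactly 1.0 would fall on different sides of the last edge under the two constructions, which is the kind of silent reassignment the integer-index approach avoids.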

Centroid coordinate precision in output. Rounding released centroid coordinates to fewer than 4 decimal places of a degree (4 places is approximately 11 m at the equator) can silently undermine the release: centroids shift off the grid, and for small cells two different cells may round to the same coordinate. Retain at least 6 decimal places in GeoJSON output.

Compliance Alignment

Grid aggregation maps cleanly to several regulatory and standards frameworks:

Control	Satisfied by
GDPR Art. 5(1)(c) — data minimisation	Centroid replaces precise coordinate; attributes not needed for the count are dropped
GDPR Art. 5(1)(e) — storage limitation	Grid definitions version-controlled; raw points not stored post-aggregation
GDPR Art. 25 — data protection by design	$k$ threshold and suppression rule are architectural, not post-hoc
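CCPA “deidentified” information standard	Published counts cannot reasonably be linked to a particular consumer, provided $k \geq 5$ holds and linkage tests against auxiliary datasets pass
HIPAA Safe Harbor (§164.514(b)(2))	Applies only if cells are at least as coarse as a three-digit ZIP code area containing more than 20,000 people; finer grids generally need Expert Determination (§164.514(b)(1))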
NIST SP 800-188 geospatial de-identification	Cell-size and $k$ parameters documented in the de-identification record

To satisfy GDPR accountability (Art. 5(2)) and support CCPA compliance audits, maintain a version-controlled configuration manifest containing:

Grid cell dimension and generation algorithm (square vs. hexagonal)
CRS projection parameters and transformation chain
$k$ threshold and suppression rule (suppressed cells omitted, never zero-filled)
Grid origin coordinates and the date range they cover
Validation metrics: $d_{\max}$, linkage-attack simulation results, Moran’s I ratio
Identity of the data controller and legal basis for processing
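One possible shape for such a manifest is a frozen dataclass serialised to JSON, with a hash that lets every snapshot confirm it used the same configuration. All field names and values below are hypothetical placeholders, including the CRS code and the controller name.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class BinningManifest:
    cell_size_m: float
    tessellation: str            # "square" or "hexagonal"
    crs: str
    transformation_chain: tuple  # ordered CRS steps applied to the input
    k_threshold: int
    suppression_rule: str
    grid_origin: tuple           # (x_min, y_min) in CRS units
    valid_from: str              # ISO dates for the period the origin covers
    valid_to: str
    validation: dict             # d_max, linkage test result, Moran's I ratio
    data_controller: str
    legal_basis: str

    def fingerprint(self) -> str:
        """SHA-256 over the canonical JSON form, for change detection."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

# Placeholder values for illustration only
manifest = BinningManifest(
    cell_size_m=500.0,
    tessellation="square",
    crs="EPSG:3857",
    transformation_chain=("EPSG:4326", "EPSG:3857"),
    k_threshold=5,
    suppression_rule="omit",
    grid_origin=(0.0, 0.0),
    valid_from="2024-01-01",
    valid_to="2024-12-31",
    validation={"d_max_m": 353.55, "linkage_unique_matches": None, "morans_i_ratio": None},
    data_controller="Example Org (placeholder)",
    legal_basis="placeholder",
)

print(json.dumps(asdict(manifest), indent=2, sort_keys=True))
print(manifest.fingerprint())
```

Committing the JSON and its fingerprint next to each published dataset makes any change to the origin, cell size, or threshold visible in review.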

Accompany published datasets with a data dictionary stating the binning methodology, known limitations (edge suppression, rural data gaps), and recommended analytical use — this transparency satisfies external regulatory scrutiny and positions the technique as a defensible privacy control.

Related

k-Anonymity Grouping for Location Traces — set and validate the k threshold before binning
Coordinate Jittering & Noise Injection Methods — complement binning in sparse density zones
Spatial Fuzzing & Buffer Zone Implementation — alternative when individual records must be preserved
Re-identification Risk Assessment for Geospatial Datasets — stress-test released centroids against auxiliary datasets
Implementing Hexagonal Grid Aggregation in PostGIS — hexagonal tessellation for lower edge effects
