Implementing Hexagonal Grid Aggregation in PostGIS

PostGIS’s ST_HexagonGrid function (available from PostGIS 3.1) tessellates a study area into hexagons of equal planar area. Joining sensitive records to those cells and enforcing a k-anonymity threshold reduces coordinate-level re-identification risk and yields an aggregate dataset ready for publication from a single SQL pipeline.

Core Formula and Parameter Table

The privacy guarantee for a hexagonal aggregate rests on the k-anonymity condition applied to each cell:

$$k_{\text{cell}} = \lvert \{ r \in \text{records} \mid \text{ST\_Intersects}(h, r.\text{geom}) \} \rvert \geq k_{\min}$$

Here $h$ is a hexagon of the grid and $r$ ranges over the source records. Any cell where $k_{\text{cell}} < k_{\min}$ is suppressed before publication. Cell size $s$ (the hexagon edge length passed to ST_HexagonGrid, in metres) controls the privacy-utility trade-off: smaller cells give higher analytical fidelity but more suppressed cells in sparse regions; larger cells yield higher $k_{\text{cell}}$ values at the cost of spatial precision.
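
As a minimal sketch of the suppression rule, assuming per-cell counts have already been computed and are keyed by grid index `(i, j)`: the function name `apply_k_threshold` and the input layout are illustrative and not part of the pipeline shown later.

```python
from typing import Optional

Cell = tuple[int, int]


def apply_k_threshold(
    counts: dict[Cell, int], k_min: int
) -> dict[Cell, Optional[int]]:
    """Return counts with every cell below k_min replaced by None."""
    if k_min < 1:
        raise ValueError("k_min must be a positive integer")
    return {
        cell: (n if n >= k_min else None)  # None marks a suppressed cell
        for cell, n in counts.items()
    }


# Hypothetical counts for four cells
cell_counts = {(0, 0): 17, (0, 1): 4, (1, 0): 5, (1, 1): 1}
released = apply_k_threshold(cell_counts, k_min=5)
# (0, 1) and (1, 1) fall below the floor and are suppressed.
print(released)
```

A cell whose count equals `k_min` exactly is released, matching the $\geq$ in the condition.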

| Parameter | Typical range | Privacy effect |
|---|---|---|
| `cell_size_m` | 250 – 5 000 m | Larger → fewer suppressed cells, lower spatial precision |
| `k_min` | 5 – 20 | Higher → stronger suppression of rare locations |
| CRS (metric) | EPSG:3857 / UTM | Must be metric; degrees cause severe distortion |
| `suppress_value` | `NULL` or `-1` | Controls how suppressed cells appear downstream |

Worked numeric example. Study area: 10 km × 10 km urban block. Cell width (flat to flat): 500 m, which corresponds to an ST_HexagonGrid edge length of $500/\sqrt{3} \approx 289$ m. Expected grid: $\approx (10\,000/500)^2 \times \frac{2}{\sqrt{3}} \approx 462$ hexagons, since a hexagon of width $w$ has area $\frac{\sqrt{3}}{2} w^2$. With 8 000 records distributed uniformly, the mean cell count is $\approx 17$. Setting $k_{\min} = 5$ then suppresses mainly the cells along the boundary that only partly overlap the data, roughly 8–12 % of cells, retaining $\geq 88$ % of the grid for publication.
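
The arithmetic can be checked with a few lines of Python. This sketch assumes the 500 m figure is the flat-to-flat width of a regular hexagon, which is the reading under which the 462-cell estimate holds; for an edge length $s$ the width is $\sqrt{3}\,s$. The helper names are illustrative.

```python
import math


def hex_area_from_width(width_m: float) -> float:
    """Area of a regular hexagon given its flat-to-flat width."""
    return (math.sqrt(3) / 2) * width_m ** 2


def hex_area_from_edge(edge_m: float) -> float:
    """Area of a regular hexagon given its edge length."""
    return (3 * math.sqrt(3) / 2) * edge_m ** 2


side_m = 10_000      # study area is 10 km x 10 km
width_m = 500        # assumed flat-to-flat cell width
n_records = 8_000

n_cells = side_m ** 2 / hex_area_from_width(width_m)
mean_per_cell = n_records / n_cells
edge_m = width_m / math.sqrt(3)   # equivalent edge length

print(round(n_cells), round(mean_per_cell, 1), round(edge_m, 1))
# 462 17.3 288.7
```

The estimate ignores partial cells along the boundary, so the number of hexagons a real grid returns for the bounding box will be somewhat higher.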

Python Implementation

The query below is wrapped in a Python function using psycopg2. It accepts typed parameters, rejects the case where both CRS arguments are EPSG:4326, logs suppression statistics, and returns them alongside the rows in an audit dataclass. Privacy-relevant decisions are explained inline.

from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import Any

import psycopg2
import psycopg2.extras

logger = logging.getLogger(__name__)


@dataclass
class HexAggResult:
    rows: list[dict[str, Any]]
    total_cells: int
    suppressed_cells: int
    suppression_rate: float
    cell_size_m: float
    k_min: int
    crs_epsg: int


def hexagonal_aggregate(
    conn: psycopg2.extensions.connection,
    source_table: str,
    geom_col: str = "geom",
    source_epsg: int = 4326,
    cell_size_m: float = 1000.0,
    k_min: int = 5,
    metric_epsg: int = 3857,
    value_col: str | None = None,
) -> HexAggResult:
    """
    Aggregate sensitive point records into a k-anonymous hexagonal grid.

    Privacy decisions
    -----------------
    * `k_min` (default 5): cells with fewer than k_min records are suppressed
      (aggregated_metric set to NULL, privacy_status set to 'SUPPRESSED').
    * `metric_epsg` must be a metric CRS — degrees produce malformed hexagons.
    * The function returns grid indices (i, j) so cells are deterministically
      referenceable across pipeline runs without re-exposing raw coordinates.

    Parameters
    ----------
    conn          : active psycopg2 connection (autocommit acceptable for reads)
    source_table  : fully-qualified table name, e.g. 'public.health_visits'
    geom_col      : geometry column name in source_table
    source_epsg   : EPSG code of source geometries (default 4326 / WGS 84)
    cell_size_m   : hexagon edge length in metres (the ST_HexagonGrid `size`
                    argument); flat-to-flat width is sqrt(3) times this value
    k_min         : minimum record count required to release a cell's metric
    metric_epsg   : metric CRS for grid generation (3857 or a UTM zone)
    value_col     : optional numeric column to aggregate (AVG); if None, only
                    counts are returned

    Returns
    -------
    HexAggResult dataclass with rows list and audit metadata
    """
    if metric_epsg == source_epsg and source_epsg == 4326:
        raise ValueError(
            "source_epsg and metric_epsg are both 4326 (degrees). "
            "Supply a metric CRS for metric_epsg, e.g. 3857 or a UTM zone."
        )

    # Build the optional aggregated metric expression.
    # The expression is evaluated in the `anonymized` CTE, whose only FROM
    # item is `spatial_join`, so the column must not carry the `p.` alias.
    if value_col:
        # Suppress metric for cells below the k floor; retain NULL semantics
        metric_expr = (
            f"CASE WHEN COUNT({value_col}) >= %(k_min)s "
            f"THEN AVG({value_col}::numeric) ELSE NULL END AS aggregated_metric"
        )
    else:
        metric_expr = "NULL::numeric AS aggregated_metric"

    sql = f"""
    WITH hex_grid AS (
        -- Generate hexagons over the data bounding box.
        -- ST_Extent returns a box2d with no SRID; re-tag it before reprojecting.
        SELECT (ST_HexagonGrid(
            %(cell_size_m)s,
            ST_Transform(
                ST_SetSRID(ST_Extent({geom_col}), %(source_epsg)s),
                %(metric_epsg)s
            )
        )).*
        FROM {source_table}
    ),
    spatial_join AS (
        -- INNER JOIN drops empty cells early to reduce memory pressure.
        SELECT
            h.geom  AS hex_geom,
            h.i,
            h.j,
            p.*
        FROM hex_grid h
        INNER JOIN {source_table} p
            ON ST_Intersects(
                h.geom,
                ST_Transform(p.{geom_col}, %(metric_epsg)s)
            )
    ),
    anonymized AS (
        SELECT
            hex_geom,
            i,
            j,
            COUNT(*) AS record_count,
            {metric_expr},
            CASE
                WHEN COUNT(*) >= %(k_min)s THEN 'RELEASED'
                ELSE 'SUPPRESSED'  -- k-anonymity floor not met
            END AS privacy_status
        FROM spatial_join
        GROUP BY hex_geom, i, j
    )
    SELECT
        ST_AsGeoJSON(hex_geom)::json AS geometry,
        i, j,
        record_count,
        aggregated_metric,
        privacy_status
    FROM anonymized
    ORDER BY i, j;
    """

    params = {
        "cell_size_m": cell_size_m,
        "source_epsg": source_epsg,
        "metric_epsg": metric_epsg,
        "k_min": k_min,
    }

    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute(sql, params)
        rows = [dict(r) for r in cur.fetchall()]

    total = len(rows)
    suppressed = sum(1 for r in rows if r["privacy_status"] == "SUPPRESSED")
    rate = suppressed / total if total > 0 else 0.0

    logger.info(
        "hex_aggregate complete: total_cells=%d suppressed=%d (%.1f%%) "
        "cell_size_m=%.0f k_min=%d crs=%d",
        total, suppressed, rate * 100, cell_size_m, k_min, metric_epsg,
    )

    return HexAggResult(
        rows=rows,
        total_cells=total,
        suppressed_cells=suppressed,
        suppression_rate=rate,
        cell_size_m=cell_size_m,
        k_min=k_min,
        crs_epsg=metric_epsg,
    )
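
Because `source_table`, `geom_col`, and `value_col` are interpolated into the SQL text rather than bound as parameters, they should never come from untrusted input. One possible guard, sketched here with a hypothetical helper `check_identifier` that is not part of the function above, is to validate each name against a conservative pattern before the query is built. Quoting with psycopg2's `sql.Identifier` is an alternative.

```python
import re

# Plain unquoted identifier, optionally schema-qualified (schema.table).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def check_identifier(name: str) -> str:
    """Return name unchanged if it looks like a plain identifier, else raise."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


print(check_identifier("public.health_visits"))
try:
    check_identifier("geom; DROP TABLE users")
except ValueError as exc:
    print(exc)
```

The pattern is deliberately stricter than what PostgreSQL accepts, so quoted or non-ASCII names would need a different approach.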

Verification Snippet

Run these checks immediately after hexagonal_aggregate returns to confirm the implementation meets the target $k_{\min}$ bound and has no CRS artefacts:

def verify_hex_aggregate(result: HexAggResult, k_min: int) -> None:
    """
    Assert k-anonymity guarantees and flag unexpected suppression rates.

    Raises AssertionError if any RELEASED cell has record_count < k_min.
    Logs a warning if suppression_rate > 0.20 (suggests cell_size_m is too small
    for the point density, or k_min is too aggressive for this dataset).
    """
    violations = [
        r for r in result.rows
        if r["privacy_status"] == "RELEASED" and r["record_count"] < k_min
    ]
    assert not violations, (
        f"{len(violations)} RELEASED cells below k={k_min}: "
        f"first offender i={violations[0]['i']}, j={violations[0]['j']}, "
        f"count={violations[0]['record_count']}"
    )

    if result.suppression_rate > 0.20:
        logger.warning(
            "Suppression rate %.1f%% exceeds 20%%. "
            "Consider increasing cell_size_m (currently %.0f m) "
            "or reducing k_min (currently %d).",
            result.suppression_rate * 100,
            result.cell_size_m,
            result.k_min,
        )

    # Sanity-check: every cell must have a non-empty geometry
    empty_geom = [r for r in result.rows if not r.get("geometry")]
    assert not empty_geom, f"{len(empty_geom)} cells returned with NULL geometry"

    logger.info("verify_hex_aggregate PASSED: k_min=%d, cells=%d", k_min, result.total_cells)

Also confirm the CRS before running the pipeline:

# Quick SRID check — run before calling hexagonal_aggregate
with conn.cursor() as cur:
    cur.execute(
        "SELECT ST_SRID(geom) AS srid FROM public.sensitive_points LIMIT 1;"
    )
    row = cur.fetchone()
    assert row and row[0] == 4326, (
        f"Unexpected SRID {row[0] if row else 'NULL'} — "
        "update source_epsg parameter or re-project source data."
    )
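
One thing the checks above do not cover: the rows returned by `hexagonal_aggregate` still carry the exact `record_count` of SUPPRESSED cells, which is useful for auditing but would leak small counts if published as-is. A minimal sketch of a redaction step follows, assuming rows shaped like those in `HexAggResult.rows`; the helper name `redact_for_publication` is illustrative.

```python
from typing import Any


def redact_for_publication(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Copy rows, blanking counts and metrics for non-released cells."""
    published = []
    for row in rows:
        out = dict(row)  # shallow copy so the audit rows stay untouched
        if out.get("privacy_status") != "RELEASED":
            out["record_count"] = None
            out["aggregated_metric"] = None
        published.append(out)
    return published


# Hypothetical audit rows in the shape returned by the pipeline
audit_rows = [
    {"i": 0, "j": 0, "record_count": 17, "aggregated_metric": 3.2,
     "privacy_status": "RELEASED"},
    {"i": 0, "j": 1, "record_count": 2, "aggregated_metric": None,
     "privacy_status": "SUPPRESSED"},
]
public_rows = redact_for_publication(audit_rows)
print(public_rows[1]["record_count"])  # None
```

Keeping the unredacted rows for the audit log and publishing only the redacted copy separates the two uses.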

Edge Cases and Adjustments

Sparse or rural data. When point density drops below $\approx k_{\min} / \text{cell\_area}$, most cells will be suppressed. Switch to a two-tier strategy: use a coarser grid (e.g. 5 000 m) for sparsely populated administrative units and a finer grid (e.g. 500 m) for urban cores, combined in a single UNION ALL query.
Non-uniform density with hot-spots. Dense urban centres drive most cell counts well above $k_{\min}$, while adjacent peri-urban fringe cells hover near the threshold. Coordinate jittering of source points before aggregation can smear hot-spot edges and may reduce fringe suppression without changing cell geometry, at the cost of moving some records into neighbouring cells.
Temporal windowing. When aggregating timestamped mobility data, ensure the time window is wide enough that each cell contains $\geq k_{\min}$ distinct individuals — not just $k_{\min}$ visits from a smaller population. Add a COUNT(DISTINCT individual_id) check alongside COUNT(*).
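CRS gotcha at high latitudes. EPSG:3857 (Web Mercator) stretches lengths by roughly $1/\cos\varphi$ at latitude $\varphi$, so at 60°N a nominal 1 000 m cell edge spans only about 500 m on the ground and the cell covers about a quarter of its nominal area. The effect is already noticeable at mid-latitudes. For Arctic, Nordic, or polar datasets use an equal-area projection, e.g. EPSG:3035 (ETRS89-extended / LAEA Europe) for Nordic data or EPSG:6931 (WGS 84 / NSIDC EASE-Grid 2.0 North) for Arctic data. EPSG:6933, the global EASE-Grid 2.0, also preserves area but stretches cell shape strongly near the poles.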
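
To put a number on the Web Mercator distortion, the sketch below uses the spherical approximation in which the linear scale factor at latitude φ is 1/cos φ, so a cell drawn with nominal edge s covers a ground edge of roughly s·cos φ. The function names are illustrative.

```python
import math


def mercator_scale(lat_deg: float) -> float:
    """Linear scale factor of Web Mercator at a latitude (spherical model)."""
    return 1.0 / math.cos(math.radians(lat_deg))


def ground_edge_m(nominal_edge_m: float, lat_deg: float) -> float:
    """Approximate true edge length of a cell drawn in EPSG:3857."""
    return nominal_edge_m / mercator_scale(lat_deg)


# Columns: latitude, linear scale, area scale, ground edge of a 1000 m cell
for lat in (0, 45, 60):
    k = mercator_scale(lat)
    print(lat, round(k, 2), round(k ** 2, 2), round(ground_edge_m(1000, lat)))
```

At 45° the area factor is already about 2, which is why the choice of projection matters for a k-anonymity threshold well outside polar regions: a cell that covers less ground than intended will tend to hold fewer records.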

Frequently Asked Questions

What k value satisfies GDPR when publishing hexagonal aggregate maps?

GDPR does not mandate a numeric threshold. Article 89 requires appropriate safeguards for statistical and research processing, and Recital 26 frames anonymisation in terms of the means reasonably likely to be used for re-identification rather than a value of $k$. In statistical disclosure control practice $k \geq 5$ is a commonly used baseline, and $k \geq 10$ or higher is typical for health, mobility, or sensitive-category attributes. Map the required $k$ to your GDPR/CCPA compliance obligations and document the chosen threshold in your data-protection impact assessment.

Why does ST_HexagonGrid produce severely distorted cells when I pass EPSG:4326 coordinates?

ST_HexagonGrid interprets its size parameter in the units of the supplied geometry's CRS. In EPSG:4326 those units are decimal degrees, so cell_size_m = 1000 requests hexagons with 1 000-degree edges, far larger than the entire 360° × 180° coordinate domain. Always call ST_Transform(..., metric_epsg) on the bounding box before passing it to ST_HexagonGrid. The function itself does not reproject; the caller must supply metric coordinates.

How does hexagonal binning compare to square-grid binning for k-anonymity utility?

Hexagonal cells have a uniform centre-to-neighbour distance: all six adjacent centroids lie at the same spacing, which is $\sqrt{3} \times$ cell_size_m when cell_size_m is the hexagon edge length. Square grids have two neighbour distances ($1\times$ and $\sqrt{2}\times$ the cell side), introducing directional bias in spatial linkage attack vectors that exploit corner-adjacency to reconstruct individual trajectories. The geometric isotropy of hexagons reduces this attack surface and produces more consistent $k_{\text{cell}}$ distributions across the grid.
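
The neighbour-distance comparison can be checked numerically. The sketch below places the centres adjacent to one cell for a hexagonal lattice and for a square grid, both with centre spacing `spacing`, and lists the distinct centre-to-centre distances. The helper names are illustrative.

```python
import math


def hex_neighbour_distances(spacing: float) -> list[float]:
    """Distances from a hexagon centre to its six adjacent centres."""
    # Axial lattice offsets (q, r) of the six neighbours.
    axial = [(1, 0), (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1)]
    return [
        math.hypot(spacing * (q + r / 2), spacing * (math.sqrt(3) / 2) * r)
        for q, r in axial
    ]


def square_neighbour_distances(spacing: float) -> list[float]:
    """Distances from a square centre to its eight surrounding centres."""
    offsets = [
        (dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)
    ]
    return [math.hypot(dx * spacing, dy * spacing) for dx, dy in offsets]


hex_d = sorted({round(d, 6) for d in hex_neighbour_distances(1.0)})
sq_d = sorted({round(d, 6) for d in square_neighbour_distances(1.0)})
print(hex_d)  # one distinct distance
print(sq_d)   # two distinct distances
```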

Can hexagonal aggregation be combined with a differential privacy noise budget?

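Yes, the two mechanisms are complementary. Apply the k-anonymity suppression first to eliminate singletons and near-zero cells, then add calibrated Laplace noise (from the privacy budget allocation for the query) to surviving cell counts. Suppression removes the cases where noise alone would need an impractically large magnitude to mask a count of 1 or 2. In practice the combination protects small cells better than either mechanism alone. One caveat: a suppression decision made on the true counts is not itself covered by the differential privacy guarantee, because which cells are missing reveals something about the data. If a formal ε bound on the whole release is required, threshold the noisy counts instead of the true ones.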
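
A minimal sketch of the suppress-then-noise order described above, using only the standard library. It assumes each individual contributes at most one record, so a count has sensitivity 1, and that `epsilon` is the budget allotted to this release. The function name is illustrative, and Laplace noise is drawn as the difference of two exponential variates.

```python
import random
from typing import Optional

Cell = tuple[int, int]


def suppress_then_noise(
    counts: dict[Cell, int],
    k_min: int,
    epsilon: float,
    rng: random.Random,
) -> dict[Cell, Optional[float]]:
    """Suppress cells below k_min, then add Laplace(1/epsilon) noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon  # assumes sensitivity 1
    noisy: dict[Cell, Optional[float]] = {}
    for cell, n in counts.items():
        if n < k_min:
            noisy[cell] = None  # suppressed, no value released
            continue
        # Difference of two Exp(rate=1/scale) draws is Laplace(0, scale).
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[cell] = n + noise
    return noisy


rng = random.Random(42)  # fixed seed only to make the demo repeatable
out = suppress_then_noise({(0, 0): 40, (0, 1): 3}, k_min=5, epsilon=1.0, rng=rng)
print(out)
```

Python's `random` module is not designed for security-sensitive use, so a production release would normally rely on a vetted differential privacy library instead.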

Related

Grid Aggregation & Spatial Binning Strategies: parent page covering square, hexagonal, and adaptive binning approaches
k-Anonymity Grouping for Location Traces: threshold selection and grouping algorithms for trajectory data
Re-identification Risk Assessment for Geospatial Datasets: quantifying residual risk after spatial aggregation
Privacy Budget Allocation for Spatial Queries: composing ε across hex-grid releases and repeated queries
Compliance Mapping for GDPR/CCPA Location Data: regulatory requirements for published spatial aggregates
