Privacy Budget Allocation for Spatial Queries

Privacy budget allocation is the discipline of distributing a finite differential privacy (DP) ε budget across multiple spatial queries so that the cumulative privacy loss never exceeds the authorised threshold — preventing silent re-identification when overlapping regions, hierarchical grids, or repeated releases exhaust DP guarantees faster than naive accounting suggests.

When to Use This Technique

Choosing the right budget strategy depends on query topology and the degree of record overlap between partitions. The diagram below maps that decision:

Budget composition decision flow: disjoint partitions allow parallel composition (ε = max); overlapping queries require sequential accumulation (ε = Σ). A shared ledger enforces hard caps before every release.

For coordinate-level noise injection alternatives — such as direct point displacement — the choice between Laplace and Gaussian mechanisms is covered in Laplace & Gaussian noise for coordinate data. For partitioning strategies that complement budget allocation, see grid aggregation and spatial binning.

Algorithmic Specification

Composition theorems

Let $$\mathcal{M}_1, \dots, \mathcal{M}_k$$ be DP mechanisms with individual privacy parameters $$(\varepsilon_i, \delta_i)$$.

Sequential composition (overlapping or unverified partitions):

$$\varepsilon_{\text{total}} = \sum_{i=1}^{k} \varepsilon_i, \qquad \delta_{\text{total}} = \sum_{i=1}^{k} \delta_i$$

Parallel composition (strictly disjoint record sets):

$$\varepsilon_{\text{total}} = \max_i \varepsilon_i, \qquad \delta_{\text{total}} = \max_i \delta_i$$

Advanced composition ($$k$$ queries, each satisfying $$(\varepsilon, \delta)$$-DP, with an additional slack parameter $$\delta'$$):

$$\varepsilon_{\text{adv}} = \varepsilon \sqrt{2k \ln(1/\delta')} + k\varepsilon(e^{\varepsilon} - 1), \qquad \delta_{\text{total}} = k\delta + \delta'$$

Advanced composition is valuable for large-scale iterative mapping workloads where strict sequential accounting would prematurely exhaust the budget.
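As a rough illustration of how the three bounds compare, the sketch below evaluates them for identical queries. The function names are hypothetical (they are not taken from any DP library), every query is assumed to use the same ε, and `delta_slack` stands in for δ′.

```python
import math


def sequential_epsilon(epsilons):
    """Basic sequential composition: privacy losses add up."""
    return sum(epsilons)


def parallel_epsilon(epsilons):
    """Parallel composition over strictly disjoint record sets: take the max."""
    return max(epsilons)


def advanced_epsilon(epsilon, k, delta_slack):
    """Advanced composition bound for k queries, each at the same epsilon.

    delta_slack plays the role of delta' in the formula above (assumed name).
    """
    sqrt_term = epsilon * math.sqrt(2 * k * math.log(1 / delta_slack))
    linear_term = k * epsilon * (math.exp(epsilon) - 1)
    return sqrt_term + linear_term


# Compare the basic and advanced bounds at epsilon = 0.1 per query.
basic_10 = sequential_epsilon([0.1] * 10)
advanced_10 = advanced_epsilon(0.1, 10, 1e-5)
basic_100 = sequential_epsilon([0.1] * 100)
advanced_100 = advanced_epsilon(0.1, 100, 1e-5)
```

With ε = 0.1 and δ′ = 10⁻⁵, the advanced bound is larger than kε at k = 10 and smaller at k = 100, so a ledger can reasonably charge whichever of the two bounds is lower.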

Parameter reference table

Parameter	Typical range	Spatial meaning
$$\varepsilon$$ per query	0.1 – 1.0	Privacy loss per spatial measurement; lower = stronger privacy
$$\delta$$	$$10^{-5}$$ – $$1/N$$	Catastrophic failure probability; must decrease as dataset grows
Global sensitivity $$\Delta f$$	1 (count), user-bounded (sum)	Maximum change in query result when one record is added/removed
Laplace scale $$b$$	$$\Delta f / \varepsilon$$	Scale of the injected Laplace noise; its standard deviation is $$b\sqrt{2}$$
H3 resolution	7 – 9 (city scale)	Grid cell size; resolution 8 ≈ 0.74 km² per cell

Noise mechanisms

For count queries with global sensitivity $$\Delta f = 1$$, the Laplace mechanism injects noise $$\eta \sim \text{Lap}(0, 1/\varepsilon)$$:

$$\tilde{c} = c + \text{Lap}\!\left(0,\, \frac{1}{\varepsilon}\right)$$

For sum or mean queries, clip each record’s contribution to a domain-derived bound $$C$$ before summing, making $$\Delta f = C$$ (written here for non-negative values $$x_i$$):

$$\tilde{s} = \sum_{i} \min(x_i, C) + \text{Lap}\!\left(0,\, \frac{C}{\varepsilon}\right)$$

Setting $$C$$ too high inflates noise; setting it too low introduces systematic downward bias in dense areas. Empirically sweep $$C$$ over a holdout partition and select the value that minimises RMSE.
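One way to make the sweep concrete is to score each candidate bound by its clipping bias together with the Laplace noise variance, which is 2(C/ε)² for scale C/ε. The sketch below does this analytically for a sum over non-negative values. The names `holdout_values`, `candidate_bounds`, and both functions are illustrative, and the sample dwell times are made up.

```python
def clipped_sum_rmse(values, clip_bound, epsilon):
    """Expected RMSE of a Laplace-noised clipped sum (non-negative values).

    The bias is the total mass removed by clipping; the Laplace variance for
    scale b = clip_bound / epsilon is 2 * b**2.
    """
    bias = sum(v - min(v, clip_bound) for v in values)
    noise_variance = 2.0 * (clip_bound / epsilon) ** 2
    return (bias ** 2 + noise_variance) ** 0.5


def sweep_clip_bound(values, candidate_bounds, epsilon):
    """Return the candidate bound with the lowest expected RMSE."""
    return min(candidate_bounds, key=lambda c: clipped_sum_rmse(values, c, epsilon))


holdout_values = [120, 300, 450, 600, 900, 5400]  # hypothetical dwell times (s)
candidate_bounds = [300, 900, 3600, 7200]
best_bound = sweep_clip_bound(holdout_values, candidate_bounds, epsilon=0.5)
```

On this toy input the sweep settles on a mid-range bound: the smallest candidate loses too much mass to clipping, while the largest pays for noise it does not need.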

Prerequisites & Data Requirements

Before implementing budget allocation logic, confirm the following baseline:

CRS / projection: All coordinates in a consistent projection: EPSG:4326 (WGS84) for storage, reprojected to a metric CRS before distance-sensitive operations. Prefer an appropriate UTM zone or national grid; EPSG:3857 is expressed in metres but distorts distances increasingly away from the equator.
Minimum dataset size: $$N \geq 1{,}000$$ records per spatial partition; smaller partitions amplify relative noise to the point of utility collapse.
Column schema: user_id (or anonymous token), geometry (Point, WGS84), timestamp, h3_cell_id (pre-computed at the target resolution).
Python dependencies: geopandas >= 0.14, shapely >= 2.0, numpy >= 1.26, h3 >= 3.7 (note that h3 4.x renamed geo_to_h3 to latlng_to_cell), opendp >= 0.9 (or equivalent DP framework).
Validation harness: A holdout spatial partition reserved for comparing raw vs. anonymised aggregates before any public release.
CI/CD budget gate: Automated enforcement that rejects query plans exceeding remaining $$\varepsilon$$ — manual tracking alone is insufficient once multiple services query the same dataset.
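A budget gate of this kind can be as small as a function that a CI job calls on the proposed query plan. The sketch below shows one hypothetical shape for it. The plan format (a list of dicts with `id` and `epsilon` keys) and the function name are assumptions, and the plan is charged under sequential composition.

```python
def check_query_plan(plan, remaining_epsilon):
    """Raise ValueError if the plan's summed epsilon exceeds the remaining budget.

    `plan` is assumed to be a list of {"id": str, "epsilon": float} dicts.
    Sequential composition is used, so the per-query costs simply add up.
    """
    for query in plan:
        if query["epsilon"] <= 0:
            raise ValueError(f"Query {query['id']!r} has non-positive epsilon")
    requested = sum(query["epsilon"] for query in plan)
    if requested > remaining_epsilon:
        raise ValueError(
            f"Plan requests epsilon {requested:.4f}, "
            f"but only {remaining_epsilon:.4f} remains"
        )
    return requested


plan = [
    {"id": "density_res8", "epsilon": 0.3},  # hypothetical query names
    {"id": "density_res7", "epsilon": 0.2},
]
approved_cost = check_query_plan(plan, remaining_epsilon=1.0)
```

Raising an exception, rather than returning a flag, lets the CI step fail without any extra wiring.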

Teams new to the broader threat landscape should review re-identification risk assessment for geospatial datasets to understand the adversarial context that drives these thresholds.

Step-by-Step Implementation

Step 1 — Define the study area and grid

Overlay a deterministic H3 hexagonal grid across the study area (EPSG:4326). Hexagonal grids are preferred over arbitrary administrative boundaries because they enforce uniform cell sizes and predictable adjacency relationships, making spatial linkage attack vectors easier to reason about.

import geopandas as gpd
import h3
import numpy as np

def assign_h3_cell(lon: float, lat: float, resolution: int = 8) -> str:
    """Return the H3 cell ID at the given resolution (WGS84 input)."""
    # Both API generations expect (lat, lon) order.
    if hasattr(h3, "latlng_to_cell"):  # h3 >= 4.0
        return h3.latlng_to_cell(lat, lon, resolution)
    return h3.geo_to_h3(lat, lon, resolution)  # h3 3.x

# Apply to the full GeoDataFrame (coordinates in EPSG:4326)
gdf = gpd.read_file("transit_stops.gpkg").to_crs("EPSG:4326")
gdf["h3_cell_id"] = gdf.apply(
    lambda row: assign_h3_cell(row.geometry.x, row.geometry.y, resolution=8),
    axis=1,
)

At H3 resolution 8, each hexagonal cell covers approximately 0.74 km² — coarse enough to hold hundreds of records in an urban core, but granular enough to support neighbourhood-level analysis.

Step 2 — Verify partition disjointness before selecting composition

Inspect whether any record appears in multiple query partitions. If queries align exactly with the H3 grid and each record is assigned to exactly one cell (using H3’s deterministic point-in-cell logic), parallel composition is valid. If queries cross-cut those cells (e.g. a watershed boundary spanning many H3 cells, some shared), fall back to sequential composition for the overlapping subset.

def check_disjoint_partitions(gdf: gpd.GeoDataFrame, partition_col: str) -> bool:
    """Return True only if every user_id appears in exactly one partition value."""
    # Count the distinct partitions each individual contributes to. A user who
    # shows up in two or more partitions breaks the disjointness that parallel
    # composition relies on.
    partitions_per_user = gdf.groupby("user_id")[partition_col].nunique()
    return bool((partitions_per_user <= 1).all())

is_disjoint = check_disjoint_partitions(gdf, "h3_cell_id")
composition = "parallel" if is_disjoint else "sequential"
print(f"Composition strategy: {composition}")

Misapplying parallel composition to overlapping administrative zones — for example, census tracts intersecting school districts — is a common source of privacy leakage that is invisible until an adversary performs an auxiliary join.

Step 3 — Initialise the budget ledger

from typing import Dict, List

class SpatialBudgetAllocator:
    """Tracks and enforces differential privacy budget for spatial query pipelines.

    Accounting uses the sequential composition bound by default. Callers must
    explicitly opt into parallel composition after verifying partition
    disjointness (Step 2).
    """

    def __init__(
        self,
        total_epsilon: float,
        delta: float = 1e-5,
        composition: str = "sequential",
    ) -> None:
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        if composition not in ("sequential", "parallel"):
            raise ValueError(f"Unknown composition strategy: {composition!r}")
        self.total_epsilon = total_epsilon
        self.delta = delta
        self.composition = composition
        self.consumed_epsilon: float = 0.0
        self.query_log: List[Dict] = []

    def reserve(self, query_id: str, epsilon: float) -> None:
        """Reserve epsilon for a named spatial query; raise if budget is exceeded."""
        if epsilon <= 0:
            raise ValueError(f"Query epsilon must be positive, got {epsilon}")
        if self.composition == "parallel":
            # Disjoint partitions: cumulative loss is the largest single epsilon.
            projected = max(self.consumed_epsilon, epsilon)
        else:
            # Overlapping or unverified partitions: losses add up.
            projected = self.consumed_epsilon + epsilon
        if projected > self.total_epsilon:
            raise ValueError(
                f"Budget exceeded: query {query_id!r} would bring cumulative "
                f"epsilon to {projected:.4f} > total {self.total_epsilon:.4f} "
                f"(remaining {self.remaining:.4f})"
            )
        self.consumed_epsilon = projected
        self.query_log.append({"id": query_id, "epsilon": epsilon})

    @property
    def remaining(self) -> float:
        return self.total_epsilon - self.consumed_epsilon

Step 4 — Execute differentially private count queries

def dp_count(
    gdf: gpd.GeoDataFrame,
    cell_id: str,
    epsilon: float,
    rng: np.random.Generator,
) -> float:
    """Return a Laplace-noised count for one H3 cell.

    Global sensitivity for a count query is 1: adding or removing one record
    changes the count by at most 1, so the Laplace scale is 1/epsilon.
    Post-processing (clamping to 0) does not consume additional budget.
    """
    true_count = int((gdf["h3_cell_id"] == cell_id).sum())
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return max(0.0, true_count + noise)


def run_pipeline(
    gdf: gpd.GeoDataFrame,
    allocator: SpatialBudgetAllocator,
    cell_ids: List[str],
    epsilon_per_cell: float,
) -> Dict[str, float]:
    """Execute DP counts over a list of H3 cells with explicit budget reservation."""
    rng = np.random.default_rng()  # Seeded from fresh OS entropy; PCG64 is not a CSPRNG
    results: Dict[str, float] = {}
    for cell in cell_ids:
        allocator.reserve(query_id=f"count_{cell}", epsilon=epsilon_per_cell)
        results[cell] = dp_count(gdf, cell, epsilon_per_cell, rng)
    return results


# Example: total ε = 1.0; three H3 cells at ε = 0.2 each consume 0.6 under sequential accounting
allocator = SpatialBudgetAllocator(total_epsilon=1.0, delta=1e-5)
target_cells = ["88283082a7fffff", "88283082a3fffff", "88283082b3fffff"]
noisy_counts = run_pipeline(gdf, allocator, target_cells, epsilon_per_cell=0.2)
print(f"Remaining ε: {allocator.remaining:.3f}")

Privacy implication of each step:

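Spatial joins and filtering must complete before noise injection. A filter that reads raw values after noise has been added can leak information; a filter that reads only the noisy outputs is post-processing and is safe.
Calling np.random.default_rng() without a fixed seed draws the seed from OS entropy, so the injected noise cannot be replayed from a known or reused seed. NumPy's default bit generator (PCG64) is not cryptographically secure, however, so production releases should prefer the hardened samplers in a dedicated DP library such as OpenDP.
max(0.0, ...) is a valid post-processing step: by the post-processing property of DP it does not cost additional budget, although clamping biases very small counts upward.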
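Where noise should come from an OS-entropy source rather than a seedable generator, the standard library's `random.SystemRandom` is one option. The sketch below samples Laplace noise as the difference of two exponential draws. The helper names `laplace_noise` and `noisy_count` are hypothetical. It is still a floating-point sampler, so it illustrates the seeding point only and is not a substitute for a vetted DP library.

```python
import random

_secure_rng = random.SystemRandom()  # backed by os.urandom; cannot be seeded


def laplace_noise(scale):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    rate = 1.0 / scale
    return _secure_rng.expovariate(rate) - _secure_rng.expovariate(rate)


def noisy_count(true_count, epsilon):
    """Laplace-noised count with sensitivity 1, clamped at zero."""
    return max(0.0, true_count + laplace_noise(1.0 / epsilon))
```

The difference of two independent exponential variables with the same rate follows a Laplace distribution centred at zero, which avoids taking the logarithm of zero in a hand-written inverse-CDF sampler.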

Step 5 — Handle sum/mean queries with sensitivity clipping

For non-count aggregates (e.g. average dwell time per cell), clip each record’s contribution before summing:

def dp_clipped_mean(
    values: np.ndarray,
    clip_bound: float,
    epsilon: float,
    rng: np.random.Generator,
) -> float:
    """Differentially private mean via clipping, a noisy sum, and a noisy count.

    Under add/remove adjacency, clipping to [-clip_bound, clip_bound] bounds
    the sensitivity of the sum to clip_bound, and the count has sensitivity 1.
    The exact record count is itself private, so epsilon is split evenly
    between the two releases (sequential composition); taking their ratio is
    post-processing. The even split is an illustrative choice, not a tuned one.
    """
    eps_sum = eps_count = epsilon / 2.0
    clipped_sum = float(np.clip(values, -clip_bound, clip_bound).sum())
    noisy_sum = clipped_sum + rng.laplace(loc=0.0, scale=clip_bound / eps_sum)
    noisy_n = len(values) + rng.laplace(loc=0.0, scale=1.0 / eps_count)
    # Guard against tiny or negative noisy counts, then keep the estimate in
    # the clipped range; both steps are post-processing.
    mean = noisy_sum / max(noisy_n, 1.0)
    return float(np.clip(mean, -clip_bound, clip_bound))

Choosing clip_bound requires domain knowledge: for transit dwell times in seconds, values above 3600 (one hour) are outliers and can be clipped without meaningful bias for the density analysis use-case. Document the clip threshold in your audit log alongside the $$\varepsilon$$ consumption record.

Validation & Re-identification Testing

Budget allocation is only meaningful when paired with rigorous empirical validation. Spatial DP introduces structured noise that can distort spatial autocorrelation, edge effects, and hotspot detection. The following tests form a minimum viable validation suite.

Aggregate error metrics

Compute RMSE and Mean Absolute Percentage Error (MAPE) between raw and anonymised counts over the holdout partition:

from sklearn.metrics import mean_squared_error
import numpy as np

def spatial_rmse(true_counts: np.ndarray, noisy_counts: np.ndarray) -> float:
    """Root mean square error between true and noisy spatial aggregates."""
    return float(np.sqrt(mean_squared_error(true_counts, noisy_counts)))

def mape(true_counts: np.ndarray, noisy_counts: np.ndarray) -> float:
    """Mean absolute percentage error; exclude zero true-count cells."""
    mask = true_counts > 0
    return float(
        np.mean(np.abs((true_counts[mask] - noisy_counts[mask]) / true_counts[mask]))
    ) * 100

An MAPE above 30% in the holdout partition typically signals that $$\varepsilon$$ is too small for the chosen grid resolution — either increase $$\varepsilon$$, coarsen the grid, or reduce the number of queries to reclaim budget.
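Before changing ε or the grid, it can help to estimate the relative error that a cell of a given size should expect. For Laplace noise the expected absolute error equals the scale, and the sketch below turns that fact into a rule of thumb. The function names are illustrative, and the estimate ignores the small effect of clamping at zero.

```python
import math


def expected_relative_error(true_count, epsilon, sensitivity=1.0):
    """Expected |noise| / count for a Laplace-noised count.

    For Laplace(0, b) the mean absolute deviation is b = sensitivity / epsilon.
    """
    if true_count <= 0:
        raise ValueError("true_count must be positive")
    return (sensitivity / epsilon) / true_count


def min_count_for_target(epsilon, target_relative_error, sensitivity=1.0):
    """Smallest cell count whose expected relative error meets the target."""
    return math.ceil((sensitivity / epsilon) / target_relative_error)


# Cells needed to stay under a 30% expected relative error at two budgets.
cells_at_eps_half = min_count_for_target(0.5, 0.30)
cells_at_eps_tenth = min_count_for_target(0.1, 0.30)
```

Comparing these minimum counts with the actual per-cell counts at each candidate H3 resolution shows quickly whether coarsening the grid or raising ε is the cheaper fix.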

Spatial autocorrelation preservation

Use Moran’s I to verify that clustering patterns survive noise injection. A significant drop in Moran’s I indicates that independent per-cell noise is swamping the spatial signal, usually because the per-query $$\varepsilon$$ is too small for the counts at the chosen resolution:

from libpysal.weights import lat2W
from esda.moran import Moran

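def moran_i(counts: np.ndarray, grid_rows: int, grid_cols: int) -> float:
    """Moran's I global spatial autocorrelation statistic for a regular grid.

    lat2W builds rook-contiguity weights for a rectangular lattice, so counts
    must be flattened in row-major order with length grid_rows * grid_cols.
    For H3 hexagons, build the weights from cell adjacency instead.
    """
    w = lat2W(grid_rows, grid_cols)
    mi = Moran(counts, w)
    return float(mi.I)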

Target: the anonymised Moran’s I should remain within ±0.15 of the raw value across 20 Monte Carlo trials.
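The Monte Carlo check can be sketched without any spatial libraries. The version below is a simplified stand-in, not the esda implementation: it uses binary rook weights on a rectangular lattice (esda's `Moran` row-standardises weights by default, so values can differ slightly), and all function names are hypothetical. The fixed seed is there only to make a validation run repeatable and must not be used for released noise.

```python
import random


def rook_neighbours(rows, cols):
    """Map each lattice index (row-major) to its rook-adjacent indices."""
    neighbours = {}
    for r in range(rows):
        for c in range(cols):
            adjacent = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    adjacent.append(rr * cols + cc)
            neighbours[r * cols + c] = adjacent
    return neighbours


def morans_i(values, neighbours):
    """Global Moran's I with binary (unstandardised) weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = sum(dev[i] * dev[j] for i, adj in neighbours.items() for j in adj)
    total_weight = sum(len(adj) for adj in neighbours.values())
    return (n / total_weight) * cross / sum(d * d for d in dev)


def fraction_within_tolerance(values, neighbours, epsilon, trials=20, tol=0.15, seed=0):
    """Share of noisy trials whose Moran's I stays within tol of the raw value."""
    rng = random.Random(seed)  # repeatable validation only, never for releases
    raw = morans_i(values, neighbours)
    rate = epsilon  # Laplace scale 1/epsilon for a sensitivity-1 count
    hits = 0
    for _ in range(trials):
        noisy = [v + rng.expovariate(rate) - rng.expovariate(rate) for v in values]
        if abs(morans_i(noisy, neighbours) - raw) <= tol:
            hits += 1
    return hits / trials


# Hypothetical 4 x 4 checkerboard of counts, flattened row-major.
grid = [1000 * ((r + c) % 2) for r in range(4) for c in range(4)]
lattice = rook_neighbours(4, 4)
share_ok = fraction_within_tolerance(grid, lattice, epsilon=1.0)
```

A share below 1.0 across the 20 trials is the signal to revisit the per-cell ε or the grid resolution.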

Re-identification testing via auxiliary join simulation

Before publishing noisy counts, simulate a linkage attack: join the anonymised cell-level aggregates to a publicly available auxiliary dataset (e.g., census block populations) and check whether any cell’s residual information allows inferences about individuals. This test mirrors the re-identification risk assessment methodology adapted to DP outputs. Cells with $$\tilde{c} < 5$$ after noise injection should be suppressed regardless of their true count, so that the suppression decision depends only on the noisy value.
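The suppression rule can be applied as a pure post-processing step on the values about to be released. A minimal sketch, assuming noisy counts are held in a dict keyed by H3 cell ID and that suppressed cells are reported as `None` (both are illustrative choices):

```python
def suppress_small_cells(noisy_counts, threshold=5.0):
    """Replace noisy counts below the threshold with None.

    The decision reads only the noisy values, so it is post-processing and
    consumes no additional privacy budget.
    """
    return {
        cell: (value if value >= threshold else None)
        for cell, value in noisy_counts.items()
    }


released = suppress_small_cells(
    {
        "88283082a7fffff": 41.7,  # made-up noisy counts for illustration
        "88283082a3fffff": 3.2,
        "88283082b3fffff": 5.0,
    }
)
```

Keeping suppressed cells in the output as explicit gaps, rather than dropping them, lets downstream analysts tell "suppressed" apart from "not queried".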

The accuracy vs. utility tradeoffs in geospatial DP analysis provides a full quantitative framework for choosing the $$\varepsilon$$ value that maximises spatial pattern retention subject to a privacy constraint.

Common Failure Modes & Gotchas

Projection mismatch before sensitivity calibration. Sensitivity calculations that assume metric distances (e.g. Euclidean distance in metres) will be wrong if coordinates remain in EPSG:4326 (degrees). Always reproject to a metric CRS before computing sensitivity bounds for distance-based queries.

Misapplied parallel composition. Administrative polygons (city limits, health service areas, school districts) routinely overlap. Never assume parallel composition is safe for data that was spatially joined to such polygons — verify partition disjointness programmatically (Step 2) every time the dataset or query topology changes.

Filtering inside the measurement chain. Applying a filter such as WHERE population_density > threshold that reads raw values after noise injection but before reporting changes the effective query and can leak the true count. Apply every predicate that depends on raw data before the DP mechanism runs; filters that consult only the noisy outputs are post-processing and cost no budget.

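Sparse-data utility collapse. The expected absolute error of a Laplace-noised count is $$1/\varepsilon$$, so relative error grows quickly in sparse cells: at $$\varepsilon = 0.1$$ a cell with 20 records already has an expected relative error of 50%, while at $$\varepsilon = 1.0$$ the same is true of cells with only about 2 records. Implement a minimum-cell-count suppression threshold (applied to the noisy count) and document it in the release metadata so downstream analysts understand the effective spatial resolution.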

δ accounting under Gaussian mechanisms. Switching from the Laplace to the Gaussian mechanism improves utility for high-dimensional queries but introduces a $$\delta$$ term that must be tracked separately. Regulatory frameworks often require explicit $$\delta$$ disclosure — ensure your audit log captures both $$\varepsilon$$ and $$\delta$$ per query.
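One way to keep both parameters in the audit trail is to log them together for every query. The sketch below uses a hypothetical `LedgerEntry` record and an `AuditLog` that sums both parameters under basic sequential composition. The field names and example query IDs are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class LedgerEntry:
    query_id: str
    mechanism: str  # e.g. "laplace" or "gaussian"
    epsilon: float
    delta: float = 0.0  # pure-DP mechanisms such as Laplace spend no delta


@dataclass
class AuditLog:
    entries: List[LedgerEntry] = field(default_factory=list)

    def record(self, entry: LedgerEntry) -> None:
        self.entries.append(entry)

    def totals(self) -> Tuple[float, float]:
        """Return (epsilon, delta) under basic sequential composition."""
        return (
            sum(e.epsilon for e in self.entries),
            sum(e.delta for e in self.entries),
        )


log = AuditLog()
log.record(LedgerEntry("count_res8", "laplace", epsilon=0.2))
log.record(LedgerEntry("mean_dwell", "gaussian", epsilon=0.3, delta=1e-6))
total_epsilon, total_delta = log.totals()
```

Recording the mechanism name next to the parameters makes it straightforward to report δ only for the queries that actually spend it.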

Budget leakage from repeated identical queries. Without query result caching, an analyst who issues the same spatial predicate twice consumes $$2\varepsilon$$ for a logically identical release. Cache noisy results keyed on the spatial predicate and $$\varepsilon$$ value; return the cached result on repeat requests.
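A cache of this kind can sit directly in front of the mechanism. The sketch below is a hypothetical wrapper: `run_query` stands in for whatever function reserves budget and returns a noisy result, and `fake_query` is a stub used only to show that the second request is served without a second charge.

```python
class CachedReleaser:
    """Return the stored noisy result when a (predicate, epsilon) pair repeats."""

    def __init__(self, run_query):
        # run_query(predicate, epsilon) is assumed to reserve budget and
        # return a noisy result; it is only called on a cache miss.
        self._run_query = run_query
        self._cache = {}

    def release(self, predicate, epsilon):
        key = (predicate, epsilon)
        if key not in self._cache:
            self._cache[key] = self._run_query(predicate, epsilon)
        return self._cache[key]


spent = []  # records each epsilon actually charged


def fake_query(predicate, epsilon):
    """Stand-in mechanism for illustration: logs the charge, returns a constant."""
    spent.append(epsilon)
    return 42.0


releaser = CachedReleaser(fake_query)
first = releaser.release("h3_cell_id = '88283082a7fffff'", 0.2)
second = releaser.release("h3_cell_id = '88283082a7fffff'", 0.2)
```

Keying on the literal predicate string only catches textually identical requests; normalising predicates before lookup would be needed to catch logically equivalent ones.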

Boundary-crossing artifacts at grid edges. When a feature (e.g. a transit corridor) crosses multiple H3 cells, assign each record to a single cell using a deterministic rule (the cell containing the centroid, or lexicographic tie-breaking on cell IDs). Inconsistent assignment inflates the effective query count and corrupts sequential composition accounting.

Compliance Alignment

Privacy budget allocation maps directly to several regulatory controls:

Regulation / Standard	Relevant clause	How budget allocation satisfies it
GDPR Art. 5(1)(c)	Data minimisation	Treating $$\varepsilon$$ as a finite resource forces minimisation of statistical queries released
GDPR Art. 25	Privacy by design	Hard budget ceilings enforced before data release, not after
CCPA § 1798.100	Right to limit use	Query-level audit log provides per-purpose accounting of data use
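HIPAA Privacy Rule § 164.514(b)(1)	De-identification (Expert Determination)	DP with $$\varepsilon \leq 1$$ and $$\delta < 1/N$$ can support the Expert Determination pathway when a qualified expert documents the analysis; it is not a Safe Harbor method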
NIST IR 8062	Privacy engineering	Budget ledger implements the “predictability” and “manageability” privacy design objectives

For public-sector and healthcare deployments, the compliance mapping for GDPR and CCPA location data page provides a full clause-by-clause mapping. Budget allocation policies should be reviewed annually alongside threat model updates — the privacy risk scoring frameworks for GIS provide a structured methodology for those reviews.

For $$\varepsilon$$ selection tuned specifically to continuous surface outputs, see setting epsilon values for spatial heatmap generation.

FAQ

When can I use parallel composition instead of sequential composition for spatial queries? Only when every query operates on a strictly disjoint set of records — meaning no individual appears in more than one spatial partition. If administrative zones overlap (e.g. census tracts intersecting school districts), default to sequential composition.

What ε value should I use for a municipal density map? Public-sector guidance typically targets $$\varepsilon \leq 1.0$$ for aggregate releases. Start with $$\varepsilon = 0.5$$ per query tier, reserve $$\delta$$ at $$1/N$$, and consult setting epsilon values for spatial heatmap generation for worked numeric examples.

How does grid resolution affect noise magnitude? Finer grids (higher H3 resolution) reduce the number of records per cell, which lowers raw counts and amplifies relative noise. Coarser grids suppress noise but blur local spatial patterns. The optimal resolution minimises the sum of squared bias and noise variance — typically validated with the RMSE metric described in the Validation section.

Does advanced composition always save budget versus basic sequential composition? No. For large $$k$$ (many queries), advanced composition yields a tighter bound proportional to $$\varepsilon\sqrt{2k \ln(1/\delta')}$$ rather than $$k\varepsilon$$. For small $$k$$ the advanced bound can exceed $$k\varepsilon$$: with $$\delta' = 10^{-5}$$, the square-root term alone drops below $$k\varepsilon$$ only once $$k > 2\ln(1/\delta') \approx 23$$. Use whichever bound is smaller, and weigh the extra $$\delta$$ overhead against the added complexity.
