Understanding OSM multipolygon relations for GIS Jump to heading

OpenStreetMap encodes complex areal features through multipolygon relations, a structural primitive that aggregates multiple linear way members into a single topological entity. Within the broader OSM Data Fundamentals & Architecture framework, multipolygons resolve geometric limitations inherent to simple closed loops, enabling the representation of holes, disjoint components, and nested administrative boundaries. Unlike standalone polygons, a multipolygon relation explicitly assigns each member a role attribute (outer or inner), which dictates area accumulation and subtraction during geometry reconstruction. Correct interpretation is foundational for downstream GIS workflows, spatial indexing pipelines, and automated quality assurance systems, particularly when processing continental-scale .osm.pbf extracts.

Topological Constraints and Tag Taxonomy Jump to heading

flowchart TD
    R["Relation<br/>type = multipolygon"]
    R --> O1["Way (role=outer)<br/>exterior ring A"]
    R --> O2["Way (role=outer)<br/>exterior ring B"]
    R --> I1["Way (role=inner)<br/>hole inside A"]
    R --> I2["Way (role=inner)<br/>hole inside A"]
    R --> I3["Way (role=inner)<br/>hole inside B"]
    O1 -. contains .-> I1
    O1 -. contains .-> I2
    O2 -. contains .-> I3

The signed area of the assembled geometry is:

Area=oouterAo  iinnerAi\text{Area} = \sum_{o \in \text{outer}} |A_o| \; - \sum_{i \in \text{inner}} |A_i|

The Node-Way-Relation Data Model enforces strict planar graph semantics for multipolygon construction. Each outer ring contributes positive area, while inner rings subtract area from the containing outer boundary. Topological validity requires that rings remain non-self-intersecting, do not cross other rings, and share nodes only at explicit boundary intersections. OSM permits collinear segment sharing and single-node touches, but overlapping interiors or unclosed rings violate the OGC Simple Features specification and trigger geometry construction failures in standard GIS engines.

Tag inheritance follows a deterministic hierarchy: the relation’s key-value pairs are the authoritative source; member way tags apply only when the way is used standalone. Production pipelines must implement explicit key-filtering to prevent attribute bleeding from inner rings. The official Relation:multipolygon documentation outlines strict role assignment rules: inner rings must be fully contained within a single outer ring and cannot span multiple disjoint components without explicit relation splitting.

PBF Serialization and Parsing Overhead Jump to heading

The Protocolbuffer Binary Format (PBF) supersedes legacy XML by employing variable-length integer encoding, delta-compressed node coordinates, and a centralized string table. Identical string references (e.g., outer, inner, type=multipolygon) map to a single StringTable entry, reducing file size by 60–75% versus XML. However, this compression introduces sequential parsing constraints: relation members are referenced by numeric IDs that must be resolved against previously buffered node and way data.

Memory pressure spikes during multipolygon reconstruction when processing dense administrative boundaries or land-cover datasets. A two-pass architecture mitigates this: first, index member ways by relation ID using a lightweight on-disk store such as LMDB; second, reconstruct geometries using shapely.ops.polygonize or osmium’s native multipolygon handler. Streaming parsers must implement chunked relation buffering to prevent garbage collection thrashing.

Coordinate Reference Systems and Spatial Indexing Jump to heading

OSM natively stores coordinates in unprojected WGS 84 (EPSG:4326). Multipolygon relations must be projected to an equal-area or conformal CRS (e.g., EPSG:3035 for European land cover) before metric analysis. Projection should occur post-reconstruction to avoid coordinate drift during ring assembly.

R-tree implementations (e.g., libspatialindex or PostGIS GiST) outperform quadtree approaches for multipolygon queries due to their adaptive node splitting. Index construction should defer until after geometry validation to avoid storing degenerate records.

Historical Versioning and Licensing Compliance Jump to heading

OSM maintains full historical diffs via .osc.gz changesets, enabling temporal reconstruction of multipolygon evolution. Relation membership changes are atomic: adding or removing a way member increments the relation version, while coordinate updates to constituent nodes propagate independently. Historical ETL pipelines must reconcile versioned diffs using osmium-tool or pyosmium’s HistoryHandler, tracking visible flags to reconstruct past geometries accurately.

Licensing automation under the Open Database License (ODbL) requires explicit attribution tracking. Automated compliance pipelines should extract source, attribution, and license tags during ingestion, mapping them to a relational audit table. ODbL compliance mandates that derivative databases retain provenance metadata, which can be enforced via database triggers or ETL validation hooks.

Production ETL Implementation and Debugging Jump to heading

The following handler validates ring orientation, detects missing topology, and isolates self-intersections before PostGIS ingestion. It targets pyosmium>=3.6.0 and Shapely>=2.0.

python
import logging
import osmium
import shapely.geometry as geom
from shapely.geometry.polygon import orient
from shapely.validation import make_valid
from shapely.ops import transform as shp_transform
from pyproj import Transformer

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Build a reusable callable for shapely.ops.transform.
_transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)

def _project(g):
    return shp_transform(_transformer.transform, g)


class MultipolygonETL(osmium.SimpleHandler):
    """Two-pass multipolygon handler: collect nodes/ways, then assemble relations.

    Call ``apply_file(path, locations=True)`` so pyosmium resolves node
    coordinates automatically; we capture them via the node() callback.
    """

    def __init__(self):
        super().__init__()
        self.way_coords: dict[int, list[tuple[float, float]]] = {}
        self.defect_log: list[str] = []
        self.relation_count = 0

    def node(self, n):
        # pyosmium sets n.location.valid() when locations=True is passed to apply_file.
        pass  # node coordinates are resolved automatically in way() via w.nodes[i].location

    def way(self, w):
        # Collect (lon, lat) pairs for later multipolygon assembly.
        # With locations=True, each NodeRef in w.nodes has a valid .location.
        coords = []
        for nr in w.nodes:
            loc = nr.location
            if loc.valid():
                coords.append((loc.lon, loc.lat))
        if coords:
            self.way_coords[w.id] = coords

    def relation(self, r):
        if r.tags.get("type") != "multipolygon":
            return

        self.relation_count += 1
        outer_rings: list[geom.LinearRing] = []
        inner_rings: list[geom.LinearRing] = []
        missing_ways: list[int] = []

        for member in r.members:
            if member.type != "w":
                continue
            coords = self.way_coords.get(member.ref)
            if not coords:
                missing_ways.append(member.ref)
                continue
            if len(coords) < 3:
                self.defect_log.append(
                    f"Relation {r.id}: Way {member.ref} has fewer than 3 nodes"
                )
                continue

            ring = geom.LinearRing(coords)
            if member.role == "outer":
                outer_rings.append(ring)
            elif member.role == "inner":
                inner_rings.append(ring)

        if missing_ways:
            self.defect_log.append(
                f"Relation {r.id}: Missing ways {missing_ways}"
            )
            return
        if not outer_rings:
            self.defect_log.append(f"Relation {r.id}: No outer rings defined")
            return

        try:
            # For a single outer ring, combine with all inners.
            # For multiple outer rings, use shapely.ops.polygonize on the full ring set.
            if len(outer_rings) == 1:
                poly = geom.Polygon(outer_rings[0], inner_rings)
            else:
                from shapely.ops import polygonize, unary_union
                all_rings = outer_rings + inner_rings
                polys = list(polygonize(all_rings))
                poly = unary_union(polys) if polys else geom.Polygon()

            valid_poly = make_valid(poly)
            if valid_poly.is_empty:
                self.defect_log.append(f"Relation {r.id}: Empty geometry after validation")
                return

            # Project to EPSG:3857 for a metric area sanity check.
            projected = _project(valid_poly)
            if projected.area < 1.0:  # < 1 m² is degenerate for real-world features
                self.defect_log.append(
                    f"Relation {r.id}: Degenerate projected area ({projected.area:.3g} m²)"
                )
                return

            # Enforce canonical winding order (CCW outer / CW inner).
            if isinstance(valid_poly, geom.Polygon):
                valid_poly = orient(valid_poly, sign=1.0)

            self._ingest_to_postgis(r.id, valid_poly, dict(r.tags))

        except Exception as e:
            self.defect_log.append(
                f"Relation {r.id}: Geometry construction failed: {e}"
            )

    def _ingest_to_postgis(self, rel_id: int, poly, tags: dict):
        # Placeholder for production DB insertion with ST_GeomFromWKB.
        pass


if __name__ == "__main__":
    handler = MultipolygonETL()
    handler.apply_file("extract.osm.pbf", locations=True, idx="flex_mem")
    print(f"Processed {handler.relation_count} multipolygon relations")
    for defect in handler.defect_log[:20]:
        print(f"  DEFECT: {defect}")

Reproducible fixes for common defects:

  • Ring orientation errors: Force LinearRing closure and validate signed area post-projection. Negative area indicates reversed winding order, correctable via orient(poly, sign=1.0).
  • Multi-outer assembly: Use shapely.ops.polygonize rather than manually pairing outer and inner rings when a relation contains more than one outer way.
  • Memory exhaustion: For full-planet processing, replace the in-memory way_coords dict with an LMDB or SQLite cache, and switch from idx='flex_mem' to idx='sparse_file_array' with an NVMe scratch volume.
  • Tag bleeding: Apply a strict key allowlist during relation parsing to prevent inner-ring attribute contamination.

PostGIS ingestion should utilize osm2pgsql --slim --flat-nodes with --hstore to preserve multipolygon tags, followed by ST_MakeValid and ST_CollectionExtract to guarantee OGC compliance. Run ST_IsValid checks regularly to catch topology regressions introduced by community edits or diff merges.