Extracting metadata from OSM planet files

Metadata extraction from OpenStreetMap planet files is a foundational operation for attribution tracking, contributor analytics, and compliance validation within geospatial ETL pipelines. Unlike the spatial primitives governed by the Node-Way-Relation Data Model, OSM metadata consists of provenance fields: uid, user, timestamp, version, changeset, and the visible flag. In production environments, these fields drive QA workflows, historical diff reconciliation, and licensing automation. The extraction strategy depends on whether the source is an uncompressed .osm.xml file or the default .osm.pbf binary format; the latter requires careful handling of delta encoding and string table indexing, as documented in the broader OSM Data Fundamentals & Architecture framework.

The Protocol Buffer Binary Format (PBF) stores metadata as an optional layer. Within a PrimitiveBlock, dense nodes carry their provenance fields in a DenseInfo message, while ways, relations, and non-dense nodes each carry a per-element Info message. In DenseInfo, timestamp, changeset, uid, and the user string index (user_sid) are delta-encoded relative to the preceding node in the same group; version is stored as a plain value, and the fields of per-element Info messages are not delta-encoded. User names are resolved by index through the block's shared StringTable. When the metadata layer is absent, as in anonymized or stripped extracts, parsers such as pyosmium report uid as 0 and user as an empty string. A thorough examination of the PBF File Structure Deep Dive clarifies how StringTable indexes map to metadata arrays and why naive regex extraction fails on binary payloads.
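To make the decoding step concrete, here is a minimal sketch of how delta-coded DenseInfo arrays and a StringTable combine into per-node metadata. It assumes the DenseInfo layout from the OSM PBF specification (timestamp, changeset, uid, and user_sid delta-coded; version stored directly) and works on plain Python lists that stand in for already-parsed protobuf fields. The function name `decode_dense_info` and the sample values are hypothetical; a real pipeline would normally let pyosmium do this.

```python
from itertools import accumulate


def decode_dense_info(string_table, version, timestamp, changeset, uid,
                      user_sid, date_granularity=1000):
    """Undo DenseInfo delta coding and resolve user names.

    string_table     -- list of bytes objects (index 0 is the empty string)
    version          -- plain per-node values (not delta-coded)
    timestamp, changeset, uid, user_sid -- delta-coded per-node values
    date_granularity -- milliseconds per timestamp unit (spec default: 1000)
    """
    rows = []
    # accumulate() turns each delta sequence into running absolute values.
    decoded = zip(
        version,
        accumulate(timestamp),
        accumulate(changeset),
        accumulate(uid),
        accumulate(user_sid),
    )
    for ver, ts, cs, user_id, sid in decoded:
        rows.append({
            "version": ver,
            "timestamp_ms": ts * date_granularity,
            "changeset": cs,
            "uid": user_id,
            "user": string_table[sid].decode("utf-8"),
        })
    return rows


if __name__ == "__main__":
    # Illustrative values only: three dense nodes edited by two users.
    for row in decode_dense_info(
        string_table=[b"", b"alice", b"bob"],
        version=[1, 3, 2],
        timestamp=[1700000000, 60, -30],
        changeset=[100, 5, -2],
        uid=[10, 5, -5],
        user_sid=[1, 1, -1],
    ):
        print(row)
```

Because every delta depends on the running total before it, a single skipped or misread value shifts all later nodes in the group, which is one reason pattern matching against the raw bytes cannot recover these fields.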

For Python ETL pipelines, pyosmium (≥3.6.0) provides the most reliable streaming interface, avoiding full in-memory deserialization of multi-gigabyte planet files. The following handler demonstrates precise metadata extraction with UTC normalization and graceful fallback for anonymized elements:

```python
import csv
import sys

import osmium

CSV_HEADER = ["id", "type", "uid", "user", "timestamp", "version", "changeset", "visible"]


class MetadataExtractor(osmium.SimpleHandler):
    """Stream provenance metadata from an OSM PBF/XML extract into a CSV file.

    pyosmium resolves PBF delta encoding internally, so the handler simply
    captures the materialised attributes for each primitive. Use
    ``apply_file`` with ``locations=False`` to skip coordinate resolution;
    we only need metadata here.
    """

    def __init__(self, csv_file, batch_size: int = 100_000):
        super().__init__()
        self.writer = csv.writer(csv_file)
        self.writer.writerow(CSV_HEADER)
        self._file = csv_file
        self.batch_size = batch_size
        self.buffer: list[list] = []
        self._processed = 0

    def _flush_buffer(self):
        """Write buffered rows to disk and empty the buffer."""
        if not self.buffer:
            return
        self.writer.writerows(self.buffer)
        self._file.flush()
        self.buffer.clear()

    def _extract_meta(self, obj_type: str, obj_id: int, obj) -> None:
        # Copy the values out immediately: pyosmium objects are only valid
        # for the duration of the callback.
        ts = obj.timestamp  # pyosmium exposes timestamps as UTC datetimes
        uid = obj.uid
        # uid == 0 marks anonymous or stripped metadata.
        user = obj.user if uid != 0 else "anonymous"
        self.buffer.append([
            obj_id,
            obj_type,
            uid,
            user,
            ts.isoformat() if ts is not None else "",
            obj.version,
            obj.changeset,
            obj.visible,
        ])
        self._processed += 1
        if len(self.buffer) >= self.batch_size:
            self._flush_buffer()

    def node(self, n):
        self._extract_meta("node", n.id, n)

    def way(self, w):
        self._extract_meta("way", w.id, w)

    def relation(self, r):
        self._extract_meta("relation", r.id, r)


if __name__ == "__main__":
    # Requires: pyosmium>=3.6.0, Python 3.10+
    # Usage: python extract_osm_meta.py planet-latest.osm.pbf osm_metadata.csv
    if len(sys.argv) != 3:
        sys.exit("Usage: python extract_osm_meta.py <input.osm.pbf> <output.csv>")
    input_pbf, output_csv = sys.argv[1], sys.argv[2]
    with open(output_csv, "w", encoding="utf-8", newline="") as fh:
        handler = MetadataExtractor(fh, batch_size=150_000)
        # locations=False skips coordinate resolution, so no node location
        # index is built; metadata extraction does not need one.
        handler.apply_file(input_pbf, locations=False)
        # Write any rows still buffered after the last full batch.
        handler._flush_buffer()
    print(f"Extraction complete. Processed {handler._processed} primitives.")
```

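With locations=False (which is also the default for apply_file), the parser skips coordinate resolution and builds no node location index, so memory use is dominated by the parser's own buffers and the CSV batch; the original estimate is a peak of approximately 1–2 GB for a standard 70 GB planet file. The visible attribute is only meaningful in history (.osh.pbf) files, where deleted versions are stored with visible set to False; in regular planet files all elements are visible by definition, so the field is always True.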

Debugging metadata extraction failures requires systematic validation. When pyosmium encounters corrupted Blob payloads or truncated PrimitiveBlocks, it raises RuntimeError during parsing. Practical checks include:

  1. Pre-validating PBF integrity: Run osmium fileinfo -e planet-latest.osm.pbf to verify block structure before ETL execution.
  2. Handling anonymized edits: In history files, contributor details may be redacted; uid=0 and user="" are the canonical signals. The handler above maps these to "anonymous".
  3. Monotonicity checks: Versioned history data should have monotonically increasing version integers per (id, type) pair. Gaps indicate deleted or redacted primitives, which should be flagged for audit rather than silently dropped.
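The monotonicity check can be sketched as a streaming pass over the extracted rows. This is an illustrative sketch under two assumptions: the rows are dicts with `type`, `id`, and `version` keys (for example from `csv.DictReader` over the CSV written above), and all versions of an element appear consecutively, as they do in a sorted history file. The function name `find_version_issues` is hypothetical.

```python
from itertools import groupby


def find_version_issues(rows):
    """Yield one report per (type, id) whose version sequence looks suspicious.

    Assumes rows for the same element are adjacent, so groupby() can stream
    through the input without loading it into memory.
    """
    grouped = groupby(rows, key=lambda r: (r["type"], r["id"]))
    for (obj_type, obj_id), group in grouped:
        versions = [int(r["version"]) for r in group]
        # Versions must strictly increase in file order.
        out_of_order = any(b <= a for a, b in zip(versions, versions[1:]))
        # Any version between 1 and the highest one seen should be present.
        missing = sorted(set(range(1, max(versions) + 1)) - set(versions))
        if out_of_order or missing:
            yield {
                "type": obj_type,
                "id": obj_id,
                "versions": versions,
                "missing": missing,
                "out_of_order": out_of_order,
            }
```

A typical call would be `find_version_issues(csv.DictReader(open("osm_metadata.csv", encoding="utf-8", newline="")))`. Note that this treats every gap starting from version 1 as worth reporting, so time-limited history extracts that legitimately begin at a later version will also be flagged; the reports are meant for audit, not for automatic rejection.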

Tag taxonomy and key-value standards further complicate metadata attribution. Contributors frequently apply source=*, attribution=*, or license=* tags that must be cross-referenced against the OSM API’s /api/0.6/changeset/{id} endpoint to verify contributor consent and licensing alignment.
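As a rough sketch of that cross-reference, the snippet below fetches one changeset from the public API and compares its tags with the provenance tags found on an element. The base URL is the public OSM API; the helper names, the User-Agent string, and the idea of flagging differing `source`, `attribution`, or `license` values are assumptions made for illustration, and a mismatch is only a prompt for manual review, not evidence about consent.

```python
import urllib.request
import xml.etree.ElementTree as ET

API_BASE = "https://api.openstreetmap.org/api/0.6"
PROVENANCE_KEYS = ("source", "attribution", "license")


def parse_changeset(xml_payload):
    """Extract id, contributor, and tags from a changeset XML response."""
    root = ET.fromstring(xml_payload)
    cs = root.find("changeset")
    if cs is None:
        raise ValueError("response contains no <changeset> element")
    return {
        "id": int(cs.get("id")),
        "uid": int(cs.get("uid", 0)),
        "user": cs.get("user", ""),
        "created_at": cs.get("created_at"),
        "tags": {t.get("k"): t.get("v") for t in cs.findall("tag")},
    }


def fetch_changeset(changeset_id, timeout=30):
    """Download and parse a single changeset (one HTTP request per call)."""
    request = urllib.request.Request(
        f"{API_BASE}/changeset/{changeset_id}",
        # Hypothetical client name; identify your own tool here.
        headers={"User-Agent": "osm-metadata-audit-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_changeset(response.read())


def provenance_mismatches(element_tags, changeset_tags):
    """Return provenance keys whose element value differs from the changeset."""
    return {
        key: (element_tags[key], changeset_tags.get(key))
        for key in PROVENANCE_KEYS
        if key in element_tags and element_tags[key] != changeset_tags.get(key)
    }
```

One request per changeset does not scale to planet-sized runs, so in practice the distinct changeset ids from the extracted CSV would need to be deduplicated, cached, and fetched slowly enough to stay within the API's usage policy.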

For large-scale deployments, switching the output from CSV to Apache Parquet via pyarrow (≥14.0.0) can reduce storage by roughly 60–70% and enables efficient columnar queries, since analytics jobs read only the columns they need. The official pyosmium documentation details advanced handler patterns for metadata harvesting, while the OSM PBF Format Specification provides authoritative byte-level reference tables for custom parser development.