How to decode OSM PBF headers in Python

The Protocol Buffer Binary Format (PBF) is the standard for distributing OpenStreetMap extracts. Its compact serialization and deterministic parsing make it the right choice for production ETL pipelines. Correctly parsing the initial header blob is a prerequisite for downstream validation, replication tracking, and licensing compliance. Unlike OSM XML, PBF headers are tightly packed Protocol Buffers that require precise binary framing, stream-safe decompression, and schema resolution before any data blobs can be read.

Protocol Buffer Compilation & Build Configuration

Direct decoding requires the official OSM schema definitions from the OSM PBF Format specification. Compile the canonical fileformat.proto and osmformat.proto files with protoc 3.21.12 or later:

bash
protoc --python_out=. --proto_path=./proto \
    ./proto/fileformat.proto ./proto/osmformat.proto

This generates fileformat_pb2.py and osmformat_pb2.py. Vendor these alongside your pipeline. Ensure your runtime uses protobuf>=4.21.0 for compatibility with the generated google.protobuf API. For the schema files themselves, use the copies from the OSM GitHub repository.

Binary Framing & Size Validation

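A valid PBF file begins with a 4-byte big-endian (network byte order) integer declaring the length of the first BlobHeader. That BlobHeader is a Protocol Buffer message, followed immediately by the Blob whose serialized size is reported in the header’s datasize field. The OSM PBF specification requires a BlobHeader to be smaller than 64 KiB and the uncompressed Blob payload to be smaller than 32 MiB; a value beyond either limit indicates a corrupted stream or a misaligned file pointer. The decoder below also applies the 32 MiB ceiling to datasize as a sanity check before reading the Blob into memory.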

python
import struct
import zlib
import logging
import fileformat_pb2
import osmformat_pb2

logger = logging.getLogger(__name__)

MAX_BLOB_HEADER_SIZE = 64 * 1024          # spec ceiling: 64 KiB
MAX_BLOB_PAYLOAD_SIZE = 32 * 1024 * 1024  # spec ceiling: 32 MiB


def decode_pbf_header(filepath: str) -> osmformat_pb2.HeaderBlock:
    with open(filepath, "rb") as f:
        # First four bytes: BlobHeader length as big-endian uint32.
        prefix = f.read(4)
        if len(prefix) != 4:
            raise ValueError("File too short to contain a BlobHeader length prefix")
        header_len = struct.unpack(">I", prefix)[0]
        if header_len > MAX_BLOB_HEADER_SIZE:
            raise ValueError(f"BlobHeader length {header_len} exceeds 64 KiB threshold")

        header_data = f.read(header_len)
        if len(header_data) != header_len:
            raise ValueError("Truncated BlobHeader")

        blob_header = fileformat_pb2.BlobHeader()
        blob_header.ParseFromString(header_data)

        if blob_header.type != "OSMHeader":
            raise ValueError(
                f"Expected BlobHeader type 'OSMHeader', got '{blob_header.type}'"
            )
        if blob_header.datasize > MAX_BLOB_PAYLOAD_SIZE:
            raise ValueError(
                f"Blob datasize {blob_header.datasize} exceeds 32 MiB threshold"
            )

        blob_data = f.read(blob_header.datasize)
        if len(blob_data) != blob_header.datasize:
            raise ValueError("Truncated Blob payload detected")

        return _decompress_blob(blob_data)


def _decompress_blob(raw_blob: bytes) -> osmformat_pb2.HeaderBlock:
    blob = fileformat_pb2.Blob()
    blob.ParseFromString(raw_blob)

    # This decoder handles the raw, zlib_data, and lzma_data payload fields.
    # zlib_data is by far the most common in practice; newer schema revisions
    # also list lz4_data and zstd_data, which need third-party codecs.
    if blob.HasField("zlib_data"):
        decompressed = zlib.decompress(blob.zlib_data)
    elif blob.HasField("raw"):
        decompressed = blob.raw
    elif blob.HasField("lzma_data"):
        import lzma
        decompressed = lzma.decompress(blob.lzma_data)
    else:
        raise ValueError("Blob contains no recognized compression or raw payload")

    header_block = osmformat_pb2.HeaderBlock()
    header_block.ParseFromString(decompressed)
    return header_block
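The zlib.decompress call above inflates whatever the Blob contains, so a corrupt or hostile file can expand well beyond the spec ceiling. The following is a minimal sketch of a bounded alternative. It assumes the 32 MiB limit applies to the uncompressed payload and that the caller passes the Blob's raw_size when that field is set; bounded_inflate is a hypothetical helper name, not part of any library.

```python
import zlib

MAX_UNCOMPRESSED = 32 * 1024 * 1024  # assumed ceiling on the inflated payload


def bounded_inflate(data: bytes, expected_size=None, limit=MAX_UNCOMPRESSED) -> bytes:
    """Inflate zlib data, refusing to produce more than `limit` bytes."""
    inflater = zlib.decompressobj()
    # Ask for one byte more than the limit so an overrun is detectable.
    out = inflater.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError(f"Inflated payload exceeds {limit} bytes")
    if not inflater.eof:
        raise ValueError("Truncated zlib stream")
    if expected_size is not None and len(out) != expected_size:
        raise ValueError(
            f"Inflated size {len(out)} does not match declared size {expected_size}"
        )
    return out
```

Inside _decompress_blob, this could stand in for zlib.decompress(blob.zlib_data), with blob.raw_size passed as expected_size when blob.HasField("raw_size") is true.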

Decompression & Schema Resolution

Once the Blob is decompressed, the HeaderBlock exposes metadata that dictates how downstream parsers must handle the data. The required_features and optional_features repeated string fields declare generator capabilities such as OsmSchema-V0.6 or DenseNodes. Parsers must validate these before initializing data structures; dense node encoding, for example, alters how coordinate delta chains are accumulated.
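One way to enforce this check before any data blob is read is sketched below. The supported set is an assumption about a hypothetical parser, and both function names are illustrative; substitute the features your own reader actually implements.

```python
# Assumption: the features this hypothetical parser implements.
SUPPORTED_FEATURES = frozenset({"OsmSchema-V0.6", "DenseNodes"})


def unsupported_features(required_features, supported=SUPPORTED_FEATURES):
    """Return the required features the parser does not implement, in file order."""
    return [name for name in required_features if name not in supported]


def check_required_features(required_features, supported=SUPPORTED_FEATURES):
    """Raise ValueError if the file requires a feature the parser cannot honor."""
    missing = unsupported_features(required_features, supported)
    if missing:
        raise ValueError(f"Unsupported required features: {', '.join(missing)}")
```

Calling check_required_features(header_block.required_features) right after decoding stops the pipeline early. Unknown entries in optional_features can be ignored safely, so they need no equivalent check.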

The HeaderBlock also contains a bbox sub-message with left, right, top, and bottom fields expressed in nanodegrees (integer units of 10⁻⁹ degrees). Divide by 1,000,000,000 to get decimal degrees in WGS 84:

python
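def extract_bounding_box(header_block: osmformat_pb2.HeaderBlock) -> dict:
    """Return bbox in decimal degrees (EPSG:4326) from the HeaderBlock."""
    # bbox is optional in the schema; an unset bbox would otherwise read as all zeros.
    if not header_block.HasField("bbox"):
        raise ValueError("HeaderBlock does not declare a bbox")
    NANO = 1e-9  # nanodegrees to degrees
    bbox = header_block.bbox
    return {
        "left":   bbox.left   * NANO,
        "right":  bbox.right  * NANO,
        "top":    bbox.top    * NANO,
        "bottom": bbox.bottom * NANO,
    }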

Skipping the nanodegree conversion scales every coordinate by a factor of 10⁹, so the values fall far outside the valid longitude and latitude ranges and land nowhere near the expected region.
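A range check on the converted values is a cheap way to catch a skipped conversion. The sketch below assumes the dict shape returned by extract_bounding_box; validate_bbox_degrees and its tolerance are illustrative choices, not part of any OSM tooling.

```python
TOLERANCE = 1e-7  # absorbs floating-point rounding from the 1e-9 multiplication


def validate_bbox_degrees(bbox: dict) -> None:
    """Raise ValueError if a decimal-degree bbox is out of WGS 84 range or inverted."""
    for key in ("left", "right"):
        if not -180.0 - TOLERANCE <= bbox[key] <= 180.0 + TOLERANCE:
            raise ValueError(f"bbox {key}={bbox[key]} is not a valid longitude")
    for key in ("top", "bottom"):
        if not -90.0 - TOLERANCE <= bbox[key] <= 90.0 + TOLERANCE:
            raise ValueError(f"bbox {key}={bbox[key]} is not a valid latitude")
    if bbox["bottom"] > bbox["top"]:
        raise ValueError("bbox bottom lies north of bbox top")
    # left > right is not rejected: it may describe a box crossing the antimeridian.
```

Raw nanodegree values fail the first check immediately, which turns a silent scaling error into an explicit exception.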

The writingprogram and source string fields provide provenance metadata useful for ODbL attribution tracking and replication state management. The osmosis_replication_sequence_number and osmosis_replication_timestamp fields anchor the file within the OSM replication stream; cross-reference them against the replication state files published under https://planet.openstreetmap.org/replication/ to confirm that differential updates resume from the correct sequence number and are not applied twice.
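The two replication fields are plain integers, so a little conversion makes them easier to compare against the server. The sketch below rests on two assumptions: that the timestamp is expressed in seconds since the Unix epoch, and that the replication server stores state files in a nine-digit, three-level directory layout. Both helper names are hypothetical.

```python
from datetime import datetime, timezone


def replication_time(timestamp: int) -> datetime:
    """Convert a replication timestamp (seconds since the epoch) to an aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def state_file_path(sequence_number: int) -> str:
    """Build the relative state-file path for a sequence number.

    Assumption: the server zero-pads the number to nine digits and splits it
    into three directory levels, e.g. 6123456 -> '006/123/456.state.txt'.
    """
    if not 0 <= sequence_number <= 999_999_999:
        raise ValueError(f"Sequence number {sequence_number} does not fit nine digits")
    digits = f"{sequence_number:09d}"
    return f"{digits[0:3]}/{digits[3:6]}/{digits[6:9]}.state.txt"
```

Appending the result to the replication base URL for the relevant interval (minute, hour, or day) gives the state file to compare against the header's timestamp.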

Production Pipeline Integration

In high-throughput ETL architectures, header decoding should be an isolated pre-flight validation step. Parsing the HeaderBlock before streaming data blobs enables early detection of incompatible feature flags, missing replication sequence numbers, or generator-specific extensions your parser does not support.

For spatial indexing pipelines, the header’s bounding box and feature flags determine whether to initialize an R-tree, quadtree, or H3 grid before reading data. When fetching from cloud storage, use HTTP range requests to read only the first few kilobytes:

python
import os
import tempfile
import urllib.request

# Assumed large enough for the length prefix, BlobHeader, and OSMHeader Blob.
HEADER_FETCH_BYTES = 64 * 1024


def fetch_pbf_header_from_url(url: str) -> osmformat_pb2.HeaderBlock:
    """Fetch only enough bytes to decode the OSMHeader blob."""
    req = urllib.request.Request(
        url, headers={"Range": f"bytes=0-{HEADER_FETCH_BYTES - 1}"}
    )
    with urllib.request.urlopen(req) as resp:
        # Cap the read: a server that ignores Range replies with the full file.
        chunk = resp.read(HEADER_FETCH_BYTES)
    # decode_pbf_header takes a path, so stage the bytes in a temporary file.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pbf") as tmp:
        tmp.write(chunk)
        tmp_path = tmp.name
    try:
        return decode_pbf_header(tmp_path)
    finally:
        os.unlink(tmp_path)

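This replaces a full download with a single request of at most 64 KiB, which is valuable in serverless ETL environments where you need to validate a file before committing to a full parse. If the header blob is larger than the fetched range, decode_pbf_header raises a truncation error and the caller can retry with a wider range.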

For authoritative binary framing references, consult the Python struct documentation and the OSM Wiki PBF Format specification.