ILS Schema Translation Patterns

This guide sits within the broader Core Architecture & Catalog Standards and covers the problem every multi-system library hits: each integrated library system (ILS), whether Sierra, Alma, Koha, or FOLIO, exposes the same bibliographic and circulation facts through a different payload shape, and nothing downstream can consume all of them directly. Library-tech staff meet this the moment a discovery layer, an analytics warehouse, or a consortial partner needs records from several ILS products, one of which exposes the title as a plain title JSON key while another buries it in MARCXML attached to an mms_id-keyed record. Schema translation is the layer that resolves that mismatch deterministically: it deserializes each vendor payload into one vendor-neutral canonical object and, for write-back, serializes that object into whatever the target expects.

This page walks the translation contract end to end: the canonical schema you enforce, the environment you need, an annotated Python implementation of the mapping registry, the compliance checkpoints for patron-adjacent fields, the quarantine patterns that keep one malformed vendor batch from poisoning the catalog, how to profile it at migration scale, and how to verify a run before it touches production. It is also the anchor for two failure-mode deep dives that live beneath it: Designing Zero-Trust Architecture for Library APIs and Implementing Circuit Breakers for ILS API Timeouts.

Specification & Contract

The contract has one rule that everything else follows from: no downstream stage ever sees a vendor payload. Each ingress adapter has exactly one job — turn a Sierra, Alma, Koha, or FOLIO record into a CanonicalBibRecord — and every later stage (validation, PII masking, serialization, upsert) operates only on that canonical object. This is the same normalization boundary the parent architecture calls the canonical envelope, and it is what lets you add a fifth ILS during a consortial migration without rewriting the transformation core.

The canonical schema is small on purpose. It captures the facts every consumer needs and nothing product-specific; vendor nuance that does not fit is folded into a typed extensions map rather than dropped. The record_id field is the join key: it becomes MARC 001 on write-back and the idempotency key for every upsert, so re-processing the same source record updates one catalog row instead of duplicating it — the same idempotency key discipline the ingestion pipelines rely on. Because that discipline traces bibliographic data back to MARC control fields, the field-level rules here stay consistent with MARC21 Field Mapping for Modern Pipelines.

The table below is the minimum translation contract most pipelines start from. Treat each row as a rule you enforce in tests, not a suggestion — an unmapped source path is a schema violation that routes to quarantine, never a silent default.

Canonical field	Sierra (REST /bibs)	Alma (Bib API)	Koha (REST)	FOLIO (Inventory)	Notes
`record_id`	`id`	`mms_id`	`biblionumber`	`hrid`	Becomes MARC `001`; the idempotency key for upsert
`title`	`title`	`245$a` (MARCXML)	`title`	`title`	Alma requires MARCXML field extraction, not a JSON key
`publication_year`	`publishYear`	`008/07-10`	`copyrightdate`	`publication[].dateOfPublication`	Normalize to a 4-digit `int`; reject partials
`subjects[]`	`subjects[]`	`6XX$a` (repeatable)	`subject`	`subjects[].value`	Preserve source order; one canonical entry per heading
`source_system`	literal `sierra`	literal `alma`	literal `koha`	literal `folio`	Provenance; becomes the audit partition key
`extensions{}`	`varFields`	local `9XX`	`koha_internal`	`notes[]`	Escape hatch for product-specific fields; typed map

Two rules govern the whole contract. First, the mapping is declarative, not imperative — each vendor’s field paths live in a registry keyed by source_system, so a vendor API change is a one-line registry edit, not a scattered code hunt. Second, provenance is preserved: source_system plus the extraction timestamp become the partition key and audit anchor, so any canonical record can be traced back to the exact system and batch that produced it. Where a source carries more nuance than the canonical schema expresses, fold the excess into extensions rather than discarding it — the same escape-hatch discipline used when mapping local fields in How to Map 9XX MARC Fields to BIBFRAME 2.0.
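
As a minimal sketch of the provenance rule, the helper below derives a partition key from source_system and the extraction timestamp. The function name partition_key and the key layout are assumptions for illustration, not part of the contract; adapt them to whatever your audit store expects.

```python
from datetime import datetime, timedelta, timezone


def partition_key(source_system: str, extracted_at: datetime) -> str:
    """Build a provenance partition key from the source system and extraction time."""
    if extracted_at.tzinfo is None:
        raise ValueError("extraction timestamp must be timezone-aware")
    # Normalize to UTC so the same instant always lands in the same partition.
    day = extracted_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"source_system={source_system}/extracted={day}"


# 23:30 at UTC-5 is already the next calendar day in UTC.
late_evening = datetime(2024, 3, 5, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
print(partition_key("folio", late_evening))  # source_system=folio/extracted=2024-03-06
```

Rejecting naive timestamps keeps two adapters in different time zones from filing the same batch under different days.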

Prerequisites & Environment Setup

The translation service is pure Python plus two well-scoped dependencies, pydantic and httpx. Pin versions so a validation or coercion change never silently alters which records pass the gate.

Python 3.11 or newer (uses dataclasses, modern typing, and datetime.UTC).
pydantic>=2.5 for the canonical model, strict coercion, and field validators.
httpx>=0.27 for async ILS reads with per-request timeouts.
ILS read (and, for write-back, write) credentials in environment variables — never in source. Set ILS_API_BASE, ILS_API_TOKEN, and PII_MASKING_SALT; load them with os.environ at startup and fail fast if any is missing.
A quarantine location (an object-store prefix or a database table) with the same retention controls as the catalog itself.
A structured-logging sink that never receives raw bibliographic or patron-adjacent content (see the PII checkpoints below).

Install and freeze in one shot:

python -m venv .venv
source .venv/bin/activate
pip install "pydantic>=2.5" "httpx>=0.27"
pip freeze > requirements.txt

A common early pitfall: relying on pydantic’s default lax coercion. In v2, str inputs like "1998 " will coerce to int unless you validate them, so a dirty vendor date can pass silently. Enforce the shape in a field_validator (Step 1 of Core Implementation below) rather than trusting the model to reject it for you.

Core Implementation

The pipeline has three labeled stages: define the canonical contract, register per-vendor mappings, then translate each raw payload through a pure function. Every stage is independently testable, and the translation function is stateless so it scales horizontally during peak catalog-update windows.

Step 1 — The canonical model

Define the vendor-neutral schema once. extra="forbid" makes an unexpected key an error rather than silent data loss, and the year validator rejects the partial dates vendors love to emit. Keep the model pure data — no I/O, no vendor branching.

from __future__ import annotations

from typing import Any

from pydantic import BaseModel, Field, ConfigDict, field_validator


class CanonicalBibRecord(BaseModel):
    """Vendor-neutral bibliographic record produced by every ingress adapter."""

    model_config = ConfigDict(extra="forbid", frozen=True)

    record_id: str = Field(..., min_length=1)
    title: str = Field(..., min_length=1)
    source_system: str
    publication_year: int | None = None
    subjects: list[str] = Field(default_factory=list)
    extensions: dict[str, Any] = Field(default_factory=dict)

    @field_validator("publication_year", mode="before")
    @classmethod
    def normalize_year(cls, v: Any) -> int | None:
        if v is None or v == "":
            return None
        year_str = str(v).strip()
        if not year_str.isdigit() or len(year_str) != 4:
            raise ValueError(f"publication_year must be a 4-digit integer, got {v!r}")
        return int(year_str)

Step 2 — The declarative mapping registry

Instead of one branchy function per vendor, describe each vendor’s field paths as data. The registry maps a canonical field name to the source key (or a small extractor callable for nested structures like Alma MARCXML). Adding a vendor is a new dict entry; changing a vendor’s API is a one-line edit.

from typing import Any, Callable

# A path is either a top-level key or a callable that pulls from a nested payload.
VendorPath = str | Callable[[dict[str, Any]], Any]

MAPPING_REGISTRY: dict[str, dict[str, VendorPath]] = {
    "sierra": {
        "record_id": "id",
        "title": "title",
        "publication_year": "publishYear",
        "subjects": "subjects",
    },
    "folio": {
        "record_id": "hrid",
        "title": "title",
        "publication_year": lambda p: (p.get("publication") or [{}])[0].get("dateOfPublication"),
        "subjects": lambda p: [s["value"] for s in p.get("subjects", [])],
    },
}


def resolve(path: VendorPath, payload: dict[str, Any]) -> Any:
    """Resolve one canonical field from a raw vendor payload."""
    if callable(path):
        return path(payload)
    return payload.get(path)

Step 3 — The pure translation function

Translation is a stateless function: look up the vendor’s mapping, resolve each field, and let the pydantic model enforce the contract. Because it never touches I/O, you can run thousands of these in parallel and unit-test every vendor variant without a live ILS. A missing source_system mapping raises immediately rather than emitting a half-populated record.

class UnknownVendorError(KeyError):
    """Raised when no mapping is registered for a source system."""


def translate_to_canonical(
    raw_payload: dict[str, Any], source_system: str
) -> CanonicalBibRecord:
    """Map a vendor payload to the canonical schema. Pure and side-effect free."""
    try:
        mapping = MAPPING_REGISTRY[source_system]
    except KeyError as exc:
        raise UnknownVendorError(f"No mapping registered for {source_system!r}") from exc

    fields: dict[str, Any] = {"source_system": source_system}
    for canonical_field, path in mapping.items():
        fields[canonical_field] = resolve(path, raw_payload)

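    # Route everything the mapping did not claim into the typed escape hatch.
    # Only string paths are claimed by name. Keys read by callable extractors
    # (for example FOLIO "publication") are not tracked, so they also land in extensions.
    claimed = {p for p in mapping.values() if isinstance(p, str)} | {"id", "hrid", "mms_id", "biblionumber"}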
    fields["extensions"] = {
        k: v for k, v in raw_payload.items()
        if isinstance(k, str) and k not in claimed
    }
    return CanonicalBibRecord(**fields)

Pitfall: do not let the adapter swallow a ValidationError and substitute a default. A record that fails coercion must surface as a failure so the gate in Error Handling & Quarantine Patterns can quarantine it with its reason; a silently defaulted publication_year corrupts downstream analytics without any audit trail.

PII & Compliance Checkpoints

Bibliographic payloads look innocuous, but ILS records routinely drag patron-adjacent data along with them — a hold queue on a varFields note, a donor identity in an acquisitions extension, a course-reserve link carrying a student ID. The translation layer must treat those as radioactive and never let them reach a log aggregator or a canonical field that should not hold them. The full boundary model lives in Data Privacy Boundaries in Library Systems, and the masking mechanics mirror PII Masking in Patron Data Exports.

Mask deterministically at the boundary. Hash direct identifiers with a salted SHA-256 before any canonical record is logged or exported. Determinism means the same identifier hashes identically across runs, so joins still work while the raw value never leaves the trust boundary.
Log deltas, not content. Audit entries carry field-level change deltas, a correlation ID, the processing timestamp, and the service identity — never raw bibliographic content.
Strip patron/donor notes on the way through. If an extensions entry carries an identity, drop it or route it to a restricted store, following the retention rules in Data Retention Policies for Public Libraries.

import hashlib
import os

SALT = os.environ["PII_MASKING_SALT"].encode()  # fail fast if unset
SENSITIVE_KEYS = ("patron_barcode", "email", "phone", "address_line1")


def mask_pii(record: dict[str, object]) -> dict[str, object]:
    """Deterministically hash direct identifiers before logging or export."""
    masked = dict(record)
    for key in SENSITIVE_KEYS:
        raw = masked.get(key)
        if raw is None:
            continue  # str(None) would otherwise be hashed as the literal "None"
        value = str(raw).strip()
        if value:
            masked[key] = hashlib.sha256(SALT + value.encode()).hexdigest()[:16]
    return masked

Retain audit snapshots immutably for the statutory retention period so a translation can be reconstructed during a compliance review. The zero-trust posture that keeps these credentials and payloads verified in transit is detailed in Designing Zero-Trust Architecture for Library APIs.
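
The third checkpoint, stripping identity-bearing extensions entries, might be approached with a sketch like the one below. The key hints and the email pattern are illustrative heuristics, not a complete PII detector, and split_extensions is a hypothetical helper name.

```python
import re
from typing import Any

# Hypothetical heuristics; tune these to the fields your vendors actually emit.
IDENTITY_KEY_HINTS = ("patron", "donor", "borrower", "student")
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def split_extensions(extensions: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split extensions into (safe, restricted) by key name and string content."""
    safe: dict[str, Any] = {}
    restricted: dict[str, Any] = {}
    for key, value in extensions.items():
        key_hit = any(hint in key.lower() for hint in IDENTITY_KEY_HINTS)
        value_hit = isinstance(value, str) and EMAIL_RE.search(value) is not None
        (restricted if key_hit or value_hit else safe)[key] = value
    return safe, restricted
```

Only the safe half would continue into the canonical record; the restricted half is either dropped or written to the restricted store under its own retention rules.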

Error Handling & Quarantine Patterns

Every translated record must clear a structural check before it reaches the catalog. The gate below counts failures, trips a circuit when the malformed rate crosses a threshold, and routes bad records aside instead of aborting a whole overnight batch. Failed records land in a quarantine dataset with their source context preserved for manual review — exactly the pattern used by the schema validation quarantine queue.

import logging
from dataclasses import dataclass, field
from datetime import datetime, UTC
from typing import Any

from pydantic import ValidationError

logger = logging.getLogger("ils.translate.audit")


class BatchAbort(RuntimeError):
    """Raised when the malformed-record rate breaches the circuit threshold."""


@dataclass
class TranslationGate:
    error_threshold: float = 0.02          # 2% malformed per batch trips the breaker
    min_sample: int = 100                  # records to see before the rate is judged
    total: int = 0
    errors: int = 0
    quarantine: list[dict[str, Any]] = field(default_factory=list)

    def _log(self, record_id: str, status: str, detail: str) -> None:
        logger.info(
            '{"ts":"%s","record_id":"%s","status":"%s","detail":"%s"}',
            datetime.now(UTC).isoformat(), record_id, status, detail,
        )

    def process(self, payload: dict[str, Any], source_system: str) -> CanonicalBibRecord | None:
        self.total += 1
        # Check every vendor's ID key so Alma and Koha failures are not logged as "unknown".
        rid = next(
            (str(payload[k]) for k in ("id", "hrid", "mms_id", "biblionumber") if payload.get(k)),
            "unknown",
        )
        try:
            record = translate_to_canonical(payload, source_system)
        except (ValidationError, UnknownVendorError, ValueError, KeyError, TypeError) as exc:
            # KeyError/TypeError cover extractor callables that hit an unexpected payload shape.
            self.errors += 1
            self.quarantine.append({"record_id": rid, "reason": str(exc), "payload": payload})
            self._log(rid, "QUARANTINED", type(exc).__name__)
            rate = self.errors / self.total
            # Without a minimum sample, one early failure (1 of 1 = 100%) would abort the batch.
            if self.total >= self.min_sample and rate >= self.error_threshold:
                raise BatchAbort(f"error rate {rate:.2%} reached threshold")
            return None
        self._log(rid, "VALID", record.source_system)
        return record

Wrap the ILS write-back that follows in exponential backoff with jitter and make it idempotent on record_id, so a transient 503 from the catalog retries safely without creating duplicates. The retry mechanics — including when to give up and quarantine versus keep retrying — are the same ones detailed in Implementing Circuit Breakers for ILS API Timeouts, and the backoff curve mirrors Configuring Exponential Backoff for Sierra API Calls. The async read that feeds the gate should honor the same discipline:

import asyncio
import random

import httpx


async def resilient_fetch(client: httpx.AsyncClient, url: str, max_retries: int = 3) -> httpx.Response:
    """Async ILS read with full-jitter exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = await client.get(url, timeout=10.0)
            response.raise_for_status()
            return response
        except (httpx.RequestError, httpx.HTTPStatusError):
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(random.uniform(0, min(30.0, 2.0 ** attempt)))
    raise RuntimeError("unreachable")
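
For the write-back side, the decision between retrying and giving up could be sketched as below. This is an illustration under stated assumptions, not the mechanics from the linked guides: the retryable status set, the GiveUp exception, and the send callable (which performs one upsert and returns its HTTP status) are all hypothetical.

```python
import random
import time
from typing import Callable

# Assumption: only throttling and server-side failures are worth retrying.
RETRYABLE_STATUS = frozenset({429, 500, 502, 503, 504})


class GiveUp(Exception):
    """Raised when a write should be quarantined instead of retried further."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def write_with_retry(
    send: Callable[[], int],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds, hits a non-retryable status, or runs out of attempts."""
    for attempt in range(max_attempts):
        status = send()
        if status < 400:
            return status
        if status not in RETRYABLE_STATUS:
            # A 4xx such as 422 will not improve on retry; quarantine the record.
            raise GiveUp(f"non-retryable status {status}")
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    raise GiveUp(f"retries exhausted after {max_attempts} attempts")
```

Because the upsert is keyed on record_id, a retry after an ambiguous 503 is safe: if the first attempt actually landed, the second becomes an update rather than a duplicate. Injecting sleep keeps the function testable without real delays.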

Performance Considerations

At migration scale — a consortium re-keying a million bib records overnight — the translation layer’s cost is dominated by how you move records through it, not by pydantic itself.

Stream, don’t accumulate. Pull vendor records with a cursor and hand them to TranslationGate.process one at a time; never build a list of every raw payload first. Materializing a full million-record page is the difference between a flat resident set and an out-of-memory kill.
Reuse the model, don’t rebuild it. CanonicalBibRecord is defined once at import; constructing a new pydantic model class per record (a surprisingly common accident) throws away pydantic’s compiled validator cache and dominates CPU.
Batch the write-back. Group serialized records into upsert batches sized to the ILS’s rate window rather than one HTTP call per record, per ILS REST API Polling & Rate Limiting.
Profile before tuning. Run a representative source batch under tracemalloc to find the real high-water mark. The streaming and profiling techniques carry over directly from Optimizing pymarc Performance for Large Record Sets.
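
The streaming, batching, and profiling points above can be illustrated together. In this sketch fake_cursor stands in for a real vendor cursor and the counting loop stands in for translate-and-upsert; both are assumptions, as are the helper names.

```python
import tracemalloc
from itertools import islice
from typing import Any, Callable, Iterable, Iterator


def fake_cursor(n: int) -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in for a vendor cursor that yields one record at a time."""
    for i in range(n):
        yield {"id": str(i), "title": f"Title {i}", "publishYear": "1998"}


def batches(records: Iterable[Any], size: int) -> Iterator[list[Any]]:
    """Group a stream into lists of at most `size` items without materializing it."""
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk


def peak_bytes(run: Callable[[], Any]) -> int:
    """Return the peak traced allocation, in bytes, while run() executes."""
    tracemalloc.start()
    try:
        run()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def streamed(n: int, size: int) -> int:
    count = 0
    for chunk in batches(fake_cursor(n), size):
        count += len(chunk)  # a real pipeline would translate and upsert the chunk here
    return count


def accumulated(n: int) -> int:
    return len(list(fake_cursor(n)))  # the anti-pattern: every payload held at once


if __name__ == "__main__":
    print("streamed peak:   ", peak_bytes(lambda: streamed(20_000, 500)))
    print("accumulated peak:", peak_bytes(lambda: accumulated(20_000)))
```

The absolute numbers depend on the interpreter and record size, so the comparison between the two peaks is the useful signal, not either figure alone.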

For very large migrations, run ingress, translation, and write-back as separate tasks so each scales independently — the distributed pattern in Async Batch Processing for Catalog Updates fits this layer directly.
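
A single-process sketch of that three-stage split, using bounded asyncio queues for backpressure, might look like this. The stage functions, the sentinel convention, and the list used as a sink are assumptions for illustration; a distributed deployment would replace the queues with a broker.

```python
import asyncio
from typing import Any, Callable, Iterable

SENTINEL = object()  # marks end-of-stream between stages


async def ingress(out: asyncio.Queue, payloads: Iterable[dict[str, Any]]) -> None:
    for payload in payloads:
        await out.put(payload)  # blocks when the queue is full (backpressure)
    await out.put(SENTINEL)


async def translate_stage(
    inbox: asyncio.Queue, out: asyncio.Queue, translate: Callable[[dict[str, Any]], Any]
) -> None:
    while (item := await inbox.get()) is not SENTINEL:
        await out.put(translate(item))
    await out.put(SENTINEL)


async def write_back(inbox: asyncio.Queue, sink: list[Any]) -> None:
    while (item := await inbox.get()) is not SENTINEL:
        sink.append(item)  # a real stage would batch and upsert here


async def run_pipeline(
    payloads: Iterable[dict[str, Any]],
    translate: Callable[[dict[str, Any]], Any],
    maxsize: int = 100,
) -> list[Any]:
    raw_q: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    canon_q: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    sink: list[Any] = []
    await asyncio.gather(
        ingress(raw_q, payloads),
        translate_stage(raw_q, canon_q, translate),
        write_back(canon_q, sink),
    )
    return sink
```

The bounded queues mean a slow write-back stage throttles ingress instead of letting raw payloads pile up in memory. With one worker per stage, as here, source order is preserved.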

Verification & Testing

Prove correctness before a batch reaches production. Because translation is a pure function, every vendor variant is assertable in isolation without a live ILS.

Translate a golden payload per vendor. For each source_system, feed a known fixture through translate_to_canonical and assert record_id, title, and publication_year match the expected canonical values. This catches registry-path regressions the moment a vendor API shifts.
Assert the reject path. Feed a payload with a partial date ("199") and assert it raises ValidationError rather than silently defaulting.
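Test the circuit breaker. Push a stream of a few hundred payloads where 3% are malformed, spread evenly, and assert TranslationGate.process eventually raises BatchAbort; a 1% stream of the same length, also spread evenly, must return the valid remainder and populate quarantine. Spacing matters because the gate evaluates a running rate: a cluster of failures at the very start of a stream reads as a much higher rate than the batch average.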
Test idempotency. Run the same source record through translate-and-upsert twice against a stub ILS and assert one create and one update (keyed on record_id), not two creates.

def test_folio_year_normalization():
    payload = {"hrid": "in00001", "title": "Deep Catalogs",
               "publication": [{"dateOfPublication": "1998"}], "subjects": []}
    record = translate_to_canonical(payload, source_system="folio")
    assert record.record_id == "in00001"
    assert record.publication_year == 1998
    assert record.source_system == "folio"
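
The idempotency check can be exercised without a live catalog. The StubILS class below is a hypothetical in-memory stand-in whose upsert is keyed on record_id; it illustrates the shape of the assertion rather than any real ILS client.

```python
from typing import Any


class StubILS:
    """In-memory stand-in for a catalog whose upsert is keyed on record_id."""

    def __init__(self) -> None:
        self.rows: dict[str, dict[str, Any]] = {}
        self.calls: list[str] = []

    def upsert(self, record: dict[str, Any]) -> str:
        rid = record["record_id"]
        outcome = "updated" if rid in self.rows else "created"
        self.rows[rid] = record
        self.calls.append(outcome)
        return outcome


def test_upsert_is_idempotent():
    ils = StubILS()
    record = {"record_id": "in00001", "title": "Deep Catalogs"}
    assert ils.upsert(record) == "created"
    assert ils.upsert(dict(record)) == "updated"  # same source record, second run
    assert ils.calls == ["created", "updated"]
    assert len(ils.rows) == 1  # one catalog row, not two
```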

Troubleshooting & Frequently Asked Questions

Why does a record fail translation when the vendor API clearly returned data?

Almost always a registry-path mismatch after a vendor API change. Confirm the MAPPING_REGISTRY entry for that source_system still points at the live field names, because vendors rename JSON keys between API versions; a renamed key resolves to None and fails the required record_id or title checks. Payload keys the mapping does not claim never reach extra="forbid", since translate_to_canonical routes them into extensions. If ValidationError reports an unexpected field, the registry itself names a canonical field the model does not define, usually a typo in the registry entry; correct the entry rather than widening the strict model.

The pipeline halts overnight with a `BatchAbort`. How do I recover?

A BatchAbort means the malformed rate crossed the 2% threshold, which almost always signals a systematic cause — a changed vendor vocabulary or a shifted field path — not random corruption. Inspect the quarantine list; the reason strings cluster around one root cause. Fix the registry entry, then re-run only the quarantined records. Because upsert is keyed on record_id, re-processing is safe and non-duplicating.
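
To see how the reason strings cluster, a small triage helper along these lines may help. The normalization rules (collapsing quoted values and digit runs) are an assumption about what varies between otherwise identical failures; adjust them to your own error messages.

```python
import re
from collections import Counter
from typing import Any


def reason_signature(reason: str) -> str:
    """Collapse record-specific values so identical failure modes group together."""
    sig = re.sub(r"'[^']*'|\"[^\"]*\"", "<v>", reason)  # quoted literals
    sig = re.sub(r"\d+", "<n>", sig)                     # digit runs
    return " ".join(sig.split())[:120]


def cluster_reasons(quarantine: list[dict[str, Any]], top: int = 5) -> list[tuple[str, int]]:
    """Return the most common failure signatures with their counts."""
    return Counter(reason_signature(q["reason"]) for q in quarantine).most_common(top)
```

If one signature accounts for nearly all entries, that is the registry path or vocabulary change to fix before re-running the quarantined records.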

Two ILS systems disagree on the same record. Which wins?

Translation does not resolve conflicts — it makes them explicit. Carry source_system and the extraction timestamp on every canonical record and let the write-back stage apply your source-of-truth policy (typically: the cataloging ILS wins bibliographic fields, the circulation ILS wins item status). Never let two adapters write the same record_id without a precedence rule, or the last batch to run silently overwrites the other.
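
A field-level precedence rule of that kind could be sketched as follows. The FIELD_OWNER policy, the item_status field, and merge_by_precedence are illustrative assumptions; the point is that a disagreement with no declared owner fails loudly instead of letting the last writer win.

```python
from typing import Any

# Hypothetical policy: which source_system owns which field.
FIELD_OWNER = {
    "title": "alma",
    "publication_year": "alma",
    "subjects": "alma",
    "item_status": "sierra",
}


def merge_by_precedence(
    candidates: list[dict[str, Any]], field_owner: dict[str, str]
) -> dict[str, Any]:
    """Merge same-record candidates field by field using an explicit owner map."""
    ids = {c["record_id"] for c in candidates}
    if len(ids) != 1:
        raise ValueError("candidates must share exactly one record_id")
    by_system = {c["source_system"]: c for c in candidates}
    merged: dict[str, Any] = {"record_id": ids.pop()}
    names = {k for c in candidates for k in c} - {"record_id", "source_system"}
    for name in sorted(names):
        values = {system: c[name] for system, c in by_system.items() if name in c}
        owner = field_owner.get(name)
        if owner in values:
            merged[name] = values[owner]      # the declared owner wins
        elif len(values) == 1:
            merged[name] = next(iter(values.values()))  # only one source has it
        else:
            raise ValueError(f"no precedence rule resolves {name!r}")
    return merged
```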

Publication years arrive as `None` for records that clearly have a date. Why?

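A None year means the registry path resolved to nothing: the source key is missing, was renamed, or holds an empty string, and normalize_year passes both None and "" through as None. A partial or non-numeric date ("199", "1998-2001", "[1998]") does not produce None; the validator rejects it and the whole record is quarantined with the reason attached. That is intended, since a wrong year is worse than a missing one for analytics. If your collection legitimately uses date ranges, parse them into extensions before validation rather than loosening the canonical int contract.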

How do I keep patron or donor data out of the canonical records and logs?

Mask at the boundary with the salted mask_pii function before any record is logged or exported, and emit only field-level deltas to the audit sink. Strip identity-bearing extensions entries during translation. The full boundary model is in Data Privacy Boundaries in Library Systems.

Core Architecture & Catalog Standards — the parent architecture this translation layer sits within.
MARC21 Field Mapping for Modern Pipelines — the field-level mapping rules the canonical schema enforces.
BIBFRAME to MARC21 Conversion Workflows — the round-trip serialization that consumes canonical records.
Designing Zero-Trust Architecture for Library APIs — securing the credentials and payloads this layer moves.
Implementing Circuit Breakers for ILS API Timeouts — the retry and breaker mechanics for write-back.