Async Batch Processing for Catalog Updates

Operating within the broader Catalog Ingestion & ILS Sync Pipelines architecture, asynchronous batch processing is the layer that turns a pile of freshly harvested bibliographic records into a controlled, retry-safe stream of writes against a live integrated library system. Library-tech staff reach for this pattern the moment a nightly vendor dump, an OAI-PMH harvest, or a bulk authority refresh grows large enough that a single synchronous loop would either block circulation-facing services or trip the vendor’s rate limiter. This guide covers how to decouple ingestion from the commit stage, enforce idempotent writes, mask patron data before it reaches a log line, and recover cleanly when a worker dies mid-batch — with production-grade Python you can adapt directly into a Celery or RQ deployment.

Specification & Contract

Async batch processing has no single governing standard the way MARC or SIP2 does, but a reliable pipeline is built on an explicit internal contract: the task envelope. Every unit of work that enters the queue is a self-describing message carrying enough context to be validated, executed, retried, and audited independently of the batch it came from. Treat this envelope as a versioned schema — workers deployed at different times must agree on its shape, so pin a schema_version and validate on both produce and consume.

The envelope decouples three concerns the blueprint insists on keeping separate: data transformation, task orchestration, and the ILS commit itself. Transformation produces the payload; orchestration reads the routing and retry blocks; the commit stage consumes payload plus idempotency_key and reports back keyed on correlation_id.

Field	Type	Required	Description
`schema_version`	string (semver)	yes	Envelope contract version; workers reject unknown majors.
`correlation_id`	UUID	yes	Stable ID traced from source harvest through final ILS acknowledgment.
`batch_id`	UUID	yes	Groups tasks from one harvest window for reconciliation.
`record_id`	string	yes	Source control number (MARC `001`) or vendor primary key.
`operation`	enum (`create`/`update`/`delete`)	yes	Derived from MARC leader position 05; drives commit routing.
`idempotency_key`	string	yes	Deterministic hash of `record_id` + `operation` + content digest; dedupes retries.
`payload`	object	yes	Normalized, PII-masked record body ready for the ILS write.
`content_digest`	string (sha256)	yes	Hash of the pre-mask source record for change detection and audit.
`retry`	object	no	`max_attempts`, `attempt`, `backoff_base` for orchestration.
`enqueued_at`	RFC 3339 timestamp	yes	Enqueue time; supports latency SLOs and stale-message eviction.
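
The `schema_version` rule in the table (workers reject unknown majors) can be sketched as a small gate that runs on both produce and consume. This is an illustration under stated assumptions rather than the pipeline's actual validator: `SUPPORTED_MAJORS` and `accepts_schema_version` are hypothetical names, and the check only accepts a plain `MAJOR.MINOR.PATCH` string, so semver pre-release suffixes are rejected.

```python
# Assumption: this worker build understands 1.x envelopes only.
SUPPORTED_MAJORS = frozenset({1})


def accepts_schema_version(version: str) -> bool:
    """Return True when the envelope's major version is one this worker supports."""
    parts = version.split(".")
    # Require exactly three numeric components; reject anything else rather than guess.
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        return False
    return int(parts[0]) in SUPPORTED_MAJORS
```

A worker that receives an envelope failing this check would send it to quarantine instead of attempting the commit.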

The operation field is the contract’s binding to the record itself: it is derived from the MARC leader, exactly as Parsing MARC Records with pymarc resolves leader position 05 (record status) into create, update, or delete routing. Getting that mapping wrong is the most common source of phantom deletes, so it belongs in the validated contract rather than in ad-hoc worker logic.
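
As a rough sketch of that mapping, the function below reads the record status character at leader position 05 and returns an operation name. The status codes are the MARC 21 bibliographic values; routing `a` and `p` (encoding-level increases) to `update` is an assumption of this example, and `operation_from_leader` is a hypothetical helper, not part of pymarc.

```python
# MARC 21 bibliographic Leader/05 (record status):
#   n = new, c = corrected or revised, d = deleted,
#   a = increase in encoding level, p = increase from prepublication
_STATUS_TO_OPERATION = {
    "n": "create",
    "c": "update",
    "a": "update",  # assumption: treated as an update
    "p": "update",  # assumption: treated as an update
    "d": "delete",
}


def operation_from_leader(leader: str) -> str:
    """Map the 24-character MARC leader to create/update/delete."""
    if len(leader) != 24:
        raise ValueError(f"leader must be 24 characters, got {len(leader)}")
    status = leader[5]
    if status not in _STATUS_TO_OPERATION:
        # Fail loudly: guessing here is how phantom deletes happen.
        raise ValueError(f"unknown record status {status!r}")
    return _STATUS_TO_OPERATION[status]
```

Raising on an unknown status, rather than defaulting to `update`, keeps a malformed leader out of the commit path.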

Prerequisites & Environment Setup

The reference implementation targets a modern async-capable runtime with a message broker and a distributed lock store. Confirm each item before wiring workers to a production ILS.

Never bake the ILS_TOKEN or PII_HASH_SALT into task payloads or source control. The salt in particular is load-bearing for irreversible patron anonymization; if it leaks, every hash it ever produced becomes attackable.

Core Implementation

The pipeline runs in four labeled stages: (1) build and validate the envelope, (2) enqueue with a stable idempotency key, (3) execute the transform under a distributed lock, and (4) commit to the ILS with explicit acknowledgment.

Step 1 — Define and validate the task envelope

Model the contract as a pydantic schema so malformed records are rejected at the boundary rather than three stages downstream. Validation failures raise before anything reaches the broker.

from __future__ import annotations

import hashlib
import json
import uuid
from datetime import datetime, timezone
from enum import Enum

import structlog
from pydantic import BaseModel, Field, ValidationError

log = structlog.get_logger("catalog.batch")


class Operation(str, Enum):
    CREATE = "create"
    UPDATE = "update"
    DELETE = "delete"


class TaskEnvelope(BaseModel):
    schema_version: str = "1.0.0"
    correlation_id: uuid.UUID = Field(default_factory=uuid.uuid4)
    batch_id: uuid.UUID
    record_id: str
    operation: Operation
    idempotency_key: str
    payload: dict
    content_digest: str
    enqueued_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))


def build_envelope(batch_id: uuid.UUID, record_id: str,
                   operation: Operation, payload: dict) -> TaskEnvelope:
    """Construct a validated envelope with a deterministic idempotency key.

    The key binds record identity to content: a re-harvest with identical
    content yields the same key (safe to dedupe), while a genuine change
    produces a new key (safe to apply).
    """
    # Canonical JSON sorts keys at every nesting level, so the digest does not
    # depend on dict insertion order. The digest covers the payload as passed in.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           default=str).encode("utf-8")
    content_digest = hashlib.sha256(canonical).hexdigest()
    idempotency_key = hashlib.sha256(
        f"{record_id}:{operation.value}:{content_digest}".encode("utf-8")
    ).hexdigest()
    try:
        return TaskEnvelope(
            batch_id=batch_id,
            record_id=record_id,
            operation=operation,
            idempotency_key=idempotency_key,
            payload=payload,
            content_digest=content_digest,
        )
    except ValidationError:
        log.error("envelope.invalid", record_id=record_id, operation=operation.value)
        raise

Pitfall: do not seed the idempotency key from a random UUID or a timestamp. If the key changes on every run, a network hiccup that triggers a retry will write the same record twice. The key must be a pure function of identity plus content.
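
The difference is easy to see in isolation. The sketch below uses a hypothetical `content_key` helper that mirrors the key derivation in `build_envelope` (assuming canonical JSON with sorted keys as the content serialization) and contrasts it with a random key.

```python
import hashlib
import json
import uuid


def content_key(record_id: str, operation: str, payload: dict) -> str:
    """Deterministic key: a pure function of identity plus content."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(canonical).hexdigest()
    return hashlib.sha256(f"{record_id}:{operation}:{digest}".encode("utf-8")).hexdigest()


def random_key() -> str:
    """Anti-pattern: changes on every call, so a retry looks like a new write."""
    return uuid.uuid4().hex


first = content_key("b1001", "update", {"title": "Dune", "year": 1965})
# Same content with a different key order, as a re-harvest might produce.
retry = content_key("b1001", "update", {"year": 1965, "title": "Dune"})
# A genuine change to the record.
changed = content_key("b1001", "update", {"title": "Dune", "year": 1966})
```

Here `first` and `retry` are equal, so the retry can be deduplicated, while `changed` differs and is applied as a new write.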

Step 2 — Enqueue in idempotent chunks

Partition the batch into bounded chunks so a single failure never forces a full replay, and so back-pressure is observable. Each chunk is dispatched independently; the correlation ID rides along on every task.

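import os
from itertools import islice
from typing import Iterable, Iterator

from celery import Celery

# Connection URLs come from the environment, not from literals in source.
app = Celery("catalog", broker=os.environ["BROKER_URL"],
             backend=os.environ["RESULT_BACKEND"])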

CHUNK_SIZE = 250


def chunked(it: Iterable, size: int) -> Iterator[list]:
    iterator = iter(it)
    while batch := list(islice(iterator, size)):
        yield batch


def dispatch_batch(envelopes: Iterable[TaskEnvelope]) -> None:
    for chunk in chunked(envelopes, CHUNK_SIZE):
        for env in chunk:
            commit_record.apply_async(
                kwargs={"envelope": env.model_dump(mode="json")},
                task_id=env.idempotency_key,   # stable ID for tracing; not a broker-side dedupe
                headers={"correlation_id": str(env.correlation_id)},
            )
        log.info("batch.chunk.dispatched", size=len(chunk))

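Setting task_id to the idempotency key gives every retry and re-dispatch of the same record the same task ID, which makes duplicates easy to spot in the result backend and in logs. Celery and its brokers do not reject a message merely because its task_id has been seen before, so treat this as a tracing aid: if you want an enqueue-time dedupe layer beneath the ILS-level guard in Step 4, put an explicit atomic set-if-absent check on the key in front of apply_async. The routing and backoff primitives that make this reliable across many worker nodes are covered in depth in Using Celery for Distributed Catalog Ingestion.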
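
One way to get an enqueue-time dedupe layer is an atomic set-if-absent on the idempotency key before dispatch. The sketch below uses an in-memory stand-in so it runs anywhere; `SeenKeys` and `filter_new` are hypothetical names. In a real deployment the same role could be played by Redis `SET key value NX EX ttl`, which only succeeds for the first caller.

```python
from __future__ import annotations

import time


class SeenKeys:
    """In-memory stand-in for an atomic set-if-absent store with expiry."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._expires: dict[str, float] = {}

    def claim(self, key: str, now: float | None = None) -> bool:
        """Return True only for the first claim of a key within its TTL."""
        now = time.monotonic() if now is None else now
        expiry = self._expires.get(key)
        if expiry is not None and expiry > now:
            return False  # already claimed and not yet expired
        self._expires[key] = now + self._ttl
        return True


def filter_new(keys: list[str], seen: SeenKeys) -> list[str]:
    """Keep only keys that have not been dispatched recently."""
    return [k for k in keys if seen.claim(k)]
```

The TTL bounds memory and lets a record be re-dispatched after a replay window; pick it to cover at least one full harvest cycle.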

Step 3 — Execute the transform under a record-scoped lock

Concurrent workers will eventually pick up two updates to the same bibliographic or holdings record. Serialize those with a Redis lock scoped to the record identifier, with a TTL so a crashed worker cannot deadlock the queue.

import os

import redis
from redis.exceptions import LockError

# Assumes RESULT_BACKEND points at a Redis instance.
_redis = redis.Redis.from_url(os.environ["RESULT_BACKEND"])


def commit_under_lock(env: TaskEnvelope) -> None:
    lock = _redis.lock(
        name=f"lock:record:{env.record_id}",
        timeout=30,          # auto-release ceiling if the worker dies
        blocking_timeout=5,  # give up fast rather than pile up
    )
    acquired = lock.acquire()
    if not acquired:
        # Another worker holds this record; requeue rather than block.
        raise RecordLocked(env.record_id)
    try:
        _apply_to_ils(env)
    finally:
        try:
            lock.release()
        except LockError:
            # Lock TTL elapsed before release (slow commit). Another worker may
            # have taken the lock in the meantime, so log it for follow-up.
            log.warning("lock.expired_before_release", record_id=env.record_id)

Pitfall: scope the lock to the record, never to the whole batch. A batch-wide lock silently collapses your concurrency back to one, and the symptom — a pipeline that is mysteriously as slow as the synchronous version you replaced — is hard to spot in metrics.
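
The lock TTL also has to outlast the slowest commit it protects, retries included. The helper below is a back-of-the-envelope bound, not a measurement: it assumes every attempt runs into a single per-attempt timeout and that each backoff wait doubles from an initial value with a fixed maximum jitter, capped at a ceiling. `worst_case_commit_seconds` is a hypothetical name.

```python
def worst_case_commit_seconds(attempts: int, request_timeout: float,
                              backoff_initial: float, backoff_max: float,
                              jitter_max: float = 1.0) -> float:
    """Upper bound on one commit: every attempt times out, every wait is maximal."""
    waits = sum(
        min(backoff_initial * 2 ** n + jitter_max, backoff_max)
        for n in range(attempts - 1)  # no wait after the final attempt
    )
    return attempts * request_timeout + waits


# Five attempts, 10 s timeout, backoff starting at 1 s and capped at 30 s.
budget = worst_case_commit_seconds(5, 10.0, 1.0, 30.0)
```

With those inputs the bound is 69 seconds, well above a 30 second lock TTL, so either the TTL should be raised or the retry budget trimmed so the lock cannot expire while a commit is still retrying.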

Step 4 — Commit to the ILS with explicit acknowledgment

The commit is the only stage allowed to talk to the vendor. It sends the idempotency key as a request header so the ILS can reject a duplicate write even if every upstream guard was bypassed, and it acknowledges the task only after a confirmed 2xx.

import os

import httpx
from tenacity import (retry, stop_after_attempt, wait_exponential_jitter,
                      retry_if_exception_type)


class TransientILSError(Exception): ...
class PermanentILSError(Exception): ...


@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(initial=1, max=30),
    retry=retry_if_exception_type(TransientILSError),
    reraise=True,
)
def _apply_to_ils(env: TaskEnvelope) -> None:
    # Read connection settings from the environment at call time.
    endpoint = f"{os.environ['ILS_BASE_URL']}/bibs/{env.record_id}"
    method = "DELETE" if env.operation is Operation.DELETE else "PUT"
    bound = log.bind(correlation_id=str(env.correlation_id),
                     record_id=env.record_id, operation=env.operation.value)
    try:
        with httpx.Client(timeout=httpx.Timeout(10.0, connect=5.0)) as client:
            resp = client.request(
                method, endpoint,
                headers={"Authorization": f"Bearer {os.environ['ILS_TOKEN']}",
                         "Idempotency-Key": env.idempotency_key},
                json=env.payload if method == "PUT" else None,
            )
    except httpx.TransportError as exc:
        # Timeouts and connection failures are retryable, like a 503.
        bound.warning("ils.transport_error", error=type(exc).__name__)
        raise TransientILSError(type(exc).__name__) from exc
    if resp.status_code in (429, 502, 503, 504):
        bound.warning("ils.transient", status=resp.status_code)
        raise TransientILSError(resp.status_code)
    if resp.status_code >= 400:
        bound.error("ils.permanent", status=resp.status_code)
        raise PermanentILSError(resp.status_code)
    bound.info("ils.committed", status=resp.status_code)

Distinguishing TransientILSError from PermanentILSError is what keeps retries honest: a 503 deserves an exponential-backoff retry, while a 422 schema rejection will fail identically on every attempt and belongs in the quarantine path from the next section. The backoff shape here mirrors the vendor-specific tuning in Configuring Exponential Backoff for Sierra API Calls.
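
Pulling the status mapping into a pure function makes it easy to unit-test without a mock server. This is a sketch that mirrors the status sets used in `_apply_to_ils`; `classify_status` and `TRANSIENT_STATUSES` are hypothetical names introduced for the example.

```python
# Statuses worth retrying with backoff: rate limiting and gateway-class failures.
TRANSIENT_STATUSES = frozenset({429, 502, 503, 504})


def classify_status(status_code: int) -> str:
    """Return 'transient', 'permanent', or 'ok' for an ILS response status."""
    if status_code in TRANSIENT_STATUSES:
        return "transient"
    if status_code >= 400:
        return "permanent"  # retrying would fail identically; quarantine instead
    return "ok"
```

Keeping the set in one place means a change to the retry policy is a one-line edit that the tests pick up immediately.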

PII & Compliance Checkpoints

Public sector deployments require that patron identifiers, circulation notes, and acquisition cost data never appear in a log line, a result backend, or an intermediate broker message. The batch pipeline has two mandatory checkpoints: mask before enqueue, and mask again before any telemetry is emitted.

Masking belongs before the envelope is built, so the payload that travels through the broker is already clean. Apply a deterministic salted hash to any identifier that must survive as an audit token, and replace free-text PII-bearing subfields with a structural placeholder. Because the transform is a pure function of the input and the salt, reprocessing the same source record produces the same masked output, which is what lets a dead-letter replay stay compliant.

import hashlib
import os

_SALT = os.environ["PII_HASH_SALT"].encode("utf-8")
# Identifiers kept as salted hashes; free-text "*_note" fields are redacted below.
_PII_SUBFIELDS = {"patron_barcode", "email", "phone"}


def mask_payload(payload: dict) -> dict:
    """Irreversibly hash retained identifiers; redact free-text PII fields."""
    masked = dict(payload)
    for key in list(masked):
        if key in _PII_SUBFIELDS:
            value = str(masked[key]).encode("utf-8")
            masked[key] = "hash:" + hashlib.sha256(_SALT + value).hexdigest()[:16]
        elif key.endswith("_note"):
            masked[key] = "[REDACTED]"
    return masked

The broader retention rules that decide which fields may be retained as hashes versus dropped entirely are governed by Data Retention Policies for Public Libraries, and the field-level masking catalog is maintained under PII Masking in Patron Data Exports. Keep the content_digest computed over the pre-mask record so change detection stays accurate, but never persist the pre-mask body beyond the transform stack.
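
The second checkpoint, masking before telemetry is emitted, can be sketched as a log processor. The function below follows the structlog processor signature (logger, method name, event dict) but is written with no imports so it can be tested on its own; `redact_event` and `SENSITIVE_KEYS` are hypothetical names, and the key list is an assumption that should be aligned with your own masking catalog.

```python
# Assumption: keys that must never reach a log line, including whole payloads.
SENSITIVE_KEYS = frozenset({"patron_barcode", "email", "phone", "payload"})


def redact_event(logger, method_name, event_dict):
    """Replace sensitive values in a log event before it is rendered."""
    for key in list(event_dict):
        if key in SENSITIVE_KEYS or key.endswith("_note"):
            event_dict[key] = "[REDACTED]"
    return event_dict
```

With structlog, a processor like this would be placed in the `processors` list passed to `structlog.configure`, ahead of the renderer, so that a stray `log.info(..., payload=...)` cannot leak a record body.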

Error Handling & Quarantine Patterns

A record that cannot be committed must leave the main queue quickly and land somewhere it can be inspected, not retried into an infinite loop. Route by exception type: transient errors exhaust their retry budget and then go to the dead-letter queue for automated replay; permanent errors go straight to quarantine for human review.

import json


class RecordLocked(Exception):
    def __init__(self, record_id: str):
        self.record_id = record_id
        super().__init__(record_id)


@app.task(bind=True, acks_late=True, max_retries=0)
def commit_record(self, envelope: dict) -> None:
    try:
        env = TaskEnvelope.model_validate(envelope)
    except ValidationError as exc:
        # No valid envelope exists yet, so quarantine the raw message.
        # Log only the error count: validation messages can echo input values.
        log.error("commit.envelope_invalid", error_count=exc.error_count())
        _redis.rpush("queue:quarantine", json.dumps(envelope, default=str))
        return
    bound = log.bind(correlation_id=str(env.correlation_id), record_id=env.record_id)
    try:
        commit_under_lock(env)
    except RecordLocked:
        # Transient contention: requeue with a short countdown.
        raise self.retry(countdown=5, max_retries=3)
    except TransientILSError as exc:
        bound.warning("commit.transient_exhausted", error=str(exc))
        _dead_letter(env, reason="transient_exhausted")
    except PermanentILSError as exc:
        bound.error("commit.quarantined", error=str(exc))
        _quarantine(env, reason=type(exc).__name__)


def _quarantine(env: TaskEnvelope, reason: str) -> None:
    _redis.rpush("queue:quarantine", env.model_dump_json())
    log.info("record.quarantined", record_id=env.record_id, reason=reason)


def _dead_letter(env: TaskEnvelope, reason: str) -> None:
    _redis.rpush("queue:dead_letter", env.model_dump_json())
    log.info("record.dead_lettered", record_id=env.record_id, reason=reason)

Setting acks_late=True is essential: the task is acknowledged only after the body runs to completion, so a worker that drops off the broker mid-commit leaves the message unacknowledged and it is redelivered rather than silently lost. One caveat: when the child process executing a task is killed outright, Celery still acknowledges the message unless task_reject_on_worker_lost is also enabled, so set both. Structurally malformed records that fail even envelope validation should be routed through the same quarantine path that Schema Validation for Ingested Records uses at the ingestion boundary, so there is one place to review everything that could not be applied.
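
The delivery settings discussed above can be collected in one place. The fragment below is a sketch with illustrative values, not a tuned configuration; `CELERY_SETTINGS` is a hypothetical name, the keys are standard Celery setting names, and the `visibility_timeout` option only applies to brokers such as Redis that redeliver unacknowledged messages after a timeout.

```python
CELERY_SETTINGS = {
    # Acknowledge after the task body finishes, not on delivery.
    "task_acks_late": True,
    # Requeue (rather than ack) when the executing worker process is killed.
    "task_reject_on_worker_lost": True,
    # Reserve one message per worker process so a crash strands less work.
    "worker_prefetch_multiplier": 1,
    # Seconds before an unacked message is redelivered (Redis-style brokers).
    "broker_transport_options": {"visibility_timeout": 3600},
}
```

These would typically be applied with `app.conf.update(CELERY_SETTINGS)`. The visibility timeout should exceed the longest expected task, otherwise a slow commit is redelivered while it is still running.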

Performance Considerations

Throughput is governed by whether the bottleneck is I/O or CPU. Catalog commits are overwhelmingly I/O-bound — the worker spends its time waiting on the ILS — so the effective lever is concurrency, not raw core count. Two guidelines hold up in production:

Match worker concurrency to the vendor’s rate budget, not your core count. An asyncio.Semaphore caps the number of in-flight requests rather than requests per second, so size it from the negotiated ceiling and the typical ILS response time (in-flight ≈ rate × latency) to stay under the limiter while still saturating the connection pool. Oversubscribing simply converts throughput into 429s and retries.
Keep the transform streaming. Large vendor dumps must be parsed incrementally, not materialized into a list of records. The memory-boundary techniques — generator-based iteration, explicit del, bounded chunk sizes — are detailed in Optimizing pymarc Performance for Large Record Sets.
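
When the vendor's limit is expressed in requests per second, a pacing limiter expresses it more directly than a concurrency cap. The class below is a minimal sketch of that idea, spacing acquisitions evenly; `IntervalLimiter` is a hypothetical name, and it assumes all callers share one event loop.

```python
import asyncio


class IntervalLimiter:
    """Spaces acquisitions at least 1/rate seconds apart on a single event loop."""

    def __init__(self, rate_per_second: float) -> None:
        self._interval = 1.0 / rate_per_second
        self._next_slot = 0.0

    async def acquire(self) -> None:
        loop = asyncio.get_running_loop()
        now = loop.time()
        # No await between reading and updating _next_slot, so concurrent
        # callers on the same loop cannot claim the same slot.
        slot = max(now, self._next_slot)
        self._next_slot = slot + self._interval
        if slot > now:
            await asyncio.sleep(slot - now)
```

Each commit coroutine would call `await limiter.acquire()` before issuing its request. This evens out bursts but does not bank unused capacity the way a full token bucket does.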

For genuinely mixed workloads — where MARC-to-payload normalization is CPU-heavy — run a two-layer topology: an asyncio event loop for the ILS network calls, feeding a ProcessPoolExecutor for the CPU-bound transform so the Global Interpreter Lock never serializes your parsing. Emit event-loop lag as a metric; a rising lag figure is the earliest signal that CPU work has leaked onto the I/O loop.

import asyncio


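async def bounded_commit(envelopes: list[TaskEnvelope], max_in_flight: int) -> None:
    # Caps concurrent commits (not requests per second). TaskGroup needs Python 3.11+.
    sem = asyncio.Semaphore(max_in_flight)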

    async def _one(env: TaskEnvelope) -> None:
        async with sem:
            await asyncio.to_thread(commit_under_lock, env)

    async with asyncio.TaskGroup() as tg:
        for env in envelopes:
            tg.create_task(_one(env))
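
Event-loop lag can be sampled by asking the loop to sleep for a fixed interval and measuring how late it wakes up. This is a sketch of one common approach; `sample_loop_lag` is a hypothetical name, and in production the samples would be pushed to whatever metrics client you already use rather than returned as a list.

```python
from __future__ import annotations

import asyncio


async def sample_loop_lag(interval: float, samples: int) -> list[float]:
    """Measure how late the loop resumes us relative to the requested sleep."""
    loop = asyncio.get_running_loop()
    lags: list[float] = []
    for _ in range(samples):
        start = loop.time()
        await asyncio.sleep(interval)
        # Anything beyond the requested interval is time the loop was busy elsewhere.
        lags.append(max(0.0, loop.time() - start - interval))
    return lags
```

On a healthy I/O loop the samples stay near zero; a synchronous parse that sneaks onto the loop shows up as a lag spike roughly equal to its duration.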

Verification & Testing

Validate the pipeline against a mock ILS before pointing it at production. Three assertions catch the failures that matter most: idempotency, correct operation routing, and clean quarantine behavior.

import os
from unittest import mock

import respx
from httpx import Response


@mock.patch.dict(os.environ, {"ILS_BASE_URL": "https://ils.example.test",
                              "ILS_TOKEN": "test-token"})
@respx.mock
def test_idempotent_retry_writes_once():
    route = respx.put(url__regex=r".*/bibs/.*").mock(
        side_effect=[Response(503), Response(200)]
    )
    env = build_envelope(uuid.uuid4(), "b1001", Operation.UPDATE,
                         {"title": "Test", "patron_barcode": "X"})
    _apply_to_ils(env)  # first call 503 -> retry -> 200
    assert route.call_count == 2
    # Same idempotency key on both attempts proves no duplicate identity.
    keys = {c.request.headers["Idempotency-Key"] for c in route.calls}
    assert len(keys) == 1


def test_payload_is_masked_before_enqueue():
    env = build_envelope(uuid.uuid4(), "b1002", Operation.UPDATE,
                         mask_payload({"patron_barcode": "2100450"}))
    assert env.payload["patron_barcode"].startswith("hash:")
    assert "2100450" not in env.model_dump_json()

Run these under pytest with the broker mocked so the suite stays hermetic. For a full end-to-end check, dispatch a small batch against a staging ILS, then reconcile the batch_id group: every envelope should appear exactly once in the ILS audit log, with the quarantine and dead-letter queues empty for a clean run.
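
The reconciliation step can be sketched as a comparison between the keys you dispatched and the keys the ILS reports as applied. The shape of the audit log is an assumption here (a flat list of idempotency keys for one batch_id), and `reconcile` is a hypothetical helper.

```python
from collections import Counter


def reconcile(dispatched_keys, audit_keys) -> dict:
    """Compare dispatched idempotency keys with those found in the ILS audit log."""
    audit_counts = Counter(audit_keys)
    dispatched = set(dispatched_keys)
    return {
        # Dispatched but never applied: check dead-letter and quarantine queues.
        "missing": sorted(dispatched - set(audit_counts)),
        # Applied more than once: the idempotency guard failed somewhere.
        "duplicated": sorted(k for k, n in audit_counts.items() if n > 1),
        # Applied but not part of this batch: likely a batch_id mix-up.
        "unexpected": sorted(set(audit_counts) - dispatched),
    }
```

A clean run returns three empty lists; anything else points at the specific keys to investigate.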

Troubleshooting & FAQ

Why are some records written to the ILS twice?

The idempotency key is not deterministic. Confirm it is derived purely from record_id, operation, and content_digest, never from a UUID or timestamp. Also verify the ILS actually honors the Idempotency-Key header; some endpoints ignore it. In that case the ILS-level guard is gone, and a reused task_id alone will not stop a second enqueue, so add an explicit set-if-absent check on the idempotency key before dispatch in Step 2.

The pipeline is no faster than the old synchronous script. What happened?

Almost always a lock scoped too broadly. A batch-wide or table-wide Redis lock serializes every worker back to one. Scope the lock to the individual record_id so unrelated records commit in parallel, and confirm blocking_timeout is short enough that contended tasks requeue instead of stalling.

Workers keep raising `RecordLocked` and tasks pile up.

Two envelopes for the same record_id are being dispatched close together, so they fight for the same lock. Deduplicate at enqueue time: identical content shares an idempotency_key and can be dropped, while differing versions of one record should be collapsed to the latest. Keep a small countdown on the retry so contention resolves without a hot loop. Persistent contention usually means the upstream harvest is emitting duplicate 001 control numbers.
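
Collapsing repeated record IDs before dispatch takes only a few lines. This sketch assumes envelopes arrive as dicts with a `record_id` key and that later items in the harvest are newer; `latest_per_record` is a hypothetical helper.

```python
def latest_per_record(envelopes):
    """Keep the last envelope seen for each record_id, in first-seen order."""
    latest = {}
    for env in envelopes:
        # Later entries overwrite earlier ones; dict order stays first-seen.
        latest[env["record_id"]] = env
    return list(latest.values())
```

This holds one envelope per distinct record in memory, so for very large harvests apply it per chunk window rather than across the whole dump.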

A killed worker lost records mid-batch. How do I prevent that?

Ensure every task uses acks_late=True so the message is acknowledged only after the commit succeeds, and enable task_reject_on_worker_lost so a task whose worker process is killed is requeued instead of acknowledged. Without late acks, the broker acks on delivery and a worker crash drops the in-flight record silently. With them, the message is redelivered and, because the write is idempotent, safely reapplied.

The quarantine queue is filling with 422 errors.

A 422 is a schema rejection, not a transient fault, so retrying is pointless. Inspect a quarantined envelope’s payload against the target contract; the usual cause is a field-mapping drift between your normalizer and the ILS schema. Fix the mapping upstream and replay the quarantine queue rather than editing records in place.

Retries never stop and the same record loops forever.

A permanent error is being classified as transient. Audit the exception mapping in _apply_to_ils: only 429, 502, 503, and 504 should raise TransientILSError. Every other status of 400 or above must raise PermanentILSError so it exits to quarantine on the first attempt instead of consuming the full retry budget.

Using Celery for Distributed Catalog Ingestion — worker routing, result backends, and at-least-once delivery for the queue behind this pattern.
Parsing MARC Records with pymarc — the transform layer that produces the validated payloads enqueued here.
Schema Validation for Ingested Records — boundary rejection and the shared quarantine path for malformed records.
ILS REST API Polling & Rate Limiting — token-bucket and adaptive concurrency controls for the commit stage.
Data Retention Policies for Public Libraries — retention and masking rules that govern what the payload may carry.