Using Celery for Distributed Catalog Ingestion

This page answers one narrow operational question: when you fan a nightly bibliographic ingest across a Celery worker fleet, why do workers get OOM-killed, why do the same holdings get written to the integrated library system twice, and why do tasks silently vanish in a PENDING state — and how do you fix all three without watering down throughput. It sits under Async Batch Processing for Catalog Updates, which defines the task-envelope contract this page assumes, and within the broader Catalog Ingestion & ILS Sync Pipelines architecture. If you are choosing Celery as the orchestration layer for MARC21/XML parsing, holdings sync, and circulation deltas, start here before you scale past a single worker.

Problem Framing

Three failure modes surface together the moment a single synchronous ingest loop is replaced with a distributed Celery deployment. They look unrelated in the logs but share a small number of root causes.

1. Workers are OOM-killed on large vendor dumps. A container that ran fine against a 5,000-record test file dies against the full nightly feed:

$ dmesg | grep -i "killed process"
[ 8123.44] Out of memory: Killed process 4417 (celery) total-vm:6291456kB, anon-rss:5872140kB

2. The same record is written to the ILS twice. Reconciliation reports show duplicate OCN/ISBN rows after a retry storm. The broker redelivered a task whose write had already committed, and nothing deduplicated the second attempt.

3. Tasks stall in PENDING or RECEIVED and never complete. Inspecting the result backend shows jobs that were dispatched but never acknowledged:

$ celery -A catalog inspect active --json | jq '.[] | length'
0
$ redis-cli --scan --pattern 'celery-task-meta-*' | head
celery-task-meta-3f2a...   # status: PENDING, enqueued 40 min ago

PENDING here is a trap: Celery returns PENDING for any task ID it does not recognize, so a task that was dropped on a broker disconnect is indistinguishable from one that was never sent. These are not three bugs to triage separately — they are three symptoms of an unconfigured distributed runtime.

Root Cause

Monolithic payloads and the default runtime. Passing a whole vendor dump as one task argument forces the entire document to be serialized onto the broker and materialized in worker heap at once. Compounding this, Celery’s defaults are tuned for short web tasks, not long-running catalog parsing. The default prefetch multiplier of 4 lets every pool process reserve four messages at a time, which is harmless for small payloads and expensive for large ones. Deployments that still accept pickle (the default serializer before Celery 4.0; JSON has been the default since) carry bloated payloads and a deserialization risk. And workers are never recycled by default, so allocator fragmentation from repeated large-record parsing is never reclaimed. The result is monotonic RSS growth until the kernel intervenes.
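The prefetch effect can be put into rough numbers. The sketch below assumes the prefetch limit is the pool concurrency multiplied by worker_prefetch_multiplier, and uses hypothetical figures (8 pool processes, 200 MiB per monolithic message) that are not measurements from any real feed. It counts only the reserved message bytes, before any parsing overhead.

```python
def reserved_bytes(concurrency: int, prefetch_multiplier: int, avg_message_bytes: int) -> int:
    """Upper bound on message bytes one worker node may hold reserved at once."""
    return concurrency * prefetch_multiplier * avg_message_bytes


MIB = 1024 ** 2

# Hypothetical figures for illustration only.
with_multiplier_4 = reserved_bytes(concurrency=8, prefetch_multiplier=4, avg_message_bytes=200 * MIB)
with_multiplier_1 = reserved_bytes(concurrency=8, prefetch_multiplier=1, avg_message_bytes=200 * MIB)

print(f"multiplier 4: {with_multiplier_4 // MIB} MiB reserved")
print(f"multiplier 1: {with_multiplier_1 // MIB} MiB reserved")
```

Lowering the multiplier only shrinks the bound by a constant factor; shrinking the message itself is what keeps the bound small regardless of feed size.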

At-least-once delivery without an idempotency guard. Celery does not offer exactly-once delivery. With acks_late enabled (which you want, so a crashed worker’s task is redelivered rather than lost), delivery is at-least-once: any task that commits to the ILS and then dies before acking will run again. Without a deterministic dedupe key, the second run re-inserts the record. This is the same idempotency key discipline the parent guide builds into the task envelope: the collision is not a Celery bug, it is a missing contract.
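A toy simulation makes the redelivery problem concrete. This is a minimal in-memory sketch, not the pipeline's code: the list stands in for ILS rows, the set stands in for a fingerprint store, and the "ocn" field and all function names are hypothetical.

```python
from typing import Callable

Record = dict[str, str]


def run_deliveries(deliveries: list[Record], write: Callable[[Record], None]) -> None:
    """Hand every delivery to the writer, including broker redeliveries."""
    for record in deliveries:
        write(record)


def make_unguarded_writer(rows: list[str]) -> Callable[[Record], None]:
    def write(record: Record) -> None:
        rows.append(record["ocn"])  # always inserts, even on a repeat
    return write


def make_guarded_writer(rows: list[str], seen: set[str]) -> Callable[[Record], None]:
    def write(record: Record) -> None:
        key = record["ocn"]  # stand-in for a real idempotency key
        if key in seen:
            return  # redelivery: short-circuit before touching the ILS
        rows.append(record["ocn"])
        seen.add(key)  # mark only after the write has happened
    return write


# The first record is delivered twice, as after a crash before the ack.
deliveries = [{"ocn": "ocn123"}, {"ocn": "ocn456"}, {"ocn": "ocn123"}]

unguarded_rows: list[str] = []
run_deliveries(deliveries, make_unguarded_writer(unguarded_rows))

guarded_rows: list[str] = []
run_deliveries(deliveries, make_guarded_writer(guarded_rows, set()))

print(len(unguarded_rows), len(guarded_rows))
```

The unguarded writer ends up with one row per delivery, while the guarded writer ends up with one row per distinct key.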

Silent drops from broker instability. Long-running ingest windows hold broker connections open for minutes. Misconfigured TCP keepalive lets an idle NAT or load balancer silently reset the connection; the worker never receives the message, the ack never returns, and the task is left PENDING in the result backend with no exception anywhere.

Solution

Fix all three at the configuration layer first, then add the idempotency guard in task code. The parent Async Batch Processing for Catalog Updates guide covers the full envelope contract these tasks carry; here we focus on the Celery-specific settings.

Step 1 — Route by material type and bound the runtime. Give heavy MARC parsing its own queue and pool so a slow parse cannot starve fast circulation deltas. Lower prefetch, recycle workers, and switch off pickle:

# celery_config.py
import socket

task_routes = {
    "catalog.tasks.parse_marc21": {"queue": "marc21.parse"},
    "catalog.tasks.sync_holdings": {"queue": "ils.sync.holdings"},
    "catalog.tasks.apply_circ_delta": {"queue": "circulation.delta.apply"},
}
task_serializer = "json"
result_serializer = "json"
accept_content = ["json"]           # refuse pickle payloads entirely
worker_prefetch_multiplier = 1      # one reserved message per pool process
worker_max_tasks_per_child = 50     # recycle before allocator fragmentation compounds
task_acks_late = True               # ack after the task finishes, not on receipt
task_reject_on_worker_lost = True   # requeue if the worker dies mid-task
result_expires = 3600               # expire result keys after 1 hour

# Stabilize long-running broker connections (root cause of silent PENDING drops).
# Assuming the Redis transport: the client passes these keys straight to setsockopt,
# so they must be the socket constants themselves, not their names as strings.
# TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are the Linux names; other platforms may differ.
broker_transport_options = {
    "socket_timeout": 15,
    "socket_keepalive": True,
    "socket_keepalive_options": {
        socket.TCP_KEEPIDLE: 60,    # seconds idle before the first probe
        socket.TCP_KEEPINTVL: 10,   # seconds between probes
        socket.TCP_KEEPCNT: 5,      # unanswered probes before the connection is dropped
    },
}
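As a rough check on the keepalive values in this configuration, the arithmetic below estimates how long the kernel takes to notice a dead peer, and whether the first probe fires early enough to keep an idle flow alive through a middlebox. It is a sketch under the usual Linux keepalive semantics; the 300-second middlebox timeout is a hypothetical figure, not a value from this deployment.

```python
def keepalive_detection_seconds(idle: int, interval: int, count: int) -> int:
    """Approximate time for the kernel to declare an idle, dead peer gone."""
    return idle + interval * count


def probes_keep_path_alive(idle: int, middlebox_idle_timeout: int) -> bool:
    """True when the first probe fires before a NAT or load balancer would expire the flow."""
    return idle < middlebox_idle_timeout


# Values from the configuration: 60 s idle, 10 s between probes, 5 probes.
detect = keepalive_detection_seconds(idle=60, interval=10, count=5)

# 300 is a hypothetical middlebox idle timeout used only for illustration.
print(detect, probes_keep_path_alive(idle=60, middlebox_idle_timeout=300))
```

If the idle value were larger than the real middlebox timeout, the flow could be expired before the first probe is ever sent, which is the silent-reset case described under Root Cause.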

Step 2 — Stream records instead of loading dumps. Never hand a full file to one task. Dispatch a group of small chunks so heap stays flat regardless of feed size. Each chunk is parsed exactly as Parsing MARC Records with pymarc describes; for the streaming reader details see Optimizing pymarc Performance for Large Record Sets.

import logging
from itertools import islice
from typing import Iterator

from celery import group
from pymarc import MARCReader

from catalog.tasks import sync_marc_record

log = logging.getLogger("catalog.ingest")


def _chunked(iterable: Iterator[dict], size: int) -> Iterator[list[dict]]:
    """Yield fixed-size lists from a record generator without buffering the whole feed."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def dispatch_vendor_feed(filepath: str, chunk_size: int = 500) -> None:
    """Stream a vendor MARC dump into bounded Celery chunks (constant worker RSS)."""
    with open(filepath, "rb") as fh:
        reader = MARCReader(fh, to_unicode=True, force_utf8=True)
        payloads = (r.as_dict() for r in reader if r is not None)
        for i, batch in enumerate(_chunked(payloads, chunk_size)):
            group(sync_marc_record.s(rec) for rec in batch).apply_async()
            log.info("dispatched_chunk", extra={"chunk": i, "records": len(batch)})
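The chunking helper can be exercised without Celery or pymarc. The sketch below repeats the same islice loop against a synthetic generator; fake_records and its record shape are stand-ins invented for the demonstration, not real MARC data.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(iterable: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield fixed-size lists; only one batch is held in memory at a time."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def fake_records(n: int) -> Iterator[dict]:
    """Synthetic stand-in for a streaming MARC reader."""
    for i in range(n):
        yield {"fields": [{"001": f"ocn{i:08d}"}]}


sizes = [len(batch) for batch in chunked(fake_records(1234), 500)]
print(sizes)  # [500, 500, 234]
```

Because the source is a generator, the final short batch falls out naturally and an empty feed yields no batches at all.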

Step 3 — Make every write idempotent. Compute a deterministic fingerprint from the control number plus a content digest and check it before committing. This survives redelivery: the second attempt sees the fingerprint and short-circuits. For rate-limited ILS endpoints, retry with exponential backoff and jitter — the same backoff discipline covered under ILS REST API Polling & Rate Limiting and, for Sierra specifically, Configuring Exponential Backoff for Sierra API Calls.

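import hashlib
import json
import logging
import random

from celery import Task

from catalog.app import app
from catalog.errors import RateLimitError
from catalog.ils import push_to_ils
from catalog.state import fingerprint_seen, mark_fingerprint

log = logging.getLogger("catalog.ingest")


def _idempotency_key(record: dict) -> str:
    """Deterministic dedupe key: control number + content digest survives retries."""
    # pymarc's as_dict() emits "fields" as a list of single-key dicts, e.g. [{"001": "ocn123"}, ...],
    # so the control number has to be searched for rather than indexed by tag.
    control_number = next(f["001"] for f in record["fields"] if "001" in f)
    # Canonical JSON (sorted keys, fixed separators) hashes identically across processes and retries.
    canonical = json.dumps(record["fields"], sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{control_number}:{digest}"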


class CatalogIngestTask(Task):
    abstract = True
    max_retries = 5
    acks_late = True

    def on_failure(self, exc, task_id, args, kwargs, einfo) -> None:
        # Log the failure keyed by task_id only — never the raw record payload.
        log.error("ingest_task_failed", extra={"task_id": task_id, "error": type(exc).__name__})


@app.task(bind=True, base=CatalogIngestTask)
def sync_marc_record(self, record_payload: dict) -> dict:
    key = _idempotency_key(record_payload)
    if fingerprint_seen(key):
        log.info("duplicate_skipped", extra={"idempotency_key": key})
        return {"status": "skipped", "idempotency_key": key}
    try:
        result = push_to_ils(record_payload)
    except RateLimitError as exc:
        countdown = 2 * (2 ** self.request.retries) + random.uniform(0, 1)
        raise self.retry(exc=exc, countdown=countdown)
    mark_fingerprint(key)  # record only after a confirmed commit
    return result

The ordering matters: mark_fingerprint runs after the confirmed ILS commit. If the worker dies between the write and the mark, redelivery re-runs the write — but a well-designed ILS upsert keyed on the control number makes that second write a no-op, so the fingerprint and the upsert are two independent guards against the same duplicate.
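The retry countdown in sync_marc_record is also worth sizing before it reaches production. The sketch below reuses the same formula (2 * 2**retries plus up to one second of jitter) and totals the wait across the five retries allowed by max_retries. The helper names are hypothetical.

```python
import random


def retry_countdown(retries: int, base: float = 2.0, jitter: float = 1.0) -> float:
    """Same formula as the task: base * 2**retries plus up to `jitter` seconds of noise."""
    return base * (2 ** retries) + random.uniform(0, jitter)


def ladder_bounds(max_retries: int = 5, base: float = 2.0, jitter: float = 1.0) -> tuple[float, float]:
    """Minimum and maximum total wait across all retries before the task gives up."""
    floor = sum(base * (2 ** n) for n in range(max_retries))
    return floor, floor + jitter * max_retries


low, high = ladder_bounds()
print(f"total wait across 5 retries: {low:.0f}s to {high:.0f}s")
```

With these values the whole ladder spans roughly a minute. If the ILS rate-limit window is longer than that, the task can exhaust its retries before the quota resets, in which case a larger base delay or a higher max_retries may be the better trade.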

Compliance or Privacy Impact

This fix changes the PII surface area, mostly for the better, but with one checkpoint you must not skip. Restricting accept_content to json closes the arbitrary-code-execution vector that pickle deserialization opens on workers, and it produces payloads you can actually inspect for compliance. It also means the full record body sits in the broker and result backend as readable JSON. Any patron-identifiable field (a hold requester, a circulation borrower) must already be masked before the record is enqueued; enqueueing is not a safe place to carry raw PII. Apply the masking rules from PII Masking in Patron Data Exports at the transform stage, upstream of apply_async.

Two further checkpoints: the on_failure handler above logs only the task_id and exception class, never args or kwargs, so a crash cannot spill a record body into your log aggregator. And records that fail schema validation should be diverted, not retried blindly — route them to the quarantine described in Schema Validation for Ingested Records so a malformed payload cannot loop through the retry ladder and pollute the audit trail with repeated failures.
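Both checkpoints reduce to small pure functions that can be tested without a broker. The following is a minimal sketch: SchemaValidationError, disposition, and failure_log_fields are hypothetical names, and the local RateLimitError is a stand-in for the one imported from catalog.errors in Step 3.

```python
class RateLimitError(Exception):
    """Stand-in for the rate-limit error raised by the ILS client."""


class SchemaValidationError(Exception):
    """Hypothetical error raised when a record fails schema validation."""


def disposition(exc: Exception) -> str:
    """Decide what happens to a failed task: quarantine, retry, or plain failure."""
    if isinstance(exc, SchemaValidationError):
        return "quarantine"  # malformed input will not improve on retry
    if isinstance(exc, RateLimitError):
        return "retry"  # transient: back off and try again
    return "fail"


def failure_log_fields(task_id: str, exc: Exception) -> dict[str, str]:
    """Log-safe summary: task id and exception class only, never args or the message body."""
    return {"task_id": task_id, "error": type(exc).__name__}


fields = failure_log_fields("3f2a", SchemaValidationError("patron name: Jane Doe"))
print(disposition(SchemaValidationError("bad leader")), fields)
```

Logging only the exception class matters because exception messages often embed the offending field value, which may itself be patron data.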

Verification

Confirm each fix independently before promoting the deployment.

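Bounded memory. Profile the dispatcher against the full feed with tracemalloc and assert that peak traced memory stays within budget. tracemalloc tracks Python-level allocations in the current process, not RSS, and this run measures the dispatching process; watch worker RSS separately across the same run:

import tracemalloc

tracemalloc.start()
dispatch_vendor_feed("nightly_full.mrc", chunk_size=500)
_, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
tracemalloc.stop()
assert peak < 512 * 1024 * 1024, f"peak {peak / 1e6:.0f} MB exceeds dispatcher budget"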

Idempotent redelivery. Prove that a second execution of the same payload commits nothing:

first = sync_marc_record.apply(args=[record]).get()
second = sync_marc_record.apply(args=[record]).get()  # simulate broker redelivery
assert second["status"] == "skipped"

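No silent drops. After a run, every dispatched task ID should resolve to a terminal state in the result backend. Enable task_track_started = True so that a task a worker has picked up reports STARTED rather than the default PENDING, then tally the stored states (note the modern lowercase setting name, not the legacy CELERY_TASK_TRACK_STARTED). Skipped duplicates return normally, so they are counted under SUCCESS; the skip is recorded in the result body, not in the task state:

$ redis-cli --scan --pattern 'celery-task-meta-*' \
    | xargs -n1 redis-cli get \
    | jq -r '.status' | sort | uniq -c
    9859 SUCCESS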

If any tasks are still PENDING after a completed run, the keepalive options from Step 1 may not be taking effect. Confirm that the worker’s broker connections actually carry them, then compare broker connection metrics with the ILS API gateway logs to separate network failures from application failures.
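Because PENDING is also what Celery reports for an ID it has never seen, a reliable drop check needs the dispatcher's own list of task IDs to compare against. The sketch below assumes the dispatcher records every ID it sends (the dispatch code above does not do this yet) and that the stored states have been read out of the result backend into a dict. The function name and sample IDs are hypothetical.

```python
# Celery states from which a task will not move again.
TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


def unaccounted(dispatched_ids: set[str], stored_states: dict[str, str]) -> dict[str, list[str]]:
    """Split dispatched task IDs into those with no stored result and those not yet finished."""
    stored_ids = set(stored_states)
    missing = sorted(dispatched_ids - stored_ids)
    unfinished = sorted(
        task_id
        for task_id in dispatched_ids & stored_ids
        if stored_states[task_id] not in TERMINAL_STATES
    )
    return {"missing": missing, "unfinished": unfinished}


# Hypothetical sample: three dispatched, one finished, one started, one never seen.
report = unaccounted(
    dispatched_ids={"a1", "b2", "c3"},
    stored_states={"a1": "SUCCESS", "b2": "STARTED"},
)
print(report)
```

IDs in "missing" are the silent drops: they were sent but nothing was ever written for them. IDs in "unfinished" were picked up but never completed, which points at the worker rather than the broker.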

Async Batch Processing for Catalog Updates — parent guide: the task-envelope contract, idempotency keys, and dead-letter handling these workers depend on
Optimizing pymarc Performance for Large Record Sets — streaming readers that keep each Celery chunk’s memory flat
Configuring Exponential Backoff for Sierra API Calls — the retry-with-jitter pattern for rate-limited ILS endpoints
Schema Validation for Ingested Records — where to route records that fail validation instead of retrying them