Data Retention Policies for Public Libraries: Engineering Deterministic Lifecycle Workflows

Operating within the broader Patron Validation & Privacy Data Routing architecture, this guide covers the problem every public library data team eventually inherits: patron records, circulation logs, and hold history accumulate indefinitely unless something deliberately ages them out, and the statutes governing that ageing are strict, jurisdiction-specific, and audited. Library-tech staff hit this the moment a records officer asks how long checkout history is kept, a state privacy statute mandates deletion after a fixed window, or an analytics warehouse is discovered to be holding identifiable circulation rows years past their retention horizon. Retention has to be engineered as a deterministic, replayable workflow — not a quarterly manual purge that nobody can prove happened correctly.

This page walks the retention contract end to end: the policy schema you enforce, the environment you need, an annotated Python implementation of the manifest-driven lifecycle engine, the compliance checkpoints for patron-identifiable fields, the quarantine patterns that keep one malformed manifest from deleting the wrong records, how to keep it performant at multi-million-row scale, and how to verify a run before it touches production. Every retention action described here is designed to be idempotent and to leave a cryptographically verifiable trail behind it.

Specification & Contract

A retention pipeline is only defensible if its behaviour is declared, versioned, and reproducible. The contract has one governing rule: no record changes state except as the direct result of a manifest entry evaluated against a timestamp. Every purge, archive, or pseudonymization is traceable to a specific policy version and a specific horizon calculation, so an auditor can replay the decision. The manifest is the single source of truth; the code is just an interpreter for it.

Each manifest entry maps a record type to a retention window and a terminal action. The table below is the minimum contract most public library pipelines start from. Treat each row as a rule you enforce in tests, not a default the code invents — an unmapped record type is a policy gap that routes to review, never a silent “keep forever.”

| Field | Type | Example | Meaning |
| --- | --- | --- | --- |
| `record_type` | string | `circulation_log` | The dataset the rule governs; joins to a physical table or export |
| `retention_days` | int | `1095` | Horizon in days from the record’s anchor timestamp before action is due |
| `action` | enum | `pseudonymize` | Terminal transition: `archive`, `pseudonymize`, or `hard_delete` |
| `requires_referential_check` | bool | `true` | Whether linked rows (holds, fines, ILL) must be clear before mutation |
| `anchor_field` | string | `checkout_date` | The column whose value the horizon is measured from |
| `jurisdiction` | string | `state:OR` | Provenance for the rule; becomes the audit partition key |

Two rules govern the whole contract. First, the three terminal actions are not interchangeable. `archive` moves a record to cold, access-controlled storage with identifiers intact; `pseudonymize` severs the link between the record and the patron while preserving aggregate analytical value; `hard_delete` is irreversible erasure. (The implementation in Step 3 issues a physical DELETE; a claim of cryptographic erasure additionally depends on destroying the keys that protect any backups or archives still holding those rows.) Choosing the wrong one is a compliance incident, so the action is declared in the manifest, never inferred at runtime. Second, provenance is preserved: the jurisdiction plus the policy version become the partition key for every audit entry, so any state transition can be traced back to the exact rule and release that caused it. The identifier-severing logic in `pseudonymize` shares its hashing discipline with PII masking in patron data exports, and the record-level anonymization it performs on checkout rows is the same operation described in Circulation History Routing & Anonymization.
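To make the contract concrete, the sketch below shows what a one-rule manifest might look like and how an unmapped record type can be routed to review instead of being kept silently. The field names follow the table above; the version string, the `route` and `audit_partition_key` helpers, and the `/` separator in the partition key are assumptions made for illustration, not part of the pipeline described here.

```python
import json

# Hypothetical manifest contents. Field names follow the contract table;
# the values and the version string are illustrative only.
MANIFEST_JSON = """
{
  "version": "2026.01",
  "effective_date": "2026-01-01T00:00:00+00:00",
  "policies": [
    {"record_type": "circulation_log", "retention_days": 1095,
     "action": "pseudonymize", "anchor_field": "checkout_date",
     "jurisdiction": "state:OR", "requires_referential_check": true}
  ]
}
"""


def route(record_type: str, manifest: dict) -> str:
    """Return the declared action, or 'review' when no rule covers the type."""
    for policy in manifest["policies"]:
        if policy["record_type"] == record_type:
            return policy["action"]
    # An unmapped type is a policy gap, never an implicit "keep forever".
    return "review"


def audit_partition_key(policy: dict, manifest: dict) -> str:
    """Jurisdiction plus policy version (the '/' separator is an assumption)."""
    return f'{policy["jurisdiction"]}/{manifest["version"]}'


manifest = json.loads(MANIFEST_JSON)
```

Here `route("hold_history", manifest)` falls through to `"review"` because no rule names that type, which is the behaviour the contract asks for.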

Prerequisites & Environment Setup

Retention jobs mutate or destroy production data, so the environment must be locked down before the first dry run. This pipeline targets Python 3.11+ (the code relies on X | None union syntax, available from 3.10, and 3.11 adds the datetime.UTC alias for timezone.utc) with pydantic>=2.5, sqlalchemy>=2.0, and a PostgreSQL backend with the pgcrypto extension enabled for in-database hashing.

Python 3.11 or newer available in an isolated virtualenv, dependencies pinned via uv or pip-tools.
pydantic>=2.5 and sqlalchemy>=2.0 installed; versions committed to a lockfile.
PostgreSQL reachable with a least-privilege role scoped to the retention tables only — no superuser.
CREATE EXTENSION IF NOT EXISTS pgcrypto; run once by a DBA so digest() is available.
RETENTION_DB_URL and PII_HMAC_SALT supplied as environment variables, never hardcoded — the salt comes from a secrets manager at runtime.
A read-only production snapshot available for dry-run evaluation in CI.
The retention manifest committed to version control and code-reviewed like any other infrastructure change.
A dead-letter destination (table or message topic) provisioned for records that fail validation.

The salt and connection string are read from the environment so the same image runs unchanged across staging and production. If either is absent the pipeline must refuse to start rather than fall back to a default.
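A minimal sketch of that refuse-to-start rule, using only the standard library. The two variable names come from the checklist above; the `require_env` helper and its error message are hypothetical.

```python
import os
from collections.abc import Mapping

REQUIRED_VARS = ("RETENTION_DB_URL", "PII_HMAC_SALT")


def require_env(environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the required settings, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        # No defaults: an absent secret must stop the job before it connects.
        raise RuntimeError(
            f"Refusing to start; missing environment variables: {', '.join(missing)}"
        )
    return {name: environ[name] for name in REQUIRED_VARS}
```

Calling `require_env()` once at process start, before any database session is opened, keeps the failure ahead of the first query. Passing a plain dict instead of `os.environ` makes the check easy to unit-test.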

Core Implementation

The engine has three responsibilities: load and validate the manifest, evaluate each record’s timestamp against its horizon, and dispatch the declared terminal action inside a transaction. Each is a discrete, testable unit.

Step 1 — Load and validate the manifest

The manifest is parsed into typed Pydantic models so a malformed rule fails loudly at load time, before any record is touched. A schema violation here aborts the run — it never degrades to “skip this policy.”

import json
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

RetentionAction = Literal["archive", "pseudonymize", "hard_delete"]


class RetentionPolicy(BaseModel):
    record_type: str
    retention_days: int = Field(gt=0)
    action: RetentionAction
    anchor_field: str
    jurisdiction: str
    requires_referential_check: bool = True


class RetentionManifest(BaseModel):
    version: str
    effective_date: datetime
    policies: list[RetentionPolicy]

    def policy_for(self, record_type: str) -> RetentionPolicy | None:
        return next((p for p in self.policies if p.record_type == record_type), None)


def load_manifest(path: Path) -> RetentionManifest:
    """Parse and validate a retention manifest; raises on any schema violation."""
    try:
        raw = json.loads(path.read_text(encoding="utf-8"))
        return RetentionManifest.model_validate(raw)
    except (json.JSONDecodeError, ValidationError) as exc:
        # A bad manifest must never fall through to a partial run.
        raise RuntimeError(f"Refusing to run: invalid retention manifest {path}: {exc}") from exc

Pitfall: retention_days is constrained with Field(gt=0). A zero or negative window would mark every record eligible for deletion on the next run — the validator turns that class of typo into a load-time error instead of a data-loss incident.

Step 2 — Evaluate the horizon

Horizon evaluation is a pure function of a record’s anchor timestamp, the policy, and the current time. Keeping it pure means it is trivially unit-testable with a frozen clock and carries no hidden state.

def is_past_horizon(
    policy: RetentionPolicy,
    anchor_timestamp: datetime,
    now: datetime | None = None,
) -> bool:
    """True when a record governed by `policy` has crossed its retention horizon."""
    current = now or datetime.now(timezone.utc)
    if anchor_timestamp.tzinfo is None:
        raise ValueError("anchor_timestamp must be timezone-aware to compare horizons")
    horizon = anchor_timestamp + timedelta(days=policy.retention_days)
    return current >= horizon

Pitfall: naive (timezone-unaware) datetimes are the classic source of off-by-a-day retention bugs. The function rejects them outright rather than silently assuming local time, which would shift horizons by the server’s UTC offset.
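Step 3 works with a cutoff timestamp, not with per-record horizon checks, so it may help to see how the two relate. The sketch below derives a cutoff from a pinned job-start time; `cutoff_for` and `eligible` are hypothetical helper names. The comparison is inclusive to match `current >= horizon` in `is_past_horizon`. The SQL predicates in Step 3 use a strict `<`, so a row whose anchor equals the cutoff exactly would be picked up on a later run; whichever convention a deployment chooses should be applied in both places.

```python
from datetime import datetime, timedelta, timezone


def cutoff_for(retention_days: int, job_start: datetime) -> datetime:
    """Latest anchor timestamp that is already past its horizon at job_start."""
    if job_start.tzinfo is None:
        raise ValueError("job_start must be timezone-aware")
    return job_start - timedelta(days=retention_days)


def eligible(anchor: datetime, cutoff: datetime) -> bool:
    """Inclusive comparison: anchor <= cutoff  <=>  job_start >= anchor + window."""
    return anchor <= cutoff


# Pin the reference time once per run and reuse it for every policy.
job_start = datetime(2026, 1, 31, tzinfo=timezone.utc)
cutoff = cutoff_for(30, job_start)
```

With a 30-day window and this job start, the cutoff is midnight UTC on 1 January 2026, the same boundary the inclusive horizon test in the verification section exercises.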

Step 3 — Dispatch the terminal action

The three actions are dispatched explicitly. pseudonymize and hard_delete are irreversible, so both run inside a transaction and both refuse to proceed unless the referential-integrity gate is clear. A dry_run flag lets the identical code path report what it would do without committing.

import logging
from dataclasses import dataclass

from sqlalchemy import text
from sqlalchemy.orm import Session

log = logging.getLogger("retention.engine")


@dataclass
class ActionResult:
    record_type: str
    action: str
    affected: int
    committed: bool


def referential_conflicts(session: Session, record_type: str, cutoff: datetime) -> int:
    """Count linked rows (holds, fines, ILL) that block mutation of eligible records."""
    row = session.execute(
        text(
            """
            SELECT count(*) FROM record_links
            WHERE parent_type = :rt
              AND created_at < :cutoff
              AND resolved = false
            """
        ),
        {"rt": record_type, "cutoff": cutoff},
    ).scalar_one()
    return int(row)


def apply_action(
    session: Session,
    policy: RetentionPolicy,
    cutoff: datetime,
    dry_run: bool = True,
) -> ActionResult:
    """Execute one policy's terminal action inside a transaction; rollback on any failure."""
    if policy.requires_referential_check:
        blocked = referential_conflicts(session, policy.record_type, cutoff)
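        if blocked:
            # ReferentialIntegrityError is defined under Error Handling & Quarantine
            # Patterns; it must exist in the module before apply_action is called.
            raise ReferentialIntegrityError(
                f"{blocked} unresolved links block {policy.action} on {policy.record_type}"
            )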

    statements = {
        "hard_delete": text(
            "DELETE FROM records WHERE record_type = :rt AND anchor_ts < :cutoff"
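        ),
        # patron_id IS NOT NULL keeps re-runs idempotent: without it a second pass
        # would recompute digest(NULL) and overwrite the stored hash with NULL.
        # digest() is an unkeyed SHA-256; pgcrypto's hmac() is the keyed option
        # if patron identifiers are guessable.
        "pseudonymize": text(
            """
            UPDATE records
            SET patron_hash = encode(digest(patron_id::text, 'sha256'), 'hex'),
                patron_id = NULL,
                retention_state = 'pseudonymized'
            WHERE record_type = :rt AND anchor_ts < :cutoff
              AND patron_id IS NOT NULL
            """
        ),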
        "archive": text(
            "UPDATE records SET retention_state = 'archived' "
            "WHERE record_type = :rt AND anchor_ts < :cutoff"
        ),
    }

    result = session.execute(statements[policy.action], {"rt": policy.record_type, "cutoff": cutoff})
    affected = result.rowcount

    if dry_run:
        session.rollback()
        log.info("dry-run: %s would affect %d rows", policy.action, affected)
        return ActionResult(policy.record_type, policy.action, affected, committed=False)

    session.commit()
    return ActionResult(policy.record_type, policy.action, affected, committed=True)

Declaring retention rules this way lets administrators version-control them alongside infrastructure-as-code, so a change to a state’s retention window is a reviewed pull request with an audit trail, not an untracked edit to a stored procedure.

PII & Compliance Checkpoints

Retention is where the most sensitive fields in the catalog are handled, so the compliance gates are non-negotiable. Any record leaving the retention boundary — for a compliance report, a reconciliation extract, or a downstream warehouse — must pass a deterministic masking gate first. Masking uses a keyed HMAC so that the same input always produces the same token (preserving join keys for audit lookups) while remaining irreversible. The full boundary model and salt-handling rules live in Data Privacy Boundaries in Library Systems; this pipeline enforces them at the egress edge.

import hashlib
import hmac
import os

from pydantic import BaseModel, model_validator


def load_salt() -> bytes:
    salt = os.environ.get("PII_HMAC_SALT")
    if not salt:
        raise RuntimeError("PII_HMAC_SALT is not set; refusing to emit patron data")
    return salt.encode("utf-8")


def deterministic_token(raw: str, salt: bytes) -> str:
    """Stable, irreversible token; HMAC-SHA256 keeps join-key parity across runs."""
    return hmac.new(salt, raw.encode("utf-8"), hashlib.sha256).hexdigest()[:24]


class RetentionExportRow(BaseModel):
    patron_token: str
    record_type: str
    retention_state: str

    @model_validator(mode="after")
    def reject_plaintext(self) -> "RetentionExportRow":
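        # Guard against a raw email or barcode leaking into the token slot.
        # A genuine token is exactly 24 lowercase hex characters; checking the
        # shape avoids rejecting the rare valid token that happens to be all digits.
        token = self.patron_token
        if len(token) != 24 or not set(token) <= set("0123456789abcdef"):
            raise ValueError("patron_token looks like raw PII; masking gate failed")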
        return self

Three checkpoints apply on every run. First, hard_delete must remove rows physically rather than set a logical flag: expired identifiable rows are erased, not merely hidden, and where policy calls for cryptographic erasure, the keys protecting backups of those rows must be destroyed too. Second, pseudonymize must sever the patron link in the same transaction that writes the hash, so a crash can never leave a half-anonymized row that is both identifiable and marked anonymous. Third, no export row leaves the boundary without passing the masking validator above; a row that fails it is quarantined, never emitted. These rules keep retention consistent with the identity-side controls in Threshold Tuning for Identity Validation, which governs the same patron records upstream.

Error Handling & Quarantine Patterns

The overriding principle is that a failure must never leave records in an inconsistent state, and it must never silently continue. Every mutation runs inside a transaction with explicit rollback, and every unresolvable record is routed to a dead-letter destination for human review rather than skipped.

class ReferentialIntegrityError(Exception):
    """Raised when linked rows block a retention action."""


class RetentionError(Exception):
    """Base class for recoverable retention failures routed to quarantine."""


def run_policy(session: Session, policy: RetentionPolicy, cutoff: datetime, dry_run: bool) -> ActionResult:
    try:
        return apply_action(session, policy, cutoff, dry_run=dry_run)
    except ReferentialIntegrityError as exc:
        session.rollback()
        quarantine(session, policy, reason="referential_conflict", detail=str(exc))
        log.warning("quarantined %s: %s", policy.record_type, exc)
        raise RetentionError(str(exc)) from exc
    except Exception:
        # Unknown failure: roll back and re-raise so the batch aborts loudly.
        session.rollback()
        log.exception("unhandled failure applying %s to %s", policy.action, policy.record_type)
        raise


def quarantine(session: Session, policy: RetentionPolicy, reason: str, detail: str) -> None:
    session.execute(
        text(
            "INSERT INTO retention_quarantine (record_type, reason, detail, seen_at) "
            "VALUES (:rt, :reason, :detail, now())"
        ),
        {"rt": policy.record_type, "reason": reason, "detail": detail},
    )
    session.commit()

Referential conflicts — an expired checkout that still has an unresolved fine, for instance — are recoverable: they route to a quarantine table with enough context for staff to clear the linked row and re-run. Unknown exceptions abort the batch, because a retention job that keeps going after an error it does not understand is more dangerous than one that stops. This is the same discipline used by the ingestion-side schema validation quarantine queue, applied at the deletion edge instead of the ingest edge.

Performance Considerations

At public library scale, circulation history is the dominant table and frequently reaches tens of millions of rows, so retention must be bounded and batched rather than issuing one unbounded DELETE or UPDATE. A single statement touching millions of rows takes a long lock, bloats the write-ahead log, and can stall the ILS during business hours.

def anonymize_in_batches(session: Session, cutoff: datetime, batch_size: int = 1000) -> int:
    """Bounded, resumable anonymization; each batch commits so a crash loses at most one batch."""
    total = 0
    while True:
        result = session.execute(
            text(
                """
                UPDATE circulation_logs
                SET patron_hash = encode(digest(patron_id::text, 'sha256'), 'hex'),
                    patron_id = NULL,
                    retention_state = 'anonymized'
                WHERE ctid IN (
                    SELECT ctid FROM circulation_logs
                    WHERE checkout_date < :cutoff AND retention_state = 'active'
                    LIMIT :batch_size
                )
                """
            ),
            {"cutoff": cutoff, "batch_size": batch_size},
        )
        session.commit()
        total += result.rowcount
        if result.rowcount == 0:
            return total

Three levers keep it healthy: a composite index on (retention_state, checkout_date) so the batch selector is an index scan rather than a sequential one; a per-batch commit so a mid-run failure loses at most one batch and the job resumes where it stopped; and scheduling the run in an off-peak maintenance window with connection pooling and exponential backoff so it never contends with live circulation traffic. Stream evaluation over a server-side cursor instead of materializing candidate rows into memory: the pipeline should hold one batch at a time, not the whole eligible set.
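The exponential backoff mentioned above could look something like the following sketch. It is a generic retry wrapper, not the pipeline's own code: `with_backoff`, its defaults, and the choice of exception types are all assumptions. In a SQLAlchemy deployment the retryable errors would more likely be driver-level ones such as `sqlalchemy.exc.OperationalError`.

```python
import random
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def with_backoff(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries so parallel workers do not retry in lockstep.
            sleep(random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")
```

Wrapping a single batch (one call to the batched UPDATE plus its commit) keeps the retry unit small, so a transient lock timeout costs one batch of rework and the job yields to live traffic while it waits.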

Verification & Testing

Because the failure mode is irreversible data loss, verification runs before any production mutation. The pipeline ships with a mandatory dry-run mode that evaluates the manifest against a production snapshot and reports affected counts without committing.

from datetime import datetime, timezone

import pytest
from pydantic import ValidationError


def test_zero_window_is_rejected():
    """A manifest with retention_days <= 0 must fail validation, not delete everything."""
    with pytest.raises(ValidationError):
        RetentionPolicy(
            record_type="circulation_log",
            retention_days=0,
            action="hard_delete",
            anchor_field="checkout_date",
            jurisdiction="state:OR",
        )


def test_horizon_boundary_is_inclusive():
    policy = RetentionPolicy(
        record_type="circulation_log",
        retention_days=30,
        action="pseudonymize",
        anchor_field="checkout_date",
        jurisdiction="state:OR",
    )
    anchor = datetime(2026, 1, 1, tzinfo=timezone.utc)
    exactly_due = datetime(2026, 1, 31, tzinfo=timezone.utc)
    assert is_past_horizon(policy, anchor, now=exactly_due) is True
    assert is_past_horizon(policy, anchor, now=datetime(2026, 1, 30, tzinfo=timezone.utc)) is False

Validate three things on every release: that horizon math is inclusive at the exact boundary and correct across daylight-saving transitions (freeze the clock in tests rather than calling now()); that a dry run’s reported counts match the committed counts of a subsequent real run against the same snapshot; and that the masking validator rejects a deliberately planted raw-PII row. Assert on the emitted audit log too — a retention run that mutates rows without a matching audit entry is a compliance failure even if the data change was correct.

Every action must produce an immutable, replayable audit trail. A structured JSON logger that records the policy version, the operator or service account, the affected count, and a hash of the pre-mutation state supports public-sector disposal requirements such as those in NIST SP 800-53 Rev. 5, whose MP-6 (Media Sanitization) and SI-12 (Information Management and Retention) controls call for verifiable sanitization and retention tracking. Ship those logs to a WORM-enabled aggregator so auditors can reconstruct exactly what happened without querying production directly.
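One possible shape for such an audit entry is sketched below. The field list mirrors the paragraph above; the function names, the key names in the JSON object, and the `/`-joined partition value are assumptions. The state hash is computed over a canonical JSON encoding, so the caller is expected to pass rows in a deterministic order.

```python
import hashlib
import json
from datetime import datetime, timezone


def state_hash(rows: list[dict]) -> str:
    """SHA-256 over a canonical JSON encoding of the rows about to be mutated."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_entry(
    *,
    policy_version: str,
    jurisdiction: str,
    record_type: str,
    action: str,
    operator: str,
    affected: int,
    pre_state_hash: str,
    job_start: datetime,
) -> str:
    """Serialize one audit record as a single JSON line (no patron content)."""
    entry = {
        "partition": f"{jurisdiction}/{policy_version}",  # separator is an assumption
        "record_type": record_type,
        "action": action,
        "operator": operator,
        "affected": affected,
        "pre_state_hash": pre_state_hash,
        "job_start": job_start.isoformat(),
    }
    return json.dumps(entry, sort_keys=True)


line = audit_entry(
    policy_version="2026.01",
    jurisdiction="state:OR",
    record_type="circulation_log",
    action="pseudonymize",
    operator="svc-retention",  # hypothetical service account name
    affected=2,
    pre_state_hash=state_hash([{"id": 1, "patron_id": "P-1"}, {"id": 2, "patron_id": "P-2"}]),
    job_start=datetime(2026, 1, 31, tzinfo=timezone.utc),
)
```

The entry carries only the hash of the pre-mutation state, never the rows themselves, so the log can outlive the data it describes.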

Troubleshooting & Frequently Asked Questions

Why did a retention run delete far more rows than the dry run predicted?

The two runs almost certainly used different reference times or different manifests. Horizon evaluation is time-dependent, so a dry run at 09:00 and a real run at 23:00 can classify additional records as eligible. Pin now to a single job-start timestamp passed through every policy evaluation, and assert that the manifest version logged by the dry run matches the one the real run loaded. If they differ, an uncommitted manifest edit slipped in between the two runs.
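A sketch of that check, assuming the dry run persists a small context record that the real run compares against before mutating anything. `RunContext`, `manifest_fingerprint`, and `assert_same_inputs` are hypothetical names. Reusing the dry run's job-start time for the real run is one option for making the counts directly comparable, not something the pipeline above prescribes.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RunContext:
    job_start: datetime      # pinned once, passed to every policy evaluation
    manifest_version: str
    manifest_sha256: str


def manifest_fingerprint(raw_manifest: dict) -> str:
    """Hash a canonical encoding so key order and whitespace do not matter."""
    canonical = json.dumps(raw_manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_same_inputs(dry: RunContext, real: RunContext) -> None:
    """Refuse to proceed when the real run's inputs differ from the dry run's."""
    if (dry.manifest_version, dry.manifest_sha256) != (
        real.manifest_version,
        real.manifest_sha256,
    ):
        raise RuntimeError("manifest changed between dry run and real run")
    if dry.job_start != real.job_start:
        raise RuntimeError("reference time differs between dry run and real run")
```

Comparing the content hash as well as the version string catches the case where someone edits the manifest without bumping its version.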

The overnight job aborted partway with a `ReferentialIntegrityError`. How do I recover?

That is the safety mechanism working. The blocked policy is recorded in retention_quarantine with a reason of referential_conflict and a detail message giving the count of unresolved links. Resolve the linked rows (settle the fine, close the hold, complete the ILL return), then re-run. The job is idempotent, so already-processed policies are no-ops and only the previously blocked policy does real work. Never bypass the referential check to force the deletion through.

Records show `retention_state = 'anonymized'` but `patron_id` is still populated. What happened?

A crash occurred between the two writes of a non-transactional anonymization. This is exactly the inconsistency the single-statement UPDATE in Step 3 prevents: the hash write and the patron_id = NULL write happen in one statement inside one transaction, so they commit together or not at all. Audit for any row where retention_state is a terminal value but an identifier column is non-null, re-run the anonymization over those rows, and confirm your implementation has not split the update into separate statements with a commit between them.
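The audit described above reduces to one query. The sketch below runs it against an in-memory SQLite database purely so the example is self-contained; the production target is PostgreSQL, and the `id` column, the sample rows, and the helper name are assumptions. Both terminal state names that appear in this guide (`pseudonymized` and `anonymized`) are checked.

```python
import sqlite3

# Rows that claim a terminal state but still carry an identifier.
AUDIT_SQL = """
SELECT id FROM records
WHERE retention_state IN ('pseudonymized', 'anonymized')
  AND patron_id IS NOT NULL
ORDER BY id
"""


def find_half_anonymized(conn: sqlite3.Connection) -> list[int]:
    """Return ids of rows that are marked anonymous yet remain identifiable."""
    return [row[0] for row in conn.execute(AUDIT_SQL)]


# SQLite stands in for PostgreSQL here; the schema is a simplified stand-in.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, patron_id TEXT, patron_hash TEXT, retention_state TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        (1, None, "ab12", "pseudonymized"),   # consistent
        (2, "P-1001", "cd34", "anonymized"),  # inconsistent: still identifiable
        (3, "P-1002", None, "active"),        # not yet due, ignored
    ],
)
conn.commit()
```

Only row 2 is returned by `find_half_anonymized(conn)`. A non-empty result from the equivalent query on a production snapshot is worth treating as a release blocker.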

An auditor asked me to prove a specific patron’s history was purged on a given date. I can’t find it in the database — is that a problem?

No — that absence is the point of hard_delete, but you prove it from the audit trail, not the live data. The structured audit log retains the policy version, the affected count, the operator, and the pre-mutation state hash for the run that removed those rows, without retaining the patron-identifiable content itself. Query the audit sink by jurisdiction and date to produce the disposal certificate.

Can I run retention against the live ILS database during opening hours?

Avoid it. Unbounded statements take long locks and can stall circulation. Use the batched, commit-per-batch pattern from Performance Considerations, schedule the job in an off-peak window, and cap concurrency with a pooled least-privilege connection and exponential backoff so it yields to live traffic.

Patron Validation & Privacy Data Routing — the parent architecture this retention layer operates within.
PII Masking in Patron Data Exports — the deterministic masking gate every retention export passes through.
Circulation History Routing & Anonymization — the record-level anonymization the pseudonymize action performs on checkout history.
Threshold Tuning for Identity Validation — the upstream identity controls on the same patron records.
Data Privacy Boundaries in Library Systems — the boundary and salt-handling model the compliance checkpoints enforce.