Testing Strategy¶

Status: Draft v1. Audience: Paxman contributors and maintainers. Related docs: ARCHITECTURE.md §10 Testing Architecture, PACKAGE_STRUCTURE.md §19 Test Layout, REPLAY_AND_DETERMINISM.md

This document is the test strategy for Paxman. It defines test seams, test types, fixtures, and the property-based testing approach that underwrites Paxman's determinism and replay claims.

1. Test Layers¶

Paxman uses a 4-layer test pyramid:

              ┌─────────────────────┐
              │   End-to-end (E2E)  │   Real contracts, real inputs
              │   ~10 fixtures      │   Manual + slow CI
              └──────────┬──────────┘
                         │
              ┌──────────┴──────────┐
              │  Integration tests  │   Cross-subsystem, mocked I/O
              │  ~50 cases          │   CI
              └──────────┬──────────┘
                         │
              ┌──────────┴──────────┐
              │    Unit tests       │   One subsystem at a time
              │  ~500 cases         │   CI
              └──────────┬──────────┘
                         │
              ┌──────────┴──────────┐
              │  Property tests     │   Hypothesis-driven
              │  ~50 strategies     │   CI
              └─────────────────────┘

1.1 When to use each¶

Property tests — for determinism, replay equivalence, and structural invariants. Run on every PR.
Unit tests — for specific behaviors and edge cases. Run on every PR.
Integration tests — for cross-subsystem behavior with mocked capabilities. Run on every PR.
End-to-end tests — for full-pipeline behavior with real (but pre-baked) inputs. Run on every PR but allow slow tests separately.

2. Test Seams¶

The package layout is designed around test seams — boundaries at which a subsystem can be tested in isolation by injecting a fake. See PACKAGE_STRUCTURE.md §2 Module DAG for the dependency rules.

Subsystem	Seam	What you inject	What you assert
`contract/`	`adapt(external) → CanonicalContract` and `export(canonical) → external`	External contract fixture	`CanonicalContract` shape, type, validation errors
`planner/`	`plan(canonical, profile, budget, policy, registry) → ExecutionPlan`	Fake `CapabilityRegistry`, fixed `InputProfile`	`ExecutionPlan` shape, plan order, heuristic selection
`capabilities/`	`capability.invoke(ctx) → CapabilityResult`	Provider mock	`CapabilityResult` shape, no `confidence` field
`executor/`	`run(plan, contract, registry, input) → CandidateResult[]`	Mocked capabilities	Invocation order, early stop, budget short-circuit
`reconciler/`	`reconcile(candidates, contract) → ResolvedResult[]`	Crafted candidates	Merging, conflict detection, confidence assignment
`artifact/`	`build(resolved, plan, evidence) → ExecutionArtifact` and `replay(artifact, contract) → ExecutionArtifact`	Fixtures	`replay_hash` determinism, byte-equal rehydration, tamper detection
`api/`	`paxman.normalize(...)` and `paxman.replay(...)`	Subsystem fakes	Public surface stability, end-to-end flow

3. Property Tests (Hypothesis)¶

Property tests are the backbone of Paxman's determinism claims. They generate random inputs and assert invariants.

3.1 Example property tests¶

from hypothesis import given, strategies as st
from paxman import planner
from paxman.testing import contracts, inputs, budgets, policies, registries


@given(
    contract=contracts(),
    input_data=inputs(),
    budget=budgets(),
    policy=policies(),
    registry=registries(),
)
def test_planner_is_deterministic(contract, input_data, budget, policy, registry):
    """Same inputs → same plan."""
    profile = make_profile(input_data)
    plan_a = planner.plan(contract, profile, budget, policy, registry)
    plan_b = planner.plan(contract, profile, budget, policy, registry)
    assert serialize(plan_a) == serialize(plan_b)


@given(artifact=artifacts(), contract=contracts())
def test_replay_is_byte_equal(artifact, contract):
    """Replay reproduces the artifact byte-for-byte."""
    rehydrated = paxman.replay(artifact, contract)
    assert serialize(rehydrated) == serialize(artifact)


@given(contract=contracts(), candidates=candidate_sets())
def test_reconciler_is_monotonic(contract, candidates):
    """Strictly better evidence never lowers confidence."""
    resolved_a = reconciler.reconcile(candidates, contract)
    better_candidates = improve_evidence(candidates)
    resolved_b = reconciler.reconcile(better_candidates, contract)
    for field in resolved_a:
        assert resolved_b[field].confidence >= resolved_a[field].confidence


@given(artifact=artifacts())
def test_hash_detects_modification(artifact):
    """Any modification changes the hash."""
    original_hash = artifact.replay_hash
    modified = dataclasses.replace(artifact, statistics={"invocations": 0})
    assert modified.replay_hash != original_hash

3.2 Custom strategies¶

paxman.testing provides Hypothesis strategies for:

contracts() — generates CanonicalContracts of various shapes.
inputs() — generates raw inputs (text, JSON, CSV, HTML, PDF bytes).
budgets() — generates Budget instances.
policies() — generates Policy instances.
registries() — generates CapabilityRegistry instances with varied capability sets.
candidate_sets() — generates CandidateResult[] for the Reconciler.
artifacts() — generates ExecutionArtifacts.

These strategies are public in the paxman.testing module so downstream tests can use them.

3.3 Shrinking¶

Hypothesis will shrink failing inputs to a minimal counterexample. Paxman's deterministic encoding makes shrinking straightforward.

3.4 Health checks¶

Hypothesis health checks are enabled. Deadline(100ms) is used to catch accidental non-determinism (e.g., a flaky network call).

4. Determinism Tests¶

For every subsystem, two tests:

Property test — same inputs → same outputs.
Replay test — given a fixture artifact, rehydration produces the same JSON hash.

def test_replay_reproduces_hash():
    """The replay of a stored artifact reproduces its hash."""
    fixture = load_fixture("invoices/simple.json")
    rehydrated = paxman.replay(fixture, fixture.contract)
    assert rehydrated.replay_hash == fixture.replay_hash


def test_artifact_hash_is_stable_across_runs():
    """The artifact hash is stable across separate Python invocations."""
    artifact_a = normalize_in_subprocess("invoices/simple.json")
    artifact_b = normalize_in_subprocess("invoices/simple.json")
    assert artifact_a.replay_hash == artifact_b.replay_hash

5. Replay Tests¶

def test_replay_succeeds_on_valid_artifact():
    artifact = paxman.normalize(input_data, contract, budget, policy)
    rehydrated = paxman.replay(artifact, contract)
    assert serialize(rehydrated) == serialize(artifact)


def test_replay_detects_tampering():
    artifact = paxman.normalize(input_data, contract, budget, policy)
    tampered = dataclasses.replace(artifact, statistics={"invocations": 0})
    with pytest.raises(paxman.HashMismatchError):
        paxman.replay(tampered, contract)


def test_replay_rejects_unsupported_paxman_version():
    artifact = paxman.normalize(input_data, contract, budget, policy)
    with mock.patch("paxman.__version__", "99.0.0"):
        with pytest.raises(paxman.VersionMismatchError):
            paxman.replay(artifact, contract)


def test_replay_rejects_unsupported_capability_version():
    artifact = paxman.normalize(input_data, contract, budget, policy)
    with mock.patch.dict(registry.capabilities, {}, clear=True):
        with pytest.raises(paxman.CapabilityNotFoundError):
            paxman.replay(artifact, contract)

6. Contract Adapter Tests¶

For each adapter:

Unit tests — representative contracts → canonical form.
Golden tests — frozen canonical snapshots.
Roundtrip tests — canonical → external → canonical is identity (or as close as the format allows).
Validation tests — every error path.

def test_pydantic_adapter_handles_required_field():
    class Invoice(paxman.BaseModel):
        total: float
        currency: str
    canonical = paxman.contract.adapt(Invoice)
    assert canonical.fields["total"].required is True
    assert canonical.fields["total"].type == FieldType.DECIMAL


def test_pydantic_adapter_roundtrip_preserves_structure():
    class Invoice(paxman.BaseModel):
        total: float
    canonical = paxman.contract.adapt(Invoice)
    exported = paxman.contract.export(canonical, format="pydantic")
    assert exported is not None

7. Capability Tests¶

For each capability:

Known-input tests — fixture input → expected candidates.
Determinism tests for deterministic capabilities (regex, validation).
Provider-mock tests for non-deterministic capabilities (text_extraction, inference).
Spec tests — CapabilitySpec exposes the right metadata.

def test_regex_capability_extracts_email():
    cap = RegexExtraction(pattern=r"[\w.]+@[\w.]+")
    result = cap.invoke(CapabilityContext(input="Email me at jane@acme.com"))
    assert "jane@acme.com" in [c.value for c in result.candidates]


def test_capability_result_has_no_confidence():
    """Capabilities never assign confidence."""
    cap = RegexExtraction(pattern=r"\d+")
    result = cap.invoke(CapabilityContext(input="42"))
    assert not hasattr(result, "confidence")
    for c in result.candidates:
        assert not hasattr(c, "confidence")

8. End-to-End Fixtures¶

Curated (input, contract, expected_artifact) fixtures that exercise every capability and adapter. The fixtures are checked into the repo under tests/fixtures/.

The full directory layout, dataset catalog, and licensing policy are in docs/TEST_DATA.md. The 5-layer model — synthetic edge cases → programmatic → curated → open-dataset → real production — is described there. Briefly:

tests/fixtures/
├── contracts/         # LAYER 3: curated contracts (Pydantic, JSON Schema, Dict DSL, OpenAPI)
├── inputs/            # LAYER 4: open-dataset samples (CORD, InvoiceBenchmark, alamgirqazi, ...)
│   ├── invoices/
│   ├── receipts/
│   ├── quotations/
│   ├── procurement/
│   ├── multi_page/
│   └── adversarial/   # LAYER 1: edge cases
├── artifacts/         # LAYER 3: golden ExecutionArtifact JSON
└── generated/         # LAYER 2: programmatic fixtures (gitignored)

The vendored datasets are MIT / Apache-2.0 / CC-BY-4.0 / CC0 only. License-restricted datasets (SROIE, FATURA, INV-CDIP) are research-only and are not vendored — they are downloaded by individual developers via python scripts/fetch_test_data.py --dev-only <name>. See tests/fixtures/DATASET_LICENSES.md for the full attribution catalog.

Each E2E fixture:

Defines an input (under inputs/) and a contract (under contracts/).
Runs paxman.normalize(input, contract).
Asserts the artifact's status, normalized_data, evidence, and replay_hash match a golden file (under artifacts/).
Replays the artifact and asserts byte-equal rehydration.

Fixtures are run in CI on every PR.

9. Coverage Targets¶

Subsystem	Target line coverage	Notes
`contract/`	≥ 90%	Adapter + validator
`planner/`	≥ 90%	All heuristics
`capabilities/`	≥ 85%	All V1 capabilities
`executor/`	≥ 90%	All control flow paths
`reconciler/`	≥ 90%	All merging strategies + MONEY
`artifact/`	≥ 95%	Replay is critical
`errors.py`	100%	Every error path
`versioning.py`	100%	Every version check
`api/`	≥ 90%	Public surface
Overall	≥ 90%	Enforced by `--cov-fail-under=90`

Coverage is reported on every PR via pytest-cov.

10. Mocking Strategy¶

10.1 What to mock¶

Network calls — never real. Use httpx mocks or similar.
Inference providers — always mocked in unit tests; real providers only in E2E with a flag.
Filesystem — only in tests that need it. Use tmp_path.
Clock — always injected. Use a FakeClock fixture.

10.2 What NOT to mock¶

Internal subsystems — use real implementations across the boundary. Test internal contracts by passing crafted inputs.
The deterministic JSON encoder — never mock serialization; it is part of the contract.

10.3 Mock framework¶

pytest-mock is the default. unittest.mock is acceptable for stdlib-only contexts.

11. CI Configuration¶

CI runs on every PR and on every push to main.

11.1 Required checks¶

ruff check — lint
ruff format --check — format
mypy --strict src/paxman — type-check public surface
pyright — type-check (cross-validation)
import-linter — module DAG
pytest with --cov-fail-under=90 — tests with coverage
interrogate — docstring coverage on public surface
bandit — security lint
pip-audit — dependency vulnerabilities

11.2 Optional checks (run nightly)¶

Long-running property tests (Hypothesis(derandomize=True, max_examples=10000))
Cross-Python-version tests (3.11, 3.12, 3.13)
Cross-OS tests (ubuntu, macOS, windows)
Performance benchmarks (pytest-benchmark)
Mutation testing (mutmut)

12. Test Seams in Practice¶

12.1 Testing the planner¶

def test_planner_prefers_deterministic_capabilities():
    contract = make_contract(Invoice)
    profile = InputProfile(input_type="text/plain", size=1024)
    registry = make_registry([
        DeterministicCapability("regex_extraction"),
        NonDeterministicCapability("inference"),
    ])
    plan = planner.plan(contract, profile, Budget(), Policy(), registry)
    first_step = plan.field_plans[0].capability_chain[0]
    assert first_step.capability_id == "regex_extraction"

12.2 Testing the executor¶

def test_executor_short_circuits_on_budget_exhaustion():
    plan = make_plan_with_expensive_capability()
    budget = Budget(max_total_cost_usd=0.001)
    artifact = executor.run(plan, contract, registry, input_data, budget, policy)
    assert artifact.status in (Status.PARTIAL_SUCCESS, Status.UNRESOLVED)
    assert any(d.code == "BUDGET_EXCEEDED" for d in artifact.diagnostics)

12.3 Testing the reconciler¶

def test_reconciler_picks_higher_evidence_candidate():
    candidates = [
        Candidate(value="ACME", evidence=[weak_evidence]),
        Candidate(value="ACME Corp", evidence=[strong_evidence]),
    ]
    contract = make_contract(Invoice, supplier_name_required=True)
    resolved = reconciler.reconcile(candidates, contract)
    assert resolved["supplier_name"].value == "ACME Corp"

13. Mutation Testing (V2)¶

V2 will use mutmut or cosmic-ray to verify that the test suite catches introduced mutations. A mutation score ≥ 80% is the target.

14. See also¶

docs/TEST_DATA.md — the 5-layer test data model, dataset catalog, and licensing policy
tests/fixtures/DATASET_LICENSES.md — full attribution for every vendored dataset
scripts/fetch_test_data.py — the download script
ARCHITECTURE.md §10 Testing Architecture
PACKAGE_STRUCTURE.md §19 Test Layout
REPLAY_AND_DETERMINISM.md — replay and determinism property tests
DEVELOPMENT.md — running tests locally