Test Data Strategy

Status: Draft v1.
Audience: Paxman contributors, maintainers, and downstream integrators.
Related docs: TESTING_STRATEGY.md, DEVELOPMENT.md, EXTENDING.md, SECURITY.md, DEPENDENCIES.md

This document is the single source of truth for test data in Paxman. It defines:

  1. The 5-layer test data model (synthetic → real).
  2. The V1 dataset catalog — every dataset Paxman uses, with license, version, and intended use.
  3. The licensing policy — what's allowed in a commercially-distributed library.
  4. The vendor procedure — how to add or update a vendored dataset.
  5. The attribution and provenance — how to credit every dataset.
  6. The security and PII policy for test data.

Why this doc exists: Paxman normalizes data. The test suite must cover the full range of inputs, contracts, and edge cases. Test data shapes the contract canonical model, the planner heuristics, the reconciler logic, and the replay assertions. Designing the test data alongside the architecture (rather than retrofitting it later) is the cheaper path.


1. The 5-Layer Test Data Model

Paxman organizes test data in five layers, from synthetic to real:

Layer 5 — Real production data       (caller-supplied, NEVER committed)
Layer 4 — Public open datasets       (curated subsets, vendored in repo)
Layer 3 — Curated fixtures           (hand-picked, golden artifacts committed)
Layer 2 — Programmatic fixtures      (generated by factory_boy / faker / hypothesis)
Layer 1 — Synthetic edge cases       (purpose-built: malformed, adversarial, regression)

1.1 What each layer is for

| Layer | Source | Committed? | Size | Purpose | Authored by |
|---|---|---|---|---|---|
| 1. Synthetic edge cases | Hand-written or generated in-test | Yes (the small set) | Tiny (<1 MB) | Regression tests for known bugs, adversarial inputs, malformed data | Paxman team |
| 2. Programmatic fixtures | factory_boy + faker + hypothesis | No (gitignored) | Generated per test run | Property tests, fuzzing, large variation coverage | Test code |
| 3. Curated fixtures | Hand-picked from layer 4 | Yes | ~10–50 MB | Integration tests, replay golden artifacts | Paxman team |
| 4. Public open datasets | Vendored from external open sources | Yes (subset only) | ~50 MB for V1 | End-to-end smoke tests, real-world OCR noise, multi-currency | External + curated |
| 5. Real production data | Caller's actual data | NEVER | n/a | Production validation | Caller |

1.2 The mapping to test layers

| Test type (from TESTING_STRATEGY.md §1) | Data layer used |
|---|---|
| Property tests | Layer 2 (programmatic) |
| Unit tests | Layer 1 (edge cases) + small Layer 3 fixtures |
| Integration tests | Layer 3 (curated) + Layer 2 (generated) |
| End-to-end tests | Layer 4 (open datasets) + Layer 3 (golden artifacts) |
| Production validation | Layer 5 (real data), never committed |

2. V1 Dataset Catalog

This is the authoritative list of datasets Paxman uses in V1. Every entry has been verified for license compatibility with a commercially-distributed library.

2.1 Receipt and invoice documents (end-to-end, replay, evidence, MONEY)

| Dataset | License | Size | What it gives Paxman | Layer | V1 use |
|---|---|---|---|---|---|
| clovaai/cord | CC-BY-4.0 | 1,000 receipts (1,000 OCR'd) | Receipt parsing with bbox + ground truth | 4 | Required |
| naver-clova-ix/cord-v1 | CC-BY-4.0 | 1,000 receipts | CORD v1 mirror on HuggingFace | 4 | Required (cleaner mirror) |
| naver-clova-ix/cord-v2 | CC-BY-4.0 | 1,000 receipts | CORD v2 with sub_group_id | 4 | Recommended |
| ICDAR 2019 SROIE | Research-only (per ICDAR-2019) | 1,000 receipts | Real scanned receipts, key info extraction | 4 | Required for OCR noise (development only; see §4.2) |
| jngb-labs/InvoiceBenchmark | MIT | 200 synthetic + 200 PDF + 200 PNG | Cent-perfect ground truth, multi-currency, deterministic | 4 | Required for V1 MONEY tests |
| alamgirqazi/invoice-ocr-synthetic | Apache-2.0 | 3,000 invoices | Full field-level labels, line items, multiple currencies | 4 | Required |
| AmineTibari/InvoiceJSON | CC-BY-4.0 | 340 invoices | Image + JSON structured annotations | 4 | Recommended |
| wizenheimer/invoices_receipts_ocr_v1 | MIT | Subset of invoices-and-receipts_ocr_v1 | OCR'd invoices with structured fields | 4 | Recommended |
| kaydee/wildreceipt | Apache-2.0 | ~1,500 wild receipts | "In-the-wild" receipts (harder than CORD/SROIE) | 4 | Recommended |
| mathieu1256/FATURA | CC-BY-NC-4.0 | 10,000 invoices, 50 layouts | Multi-layout invoices with bounding boxes | 4 | V2 only (NC license) |
| salesforce/inv-cdip | CC-BY-NC-4.0 | 350 labeled + 200k unlabeled | Real invoices with 7 fields | 4 | V2 only (NC license) |
| Voxel51/high-quality-invoice-images-for-ocr | Research | 8,181 synthetic (1,489 labeled) | Synthetic invoices, tabular layouts | 4 | Optional V2 |

V1 corpus plan (committing to repo):

| Source | Files vendored | Estimated size |
|---|---|---|
| CORD test split (CC-BY-4.0) | 100 receipts | ~5 MB |
| InvoiceBenchmark (MIT) | 200 invoices (full) | ~10 MB |
| alamgirqazi (Apache-2.0) | 500 invoices | ~15 MB |
| wildreceipt (Apache-2.0) | 200 receipts | ~5 MB |
| Total V1 | ~1,000 documents | ~35 MB |

SROIE is not vendored in the repo (research-only license) but is referenced in the test suite via a documented download path. See §4.2.

2.2 Contracts (Pydantic, JSON Schema, OpenAPI, Dict DSL)

| Source | License | What it gives Paxman | V1 use |
|---|---|---|---|
| JSON-Schema-Test-Suite | BSD-3-Clause / Apache-2.0 (per file) | Every JSON Schema feature tested across drafts | Required for the JSON Schema adapter |
| OpenAPI Petstore v3.0 | MIT | Canonical OpenAPI 3.0 example | Required for the OpenAPI adapter smoke test |
| OpenAPI Petstore v3.1 | MIT | OpenAPI 3.1 with $ref, oneOf, etc. | Required |
| openapi-json-schema-tools test specs | Apache-2.0 | 3.0/3.1 unit test specs | Recommended |
| OQO (APH123614/oqo) | MIT (code) / CC-BY-4.0 (data) | Open-Quote-Object JSON Schema, 72 anonymized quotes | Highly recommended for the quotation use case |
| OpenAPI-Specification examples | Apache-2.0 | ~20 official example specs | Required; covers all common OAS features |

2.3 Procurement data (use case C)

| Source | License | Size | V1 use |
|---|---|---|---|
| TED (Tenders Electronic Daily) | Commission Decision 2011/833/EU | 200k+ notices/yr | Required for the procurement use case |
| atlasprzetargow/polish-tenders-dataset | CC-BY-4.0 (data) / MIT (code) | 1.4M Polish + EU notices | Recommended |
| OpenMLDatasets/ted_2025_07_sample | CC0 (sample) / commercial (full) | 100 sample / 260k+ full | Required (sample); easy entry point |
| Urwashanza/europrocure-10-public-procurement | (research) | 1.5M records, 2016–2025 | V2 only (research license) |
| GTI Global Public Procurement Dataset (GPPD) | CC BY-NC 3.0 | 42 countries, 2006–2021 | V2 only (NC license) |
| Valan procurement sample | MIT (code) / free for eval (data) | 1,000-row sample | Optional V2 |

2.4 Multi-page and long-form documents

| Source | License | What it gives |
|---|---|---|
| lmms-lab/MP-DocVQA | MIT | Multi-page VQA, 5,000+ pages |
| theatticusproject/cuad | CC-BY-4.0 | Contract Understanding Atticus Dataset: 510 contracts, 13k+ annotations |
| AgamiAI/Indian-Bank-Statements (via Thoughtworks benchmark) | Apache-2.0 | Bank statements, multi-page tables |

2.5 Synthetic data generators (programmatic, in-test)

These are libraries, not datasets. They generate Layer 2 fixtures at test time.

| Library | License | Purpose | V1 use |
|---|---|---|---|
| Faker | MIT | Names, addresses, emails, currencies, dates | Required (core generator) |
| factory_boy | MIT | Build Pydantic/dataclass factories using Faker | Required for V1 test code |
| pydantic-factories | MIT | Auto-derive factory from Pydantic model | Recommended for Pydantic-heavy tests |
| Hypothesis | MPL-2.0 | Property-based test data generation | Required (already in dev deps) |
| hypothesis-jsonschema | MPL-2.0 | Generate instances from JSON Schema | Required for contract/ adapter tests |
| dirty-equals | MIT | Fuzzy comparison for tests | Required for replay-equality tests |

3. Directory layout

The full V1 test data directory tree, mirroring the 5-layer model:

tests/fixtures/
├── README.md                          # quick orientation
├── DATASET_LICENSES.md                # attribution for every vendored file
├── contracts/                         # LAYER 3: curated contracts
│   ├── pydantic/
│   │   ├── invoice.py
│   │   ├── quotation.py
│   │   ├── procurement.py
│   │   ├── receipt.py
│   │   ├── multi_page.py
│   │   └── edge_cases/
│   │       ├── empty_model.py
│   │       ├── deeply_nested.py
│   │       ├── with_money.py          # for MONEY type tests
│   │       ├── all_v1_types.py        # exercises every field type
│   │       └── invalid_*.py           # validator tests
│   ├── json_schema/
│   │   ├── invoice.json
│   │   ├── quotation.json
│   │   ├── procurement.json
│   │   ├── receipt.json
│   │   ├── multi_page.json
│   │   └── drafts/                    # covers all JSON Schema drafts
│   │       ├── draft-04.json
│   │       ├── draft-06.json
│   │       ├── draft-07.json
│   │       ├── draft-2019-09.json
│   │       └── draft-2020-12.json
│   ├── dict_dsl/
│   │   ├── invoice.py
│   │   ├── quotation.py
│   │   ├── procurement.py
│   │   └── edge_cases/
│   └── openapi/
│       ├── petstore_3_0.yaml
│       ├── petstore_3_1.yaml
│       ├── procurement_api.yaml
│       └── edge_cases/
├── inputs/                            # LAYER 4: open dataset samples
│   ├── invoices/
│   │   ├── synthetic/                 # small synthetic inputs for smoke tests
│   │   │   ├── invoice_plain.txt
│   │   │   ├── invoice_email.txt
│   │   │   └── invoice_csv.csv
│   │   ├── cord/                      # vendored CORD samples (CC-BY-4.0)
│   │   │   ├── cord_sample_001.png
│   │   │   ├── cord_sample_001.json
│   │   │   └── ...                    # ~100 files
│   │   ├── invoicebench/              # vendored InvoiceBenchmark (MIT)
│   │   │   ├── invoicebench_001.md
│   │   │   ├── invoicebench_001.pdf
│   │   │   ├── invoicebench_001.png
│   │   │   └── ...                    # 200 files
│   │   └── alamgirqazi/               # vendored (Apache-2.0)
│   │       └── ...                    # ~500 files
│   ├── receipts/
│   │   ├── wildreceipt/               # vendored (Apache-2.0)
│   │   └── synthetic/
│   ├── quotations/
│   │   ├── synthetic/
│   │   │   ├── quotation_simple.txt
│   │   │   └── quotation_with_footnotes.txt
│   │   └── oqo/                       # vendored OQO (CC-BY-4.0)
│   │       └── ...                    # 72 files
│   ├── procurement/
│   │   ├── ted_sample/                # vendored TED (Commission Decision)
│   │   ├── polish_tenders/
│   │   └── synthetic/
│   ├── multi_page/
│   │   ├── mp_docvqa/                 # vendored (MIT)
│   │   └── cuad/                      # vendored (CC-BY-4.0)
│   └── adversarial/                   # LAYER 1: edge cases
│       ├── empty_input.txt
│       ├── unicode_only.txt
│       ├── extremely_large.txt        # 10MB
│       ├── truncated_pdf.bin
│       ├── mismatched_currency.txt
│       └── prompt_injection.txt       # for inference security tests
├── artifacts/                         # LAYER 3: golden artifacts
│   ├── README.md                      # placeholder for golden artifacts
│   ├── invoice_success.json
│   ├── invoice_partial.json
│   ├── invoice_unresolved.json
│   ├── quotation_success.json
│   ├── procurement_success.json
│   └── invalid_contract.json
└── generated/                         # LAYER 2: programmatic (gitignored)
    ├── .gitignore
    └── README.md

Total size estimate: ~50 MB of vendored data, all open-licensed.


4. Licensing Policy

4.1 Allowed licenses for vendored data

Paxman is intended to be a commercially-distributed library. Vendored test data must be compatible with that.

| License | OK to vendor? | Notes |
|---|---|---|
| MIT | ✅ Yes | No restrictions |
| Apache-2.0 | ✅ Yes | Include license + notice in DATASET_LICENSES.md |
| BSD-2 / BSD-3 | ✅ Yes | Include license |
| CC0 | ✅ Yes | No attribution required, but good practice to credit |
| CC-BY-4.0 | ✅ Yes | Attribution required; must be documented in DATASET_LICENSES.md |
| CC-BY-3.0 | ✅ Yes | Attribution required |
| CC-BY-SA-4.0 | ⚠️ Conditional | Share-alike applies to derivatives. Avoid unless we accept the SA clause. |
| CC-BY-NC-4.0 | ❌ Not for V1 vendoring | Non-commercial only. May be used in development by individual developers but not vendored in the repo. |
| CC-BY-NC-SA-4.0 | ❌ No | Non-commercial + share-alike |
| Research-only | ❌ No | Cannot redistribute. Use individual dev accounts for development. |
| Commercial | ❌ No | Costs money; budget approval required. |
| Unknown | ❌ No | Do not use. |

Policy: V1 vendors only MIT, Apache-2.0, BSD, CC0, and CC-BY (non-SA) datasets. See DATASET_LICENSES.md for the full list.
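The policy above can be expressed as a small deny-by-default lookup. The following is a minimal sketch, not the project's implementation: the `Verdict` enum, the `vendoring_verdict` function, and the exact license spellings used as keys are assumptions made for illustration.

```python
from enum import Enum


class Verdict(Enum):
    ALLOWED = "allowed"
    CONDITIONAL = "conditional"
    DENIED = "denied"


# License labels taken from the table in §4.1 (spellings are an assumption;
# a real catalog would likely standardize on SPDX identifiers).
ALLOWED = {"MIT", "Apache-2.0", "BSD-2", "BSD-3", "CC0", "CC-BY-4.0", "CC-BY-3.0"}
CONDITIONAL = {"CC-BY-SA-4.0"}


def vendoring_verdict(license_id: str) -> Verdict:
    """Classify a license for vendoring.

    Anything not explicitly listed is denied, which covers the NC,
    research-only, commercial, and unknown rows without naming them.
    """
    normalized = license_id.strip()
    if normalized in ALLOWED:
        return Verdict.ALLOWED
    if normalized in CONDITIONAL:
        return Verdict.CONDITIONAL
    return Verdict.DENIED
```

Denying by default means a new or misspelled license label fails closed instead of slipping through.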

4.2 Research-only datasets (development only)

Some important datasets (SROIE, INV-CDIP, MP-DocVQA subsets) are released under "research-only" terms. These cannot be vendored but can be used by individual developers for local development. The CI pipeline must not download or distribute them.

The strategy:

  • The team documents the download path in scripts/fetch_test_data.py (see §6).
  • CI runs only against vendored data.
  • A developer who wants to run with SROIE can run python scripts/fetch_test_data.py --dataset sroie locally.
  • CI does not exercise the SROIE code path; it's only for local exploration.

4.3 PII and sensitive data

Paxman normalizes potentially-sensitive data (invoices, receipts, procurement). Test data must not contain real PII.

| Source | PII status | Action |
|---|---|---|
| CORD | Already-anonymized Indonesian receipts | ✅ OK to vendor |
| SROIE | Desensitized by ICDAR (PII blurred) | ✅ OK (research-only license is the constraint, not PII) |
| InvoiceBenchmark | Synthetic | ✅ OK |
| alamgirqazi | Synthetic | ✅ OK |
| wildreceipt | Anonymized | ✅ OK |
| OQO | Anonymized | ✅ OK |
| TED | Real but public | ⚠️ OK; contains real company names. Acceptable because public-procurement notices are public records, not "private PII". |
| MIDD | Real | ⚠️ OK; manually verified but not anonymized |
| FATURA | Synthetic | ✅ OK |
| Polish Tenders | Anonymized | ✅ OK |

Rule: if a dataset contains real PII that has not been explicitly anonymized, do not vendor it.

See SECURITY.md §2 PII Handling Defaults for the broader policy.


5. The Vendor Procedure

Adding a new vendored dataset follows this procedure. It is intentionally heavy because vendored data stays in the git history permanently.

5.1 Step-by-step

  1. Propose. Open an issue or PR with the dataset name, URL, license, intended use, and the layer it belongs to.
  2. License check. Confirm the license is in the "allowed" list (§4.1). If unsure, ask the maintainers.
  3. PII check. Confirm the dataset does not contain real PII (§4.3).
  4. Sample first. Vendor only a sample (≤ 200 files) for V1. The full dataset can be vendored in V2 if needed.
  5. Add to catalog. Add an entry to docs/TEST_DATA.md §2 and tests/fixtures/DATASET_LICENSES.md.
  6. Add to download script. Add the URL and extraction logic to scripts/fetch_test_data.py.
  7. Add to .gitignore. Add patterns to ensure non-vendored data (e.g., SROIE) is never committed.
  8. CI gate. Add a CI check that asserts no file under tests/fixtures/ violates the license catalog.
  9. Update tests. Reference the new dataset in the appropriate test file.
  10. Update docs. Update TESTING_STRATEGY.md if relevant.
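The CI gate in step 8 could be sketched as a directory scan against the catalog. This is an illustration under stated assumptions: `find_violations`, the catalog shape (relative directory mapped to a license label), and the rule that every second-level directory under `tests/fixtures/inputs/` counts as a dataset are all hypothetical.

```python
from pathlib import Path

# Hypothetical allowlist; the authoritative one is the table in §4.1.
ALLOWED_LICENSES = {"MIT", "Apache-2.0", "BSD-3", "CC0", "CC-BY-4.0"}


def find_violations(inputs_root: Path, catalog: dict[str, str]) -> list[str]:
    """Return one message per dataset directory that fails the gate.

    A dataset directory is any second-level directory, such as
    invoices/cord. It fails if it is missing from the catalog or if
    its catalogued license is not in the allowlist.
    """
    violations = []
    dataset_dirs = sorted(p for p in inputs_root.glob("*/*") if p.is_dir())
    for dataset_dir in dataset_dirs:
        key = dataset_dir.relative_to(inputs_root).as_posix()
        license_id = catalog.get(key)
        if license_id is None:
            violations.append(f"{key}: not in the license catalog")
        elif license_id not in ALLOWED_LICENSES:
            violations.append(f"{key}: license {license_id!r} is not allowed")
    return violations
```

A CI job would fail when the returned list is non-empty. Hand-written directories such as `invoices/synthetic` would need their own catalog entries under this scheme.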

5.2 What "vendor" means

  • Vendored = fetched by a maintainer with scripts/fetch_test_data.py, checked into git, and therefore available to CI without any download.
  • Not vendored = downloaded manually by individual developers, not in git, not in CI.

The boundary is sharp. If a dataset is "research-only" or "non-commercial", it is not vendored even if it would be useful.

5.3 Updating an existing dataset

When a new version of a vendored dataset is released:

  1. Verify the new version is still under an allowed license.
  2. Verify the new version does not introduce PII.
  3. Update the download URL and version in scripts/fetch_test_data.py.
  4. Run python scripts/fetch_test_data.py --update to re-vendor.
  5. Re-run the full test suite to confirm no regression.
  6. Update docs/TEST_DATA.md and tests/fixtures/DATASET_LICENSES.md.

6. The Download Script

scripts/fetch_test_data.py is the single entry point for vendoring datasets. It:

  • Downloads datasets from their canonical sources.
  • Extracts them to the right location under tests/fixtures/.
  • Verifies checksums.
  • Validates the license (refuses to vendor disallowed licenses).
  • Logs every action to tests/fixtures/DOWNLOAD_LOG.md.

The script is the source of truth for what's vendored. See scripts/fetch_test_data.py for the full implementation spec.
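The checksum step can be sketched with the standard library alone. The manifest format here (relative path mapped to an expected SHA-256 hex digest) and the function names are assumptions for illustration; the actual script may store checksums differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large fixtures are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest differs."""
    failures = []
    for relative_path, expected in sorted(manifest.items()):
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_path)
    return failures
```

Returning the list of failures, instead of raising on the first one, lets a single run report every damaged or missing file.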

6.1 Usage

# Vendor everything (the V1 corpus)
python scripts/fetch_test_data.py

# Vendor a specific dataset
python scripts/fetch_test_data.py --dataset cord

# Update to the latest version of all vendored datasets
python scripts/fetch_test_data.py --update

# List datasets and their licenses
python scripts/fetch_test_data.py --list

# Validate the vendored data against the license catalog
python scripts/fetch_test_data.py --validate-licenses
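A command-line surface matching the flags used in this section could be declared as follows. This is a sketch only: the flag names come from the usage shown in this document (including `--verify` from §6.2), while the help strings and the choice to make the mode flags mutually exclusive are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags shown in §6.1 and §6.2 (behavior not implemented here)."""
    parser = argparse.ArgumentParser(
        prog="fetch_test_data.py",
        description="Vendor and verify Paxman test datasets.",
    )
    parser.add_argument(
        "--dataset",
        metavar="NAME",
        help="operate on one dataset instead of the whole V1 corpus",
    )
    # Assumption: at most one mode flag per invocation.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--update", action="store_true",
                      help="re-vendor at the latest version")
    mode.add_argument("--list", action="store_true", dest="list_datasets",
                      help="list datasets and their licenses")
    mode.add_argument("--validate-licenses", action="store_true",
                      help="check vendored data against the license catalog")
    mode.add_argument("--verify", action="store_true",
                      help="check vendored data is present and intact")
    return parser
```

With no arguments, every flag is false and `dataset` is `None`, which corresponds to the "vendor everything" default.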

6.2 CI integration

CI runs:

# Verify the vendored data is present and intact
python scripts/fetch_test_data.py --verify

# Verify the license catalog
python scripts/fetch_test_data.py --validate-licenses

CI does not download anything; it only verifies.

6.3 What the script does NOT do

  • It does not download research-only datasets (SROIE, INV-CDIP) in a default run or in CI; a developer must request one explicitly for local use (§4.2).
  • It does not download non-commercial datasets (FATURA, MIDD).
  • It does not modify the license catalog.
  • It does not run the test suite.

7. The Programmatic Layer (Layer 2)

Layer 2 fixtures are generated at test time. They live under tests/fixtures/generated/ (gitignored).

7.1 The generator module

tests/fixtures/generators/ will contain (in code, not in this doc):

  • invoices.py: factory_boy factories for invoice inputs.
  • contracts.py: pydantic-factories factories for CanonicalContracts.
  • candidates.py: factory_boy factories for Candidate sets (Reconciler tests).
  • artifacts.py: factory_boy factories for ExecutionArtifacts.
  • hypothesis_strategies.py: Hypothesis strategies for property tests.

7.2 The Hypothesis strategy catalog

A paxman.testing module (public) will expose:

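from paxman.testing import strategies

# Note: .example() is meant for interactive exploration.
# Inside tests, draw values with @given instead.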

# Generate CanonicalContracts
contract = strategies.contracts().example()

# Generate raw inputs
input_data = strategies.invoice_inputs().example()

# Generate Budgets
budget = strategies.budgets().example()

# Generate Policies
policy = strategies.policies().example()

# Generate CapabilityRegistries
registry = strategies.registries().example()

# Generate CandidateResult sets
candidates = strategies.candidate_sets().example()

# Generate ExecutionArtifacts
artifact = strategies.artifacts().example()

These strategies are the foundation of TESTING_STRATEGY.md §3 Property Tests.

7.3 Reproducibility

  • All programmatic fixtures are generated with a fixed random seed by default.
  • factory_boy uses factory.random.reseed_random(seed) for reproducibility.
  • hypothesis uses derandomize=True for the same purpose.
  • The seed is recorded in tests/fixtures/generated/SEED.txt so a failing test can be reproduced.
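The seed-recording idea can be sketched with the standard library's `random` module standing in for the real generators. The function names are hypothetical; in the actual suite the same seed would also be passed to factory_boy and Hypothesis as described in the bullets above.

```python
import random
from pathlib import Path


def seeded_rng(seed: int, seed_file: Path) -> random.Random:
    """Record the seed on disk, then return a generator driven only by it."""
    seed_file.parent.mkdir(parents=True, exist_ok=True)
    seed_file.write_text(f"{seed}\n", encoding="utf-8")
    # A dedicated Random instance avoids interference from other code
    # that touches the module-level random state.
    return random.Random(seed)


def replay_rng(seed_file: Path) -> random.Random:
    """Rebuild the same generator from a previously recorded seed file."""
    recorded = int(seed_file.read_text(encoding="utf-8").strip())
    return random.Random(recorded)
```

Because both functions construct the generator from the same integer, a failing run can be reproduced by replaying from the recorded file.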

8. The Curated Layer (Layer 3)

Layer 3 fixtures are hand-picked and committed. They serve as the integration-test ground truth.

8.1 Selection criteria

A fixture is added to Layer 3 when:

  • It exercises a specific subsystem feature.
  • It has a known, expected ExecutionArtifact (a "golden artifact").
  • It would be expensive to regenerate programmatically.

8.2 The canonical V1 fixtures

These are the nine curated fixtures for V1:

| Fixture | Contract | Input | Expected status |
|---|---|---|---|
| invoice_simple | Invoice Pydantic | Plain-text invoice | SUCCESS |
| invoice_partial | Invoice Pydantic | Invoice with missing tax_amount | PARTIAL_SUCCESS |
| invoice_unresolved | Invoice Pydantic | Invoice with no supplier name | UNRESOLVED |
| quotation_simple | Quotation Pydantic | OQO-format quote | SUCCESS |
| procurement_csv | Procurement Pydantic | CSV with multi-currency lines | SUCCESS |
| invalid_contract | Invoice Pydantic | n/a (contract is bad) | INVALID_CONTRACT |
| execution_failed | Invoice Pydantic | Input that triggers a capability crash | EXECUTION_FAILED |
| money_mismatch | Invoice Pydantic (multi-currency) | Invoice with conflicting currencies | PARTIAL_SUCCESS |
| multi_page | MultiPageInvoice | Multi-page PDF | SUCCESS |

8.3 Golden artifacts

For each curated fixture, tests/fixtures/artifacts/ contains a JSON file with the expected ExecutionArtifact, including the replay_hash. These are written by hand based on the canonical contract model and the planned planner behavior. They are the ground truth for replay-equality tests.

Important: the golden artifacts are versioned. When the ExecutionArtifact schema changes, the golden artifacts are regenerated. See REPLAY_AND_DETERMINISM.md §3 The Replay Protocol for the version-compatibility rules.

The actual golden artifacts are not written in this doc. They are written in code, against the actual ExecutionArtifact schema, once that schema is implemented. The gap is deliberate: we do not commit to an artifact JSON shape until the code defines it.
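Until the schema exists, the mechanics of a replay-equality check can still be illustrated on plain dictionaries. Everything below is a stand-in: the digest is SHA-256 over key-sorted JSON, not Paxman's actual replay_hash definition, and the `created_at` field name and both function names are hypothetical.

```python
import hashlib
import json


def canonical_hash(payload: dict) -> str:
    """Stand-in digest: SHA-256 over key-sorted, whitespace-free JSON."""
    encoded = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def matches_golden(
    actual: dict,
    golden: dict,
    volatile: frozenset = frozenset({"created_at"}),
) -> bool:
    """Compare two artifacts after dropping top-level volatile fields."""

    def stable(artifact: dict) -> dict:
        return {k: v for k, v in artifact.items() if k not in volatile}

    return canonical_hash(stable(actual)) == canonical_hash(stable(golden))
```

Sorting keys before hashing makes the comparison independent of dictionary ordering, so only value differences in non-volatile fields cause a mismatch.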


9. The Adversarial Layer (Layer 1)

Layer 1 is the smallest but most valuable. It contains purpose-built edge cases that have caused bugs in the past or are likely to.

9.1 The V1 adversarial catalog

| Fixture | What it tests | Source |
|---|---|---|
| empty_input.txt | Empty input | Hand-written |
| unicode_only.txt | Unicode-only input (no ASCII) | Hand-written |
| extremely_large.txt | 10 MB input | Generated |
| truncated_pdf.bin | Truncated PDF binary | Generated |
| mismatched_currency.txt | Invoice with conflicting currencies | Hand-written |
| prompt_injection.txt | Invoice text containing prompt-injection payload | Hand-written |
| nested_500.txt | Deeply nested JSON | Generated |
| right_to_left.txt | RTL language invoice | Vendored (subset of CORD) |
| null_bytes.bin | Input with embedded null bytes | Generated |
| multiple_invoices.txt | One input, multiple invoices | Hand-written |

These are committed to tests/fixtures/inputs/adversarial/. Apart from the 10 MB extremely_large.txt, they are small (<1 MB total) but high-signal.
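The fixtures marked "Generated" in the catalog could be produced by a short script along these lines. This is a sketch under assumptions: the file contents are invented placeholders, and reading nested_500.txt as 500 levels of JSON arrays is a guess based on the file name.

```python
from pathlib import Path


def write_generated_fixtures(
    target: Path,
    large_bytes: int = 10 * 1024 * 1024,
    depth: int = 500,
) -> None:
    """Write the four generated adversarial fixtures into target."""
    target.mkdir(parents=True, exist_ok=True)

    # extremely_large.txt: repeat one line, then trim to the exact size.
    line = b"ITEM 0001  Widget  1 x 9.99 EUR\n"
    repeats = large_bytes // len(line) + 1
    (target / "extremely_large.txt").write_bytes((line * repeats)[:large_bytes])

    # truncated_pdf.bin: a PDF header and the start of one object,
    # cut off before any cross-reference table or end-of-file marker.
    (target / "truncated_pdf.bin").write_bytes(
        b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog"
    )

    # nested_500.txt: `depth` levels of nested JSON arrays.
    (target / "nested_500.txt").write_text(
        "[" * depth + "]" * depth, encoding="utf-8"
    )

    # null_bytes.bin: otherwise ordinary text with embedded NUL bytes.
    (target / "null_bytes.bin").write_bytes(
        b"Invoice\x00 No. 42\x00\nTotal: 10.00\x00"
    )
```

Generating these files from a script keeps their construction reviewable, which is harder with opaque binary blobs.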


10. Attribution and Provenance

Every vendored file is attributed in tests/fixtures/DATASET_LICENSES.md. The file lists, for each vendored dataset:

  • Dataset name and version.
  • Source URL.
  • License.
  • Citation (paper or DOI).
  • Number of files vendored.
  • Date vendored.
  • Path to the vendored files.

Example entry:

### CORD (Consolidated Receipt Dataset)

- **Source:** https://github.com/clovaai/cord
- **Version:** v1 (HuggingFace mirror)
- **License:** CC-BY-4.0
- **Citation:** Park et al. (2019), "CORD: A Consolidated Receipt Dataset for Post-OCR Parsing", Document Intelligence Workshop at NeurIPS 2019.
- **Files vendored:** 100 (test split)
- **Path:** `tests/fixtures/inputs/invoices/cord/`
- **Vendored on:** 2026-06-22
- **Notes:** Used for receipt-parsing end-to-end tests. CC-BY-4.0 requires attribution, which is this entry.

The full file is at tests/fixtures/DATASET_LICENSES.md.
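Entries in the format shown above could be rendered from structured records so that every dataset carries the same fields. The `DatasetAttribution` class and its `to_markdown` method are hypothetical helpers, sketched only to mirror the example entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetAttribution:
    """One DATASET_LICENSES.md entry, with fields in the order of the example."""

    name: str
    source: str
    version: str
    license: str
    citation: str
    files_vendored: str
    path: str
    vendored_on: str
    notes: str

    def to_markdown(self) -> str:
        lines = [
            f"### {self.name}",
            "",
            f"- **Source:** {self.source}",
            f"- **Version:** {self.version}",
            f"- **License:** {self.license}",
            f"- **Citation:** {self.citation}",
            f"- **Files vendored:** {self.files_vendored}",
            f"- **Path:** `{self.path}`",
            f"- **Vendored on:** {self.vendored_on}",
            f"- **Notes:** {self.notes}",
        ]
        return "\n".join(lines) + "\n"
```

Because every field is required by the dataclass, an entry with a missing license or citation fails at construction time.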


11. The "Real Data" Layer (Layer 5)

Layer 5 is never in the repo. It is the caller's actual production data.

11.1 The contract

When Paxman is used in production, the caller is expected to:

  1. Run the full test suite against their own sample of production data.
  2. Verify that the replay_hash is stable across runs.
  3. Verify that the field-resolution rate meets the success-metric target (≥ 90%).
  4. File an issue if a real-world input is not handled well — this is how the V1 corpus grows.
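Checks 2 and 3 reduce to simple arithmetic over a caller's own sample. The sketch below assumes the field-resolution rate is resolved fields divided by total contract fields across the sample; that definition and the function names are illustrative assumptions, while the 90% target comes from the list above.

```python
def field_resolution_rate(resolved_fields: int, total_fields: int) -> float:
    """Fraction of contract fields resolved across the sample."""
    if total_fields <= 0:
        raise ValueError("total_fields must be positive")
    if not 0 <= resolved_fields <= total_fields:
        raise ValueError("resolved_fields must be between 0 and total_fields")
    return resolved_fields / total_fields


def meets_target(rate: float, target: float = 0.90) -> bool:
    """True when the rate reaches the success-metric target."""
    return rate >= target


def replay_hash_is_stable(hashes: list[str]) -> bool:
    """True when repeated runs over the same input all produced one hash."""
    return len(hashes) > 0 and len(set(hashes)) == 1
```

For example, 183 resolved fields out of 200 gives a rate of 0.915, which meets the target, while 179 out of 200 gives 0.895, which does not.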

11.2 The contribution path

Real-world inputs that are anonymized and publicly shareable can be contributed to the V1 corpus via PR. The vendor procedure (§5) applies. Real-world inputs that contain PII or are under NDA must not be contributed.


12. The Corpus Roadmap

The V1 corpus is intentionally small. Here's how it grows:

| Phase | Corpus size | Source |
|---|---|---|
| V0.1 (initial preview) | 0 MB | No vendored data; tests use programmatic only |
| V0.3 (alpha) | ~10 MB | 50 CORD + 50 InvoiceBenchmark |
| V0.5 (beta) | ~35 MB | V1 corpus as catalogued in §2.1 |
| V1.0 (1.0 release) | ~50 MB | V1 corpus + curated golden artifacts |
| V2 | ~200 MB | Full CORD + SROIE (research-only) + MIDD + Polish Tenders |
| V3 | ~1 GB | Full HuggingFace receipt dataset + procurement data |

The corpus grows only when a dataset is needed for a specific test, and only if its license is in the allowed list.


13. See also