Test Data Strategy
Status: Draft v1. Audience: Paxman contributors, maintainers, and downstream integrators. Related docs: TESTING_STRATEGY.md, DEVELOPMENT.md, EXTENDING.md, SECURITY.md, DEPENDENCIES.md
This document is the single source of truth for test data in Paxman. It defines:
- The 5-layer test data model (synthetic → real).
- The V1 dataset catalog: every dataset Paxman uses, with license, version, and intended use.
- The licensing policy: what's allowed in a commercially distributed library.
- The vendor procedure: how to add or update a vendored dataset.
- The attribution and provenance rules: how to credit every dataset.
- The security and PII policy for test data.
Why this doc exists: Paxman normalizes data, so the test suite must cover the full range of inputs, contracts, and edge cases. Test data shapes the canonical contract model, the planner heuristics, the reconciler logic, and the replay assertions. Designing the test data alongside the architecture is cheaper than retrofitting it later.
1. The 5-Layer Test Data Model
Paxman organizes test data in five layers, from synthetic to real:
Layer 5 — Real production data (caller-supplied, NEVER committed)
▲
│
Layer 4 — Public open datasets (curated subsets, vendored in repo)
▲
│
Layer 3 — Curated fixtures (hand-picked, golden artifacts committed)
▲
│
Layer 2 — Programmatic fixtures (generated by factory_boy / faker / hypothesis)
▲
│
Layer 1 — Synthetic edge cases (purpose-built: malformed, adversarial, regression)
1.1 What each layer is for
| Layer | Source | Committed? | Size | Purpose | Authored by |
|---|---|---|---|---|---|
| 1. Synthetic edge cases | Hand-written or generated in-test | Yes (the small set) | Tiny (<1 MB) | Regression tests for known bugs, adversarial inputs, malformed data | Paxman team |
| 2. Programmatic fixtures | factory_boy + faker + hypothesis | No (gitignored) | Generated per test run | Property tests, fuzzing, large variation coverage | Test code |
| 3. Curated fixtures | Hand-picked from layer 4 | Yes | ~10–50 MB | Integration tests, replay golden artifacts | Paxman team |
| 4. Public open datasets | Vendored from external open sources | Yes (subset only) | ~50 MB for V1 | End-to-end smoke tests, real-world OCR noise, multi-currency | External + curated |
| 5. Real production data | Caller's actual data | NEVER | n/a | Production validation | Caller |
1.2 The mapping to test layers
| Test type (from TESTING_STRATEGY.md §1) | Data layer used |
|---|---|
| Property tests | Layer 2 (programmatic) |
| Unit tests | Layer 1 (edge cases) + small Layer 3 fixtures |
| Integration tests | Layer 3 (curated) + Layer 2 (generated) |
| End-to-end tests | Layer 4 (open datasets) + Layer 3 (golden artifacts) |
| Production validation | Layer 5 (real data) — never committed |
2. V1 Dataset Catalog
This is the authoritative list of datasets Paxman uses in V1. Every entry has been verified for license compatibility with a commercially-distributed library.
2.1 Receipt and invoice documents (end-to-end, replay, evidence, MONEY)
| Dataset | License | Size | What it gives Paxman | Layer | V1 use |
|---|---|---|---|---|---|
| clovaai/cord | CC-BY-4.0 | 1,000 receipts (1,000 OCR'd) | Receipt parsing with bbox + ground truth | 4 | Required |
| naver-clova-ix/cord-v1 | CC-BY-4.0 | 1,000 receipts | CORD v1 mirror on HuggingFace | 4 | Required (cleaner mirror) |
| naver-clova-ix/cord-v2 | CC-BY-4.0 | 1,000 receipts | CORD v2 with sub_group_id | 4 | Recommended |
| ICDAR 2019 SROIE | Research-only (per ICDAR-2019) | 1,000 receipts | Real scanned receipts, key info extraction | 4 | Required for OCR noise (development only; see §4.2) |
| jngb-labs/InvoiceBenchmark | MIT | 200 synthetic + 200 PDF + 200 PNG | Cent-perfect ground truth, multi-currency, deterministic | 4 | Required for V1 MONEY tests |
| alamgirqazi/invoice-ocr-synthetic | Apache-2.0 | 3,000 invoices | Full field-level labels, line items, multiple currencies | 4 | Required |
| AmineTibari/InvoiceJSON | CC-BY-4.0 | 340 invoices | Image + JSON structured annotations | 4 | Recommended |
| wizenheimer/invoices_receipts_ocr_v1 | MIT | Subset of invoices-and-receipts_ocr_v1 | OCR'd invoices with structured fields | 4 | Recommended |
| kaydee/wildreceipt | Apache-2.0 | ~1,500 wild receipts | "In-the-wild" receipts (harder than CORD/SROIE) | 4 | Recommended |
| mathieu1256/FATURA | CC-BY-NC-4.0 | 10,000 invoices, 50 layouts | Multi-layout invoices with bounding boxes | 4 | V2 only (NC license) |
| salesforce/inv-cdip | CC-BY-NC-4.0 | 350 labeled + 200k unlabeled | Real invoices with 7 fields | 4 | V2 only (NC license) |
| Voxel51/high-quality-invoice-images-for-ocr | Research | 8,181 synthetic (1,489 labeled) | Synthetic invoices, tabular layouts | 4 | Optional V2 |
V1 corpus plan (committing to repo):
| Source | Files vendored | Estimated size |
|---|---|---|
| CORD test split (CC-BY-4.0) | 100 receipts | ~5 MB |
| InvoiceBenchmark (MIT) | 200 invoices (full) | ~10 MB |
| alamgirqazi (Apache-2.0) | 500 invoices | ~15 MB |
| wildreceipt (Apache-2.0) | 200 receipts | ~5 MB |
| Total V1 | ~1,000 documents | ~35 MB |
SROIE is not vendored in the repo (research-only license) but is referenced in the test suite via a documented download path. See §4.2.
2.2 Contracts (Pydantic, JSON Schema, OpenAPI, Dict DSL)
| Source | License | What it gives Paxman | V1 use |
|---|---|---|---|
| JSON-Schema-Test-Suite | BSD-3-Clause / Apache-2.0 (per file) | Every JSON Schema feature tested across drafts | Required for the JSON Schema adapter |
| OpenAPI Petstore v3.0 | MIT | Canonical OpenAPI 3.0 example | Required for the OpenAPI adapter smoke test |
| OpenAPI Petstore v3.1 | MIT | OpenAPI 3.1 with $ref, oneOf, etc. | Required |
| openapi-json-schema-tools test specs | Apache-2.0 | 3.0/3.1 unit test specs | Recommended |
| OQO (APH123614/oqo) | MIT (code) / CC-BY-4.0 (data) | Open-Quote-Object JSON Schema, 72 anonymized quotes | Highly recommended for the quotation use case |
| OpenAPI-Specification examples | Apache-2.0 | ~20 official example specs | Required — covers all common OAS features |
2.3 Procurement data (use case C)
| Source | License | Size | V1 use |
|---|---|---|---|
| TED (Tenders Electronic Daily) | Commission Decision 2011/833/EU | 200k+ notices/yr | Required for the procurement use case |
| atlasprzetargow/polish-tenders-dataset | CC-BY-4.0 (data) / MIT (code) | 1.4M Polish + EU notices | Recommended |
| OpenMLDatasets/ted_2025_07_sample | CC0 (sample) / commercial (full) | 100 sample / 260k+ full | Required (sample) — easy entry point |
| Urwashanza/europrocure-10-public-procurement | (research) | 1.5M records, 2016–2025 | V2 only (research license) |
| GTI Global Public Procurement Dataset (GPPD) | CC BY-NC 3.0 | 42 countries, 2006–2021 | V2 only (NC license) |
| Valan procurement sample | MIT (code) / free for eval (data) | 1,000-row sample | Optional V2 |
2.4 Multi-page and long-form documents
| Source | License | What it gives |
|---|---|---|
| lmms-lab/MP-DocVQA | MIT | Multi-page VQA — 5,000+ pages |
| theatticusproject/cuad | CC-BY-4.0 | Contract Understanding Atticus Dataset — 510 contracts, 13k+ annotations |
| AgamiAI/Indian-Bank-Statements (via Thoughtworks benchmark) | Apache-2.0 | Bank statements, multi-page tables |
2.5 Synthetic data generators (programmatic, in-test)
These are libraries, not datasets. They generate Layer 2 fixtures at test time.
| Library | License | Purpose | V1 use |
|---|---|---|---|
| Faker | MIT | Names, addresses, emails, currencies, dates | Required — core generator |
| factory_boy | MIT | Build Pydantic/dataclass factories using Faker | Required for V1 test code |
| pydantic-factories | MIT | Auto-derive factory from Pydantic model | Recommended for Pydantic-heavy tests |
| Hypothesis | MPL-2.0 | Property-based test data generation | Required — already in dev deps |
| hypothesis-jsonschema | MPL-2.0 | Generate instances from JSON Schema | Required for contract/ adapter tests |
| dirty-equals | MIT | Fuzzy comparison for tests | Required for replay-equality tests |
3. Directory layout
The full V1 test data directory tree, mirroring the 5-layer model:
tests/fixtures/
├── README.md  # quick orientation
├── DATASET_LICENSES.md  # attribution for every vendored file
│
├── contracts/  # LAYER 3: curated contracts
│   ├── pydantic/
│   │   ├── invoice.py
│   │   ├── quotation.py
│   │   ├── procurement.py
│   │   ├── receipt.py
│   │   ├── multi_page.py
│   │   └── edge_cases/
│   │       ├── empty_model.py
│   │       ├── deeply_nested.py
│   │       ├── with_money.py  # for MONEY type tests
│   │       ├── all_v1_types.py  # exercises every field type
│   │       └── invalid_*.py  # validator tests
│   ├── json_schema/
│   │   ├── invoice.json
│   │   ├── quotation.json
│   │   ├── procurement.json
│   │   ├── receipt.json
│   │   ├── multi_page.json
│   │   └── drafts/  # covers all JSON Schema drafts
│   │       ├── draft-04.json
│   │       ├── draft-06.json
│   │       ├── draft-07.json
│   │       ├── draft-2019-09.json
│   │       └── draft-2020-12.json
│   ├── dict_dsl/
│   │   ├── invoice.py
│   │   ├── quotation.py
│   │   ├── procurement.py
│   │   └── edge_cases/
│   └── openapi/
│       ├── petstore_3_0.yaml
│       ├── petstore_3_1.yaml
│       ├── procurement_api.yaml
│       └── edge_cases/
│
├── inputs/  # LAYER 4: open dataset samples
│   ├── invoices/
│   │   ├── synthetic/  # small synthetic inputs for smoke tests
│   │   │   ├── invoice_plain.txt
│   │   │   ├── invoice_email.txt
│   │   │   └── invoice_csv.csv
│   │   ├── cord/  # vendored CORD samples (CC-BY-4.0)
│   │   │   ├── cord_sample_001.png
│   │   │   ├── cord_sample_001.json
│   │   │   └── ...  # ~100 files
│   │   ├── invoicebench/  # vendored InvoiceBenchmark (MIT)
│   │   │   ├── invoicebench_001.md
│   │   │   ├── invoicebench_001.pdf
│   │   │   ├── invoicebench_001.png
│   │   │   └── ...  # 200 files
│   │   └── alamgirqazi/  # vendored (Apache-2.0)
│   │       └── ...  # ~500 files
│   ├── receipts/
│   │   ├── wildreceipt/  # vendored (Apache-2.0)
│   │   └── synthetic/
│   ├── quotations/
│   │   ├── synthetic/
│   │   │   ├── quotation_simple.txt
│   │   │   └── quotation_with_footnotes.txt
│   │   └── oqo/  # vendored OQO (CC-BY-4.0)
│   │       └── ...  # 72 files
│   ├── procurement/
│   │   ├── ted_sample/  # vendored TED (Commission Decision)
│   │   ├── polish_tenders/
│   │   └── synthetic/
│   ├── multi_page/
│   │   ├── mp_docvqa/  # vendored (MIT)
│   │   └── cuad/  # vendored (CC-BY-4.0)
│   └── adversarial/  # LAYER 1: edge cases
│       ├── empty_input.txt
│       ├── unicode_only.txt
│       ├── extremely_large.txt  # 10MB
│       ├── truncated_pdf.bin
│       ├── mismatched_currency.txt
│       └── prompt_injection.txt  # for inference security tests
│
├── artifacts/  # LAYER 3: golden artifacts
│   ├── README.md  # placeholder for golden artifacts
│   ├── invoice_success.json
│   ├── invoice_partial.json
│   ├── invoice_unresolved.json
│   ├── quotation_success.json
│   ├── procurement_success.json
│   └── invalid_contract.json
│
└── generated/  # LAYER 2: programmatic (gitignored)
    ├── .gitignore
    └── README.md
Total size estimate: ~50 MB of vendored data, all open-licensed.
4. Licensing Policy
4.1 Allowed licenses for vendored data
Paxman is intended to be a commercially-distributed library. Vendored test data must be compatible with that.
| License | OK to vendor? | Notes |
|---|---|---|
| MIT | ✅ Yes | No restrictions |
| Apache-2.0 | ✅ Yes | Include license + notice in DATASET_LICENSES.md |
| BSD-2 / BSD-3 | ✅ Yes | Include license |
| CC0 | ✅ Yes | No attribution required, but good practice to credit |
| CC-BY-4.0 | ✅ Yes | Attribution required; must be documented in DATASET_LICENSES.md |
| CC-BY-3.0 | ✅ Yes | Attribution required |
| CC-BY-SA-4.0 | ⚠️ Conditional | Share-alike applies to derivatives. Avoid unless we accept the SA clause. |
| CC-BY-NC-4.0 | ❌ No for V1 vendor | Non-commercial only. May be used in development by individual developers but not vendored in the repo. |
| CC-BY-NC-SA-4.0 | ❌ No | Non-commercial + share-alike |
| Research-only | ❌ No | Cannot redistribute. Use individual dev accounts for development. |
| Commercial | ❌ No | Costs money; budget approval required. |
| Unknown | ❌ No | Do not use. |
Policy: V1 vendors only MIT, Apache-2.0, BSD, CC0, and CC-BY (non-SA) datasets. See DATASET_LICENSES.md for the full list.
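The policy above amounts to an allowlist lookup. The following is a minimal sketch of such a check, not Paxman's implementation: the function name `can_vendor` and the SPDX-style identifier spellings are assumptions made for illustration.

```python
from typing import Optional

# Assumed identifier spellings for the licenses the V1 policy allows.
ALLOWED_LICENSES = frozenset({
    "MIT",
    "Apache-2.0",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "CC0-1.0",
    "CC-BY-3.0",
    "CC-BY-4.0",
})


def can_vendor(license_id: Optional[str]) -> bool:
    """Return True only for licenses the V1 policy allows to be vendored.

    Hypothetical helper: anything missing, unknown, share-alike,
    non-commercial, or research-only is rejected.
    """
    if not license_id:
        return False  # "Unknown" row of the table: do not use
    return license_id.strip() in ALLOWED_LICENSES
```

A default-deny lookup keeps the conditional CC-BY-SA case out of the repo until someone explicitly accepts the share-alike clause and extends the set.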
4.2 Research-only datasets (development only)
Some important datasets (SROIE, INV-CDIP, MP-DocVQA subsets) are released under "research-only" terms. These cannot be vendored but can be used by individual developers for local development. The CI pipeline must not download or distribute them.
The strategy:
- The team documents the download path in scripts/fetch_test_data.py (see §6).
- CI runs only against vendored data.
- A developer who wants to run with SROIE can run python scripts/fetch_test_data.py --dataset sroie locally.
- CI does not exercise the SROIE code path; it's only for local exploration.
4.3 PII and sensitive data
Paxman normalizes potentially-sensitive data (invoices, receipts, procurement). Test data must not contain real PII.
| Source | PII status | Action |
|---|---|---|
| CORD | Already-anonymized Indonesian receipts | ✅ OK to vendor |
| SROIE | Desensitized by ICDAR (PII blurred) | ✅ OK (research-only license is the constraint, not PII) |
| InvoiceBenchmark | Synthetic | ✅ OK |
| alamgirqazi | Synthetic | ✅ OK |
| wildreceipt | Anonymized | ✅ OK |
| OQO | Anonymized | ✅ OK |
| TED | Real but public | ⚠️ OK; contains real company names, but procurement notices are public records, so this is not private PII. |
| MIDD | Real | ⚠️ OK; manually verified but not anonymized |
| FATURA | Synthetic | ✅ OK |
| Polish Tenders | Anonymized | ✅ OK |
Rule: if a dataset contains real PII that has not been explicitly anonymized, do not vendor it.
See SECURITY.md §2 PII Handling Defaults for the broader policy.
5. The Vendor Procedure
Adding a new vendored dataset follows this procedure. It is intentionally heavy because vendored data is forever.
5.1 Step-by-step
- Propose. Open an issue or PR with the dataset name, URL, license, intended use, and the layer it belongs to.
- License check. Confirm the license is in the "allowed" list (§4.1). If unsure, ask the maintainers.
- PII check. Confirm the dataset does not contain real PII (§4.3).
- Sample first. Vendor only a sample (≤ 200 files) for V1. The full dataset can be vendored in V2 if needed.
- Add to catalog. Add an entry to docs/TEST_DATA.md §2 and tests/fixtures/DATASET_LICENSES.md.
- Add to download script. Add the URL and extraction logic to scripts/fetch_test_data.py.
- Add to .gitignore. Add patterns to ensure non-vendored data (e.g., SROIE) is never committed.
- CI gate. Add a CI check that asserts no file under tests/fixtures/ violates the license catalog.
- Update tests. Reference the new dataset in the appropriate test file.
- Update docs. Update TESTING_STRATEGY.md if relevant.
5.2 What "vendor" means
- Vendored = checked into git, downloaded by scripts/fetch_test_data.py, available to CI.
- Not vendored = downloaded manually by individual developers, not in git, not in CI.
The boundary is sharp. If a dataset is "research-only" or "non-commercial", it is not vendored even if it would be useful.
5.3 Updating an existing dataset
When a new version of a vendored dataset is released:
- Verify the new version is still under an allowed license.
- Verify the new version does not introduce PII.
- Update the download URL and version in scripts/fetch_test_data.py.
- Run python scripts/fetch_test_data.py --update to re-vendor.
- Re-run the full test suite to confirm no regression.
- Update docs/TEST_DATA.md and tests/fixtures/DATASET_LICENSES.md.
6. The Download Script
scripts/fetch_test_data.py is the single entry point for vendoring datasets. It:
- Downloads datasets from their canonical sources.
- Extracts them to the right location under tests/fixtures/.
- Verifies checksums.
- Validates the license (refuses to vendor disallowed licenses).
- Logs every action to tests/fixtures/DOWNLOAD_LOG.md.
The script is the source of truth for what's vendored. See scripts/fetch_test_data.py for the full implementation spec.
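The checksum step could look roughly like the sketch below. It assumes SHA-256 and a manifest that maps relative paths to hex digests; neither the algorithm nor the manifest shape is specified in this document, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large fixtures are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or whose digest differs."""
    bad = []
    for rel_path, expected in sorted(manifest.items()):
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel_path)
    return bad
```

An empty return value means every vendored file is present and intact, which is the condition a verify-only CI run needs.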
6.1 Usage
# Vendor everything (the V1 corpus)
python scripts/fetch_test_data.py
# Vendor a specific dataset
python scripts/fetch_test_data.py --dataset cord
# Update to the latest version of all vendored datasets
python scripts/fetch_test_data.py --update
# List datasets and their licenses
python scripts/fetch_test_data.py --list
# Validate the vendored data against the license catalog
python scripts/fetch_test_data.py --validate-licenses
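The flags this document names (--dataset, --update, --list, --validate-licenses, and --verify) could be wired up with argparse as sketched here. The help strings and the choice to make the mode flags mutually exclusive are assumptions, not the script's specified behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags named in this document (sketch only)."""
    parser = argparse.ArgumentParser(
        prog="fetch_test_data.py",
        description="Vendor and verify Paxman test datasets.",
    )
    parser.add_argument(
        "--dataset", metavar="NAME",
        help="operate on a single dataset (for example: cord)",
    )
    # Assumption: at most one mode flag per invocation.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--update", action="store_true",
                      help="re-vendor the latest versions")
    mode.add_argument("--list", action="store_true", dest="list_datasets",
                      help="list datasets and their licenses")
    mode.add_argument("--validate-licenses", action="store_true",
                      help="check vendored data against the license catalog")
    mode.add_argument("--verify", action="store_true",
                      help="check vendored data is present and intact")
    return parser
```

With no flags at all, every mode is False and dataset is None, which corresponds to the "vendor everything" invocation.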
6.2 CI integration
CI runs:
# Verify the vendored data is present and intact
python scripts/fetch_test_data.py --verify
# Verify the license catalog
python scripts/fetch_test_data.py --validate-licenses
CI does not download anything; it only verifies.
6.3 What the script does NOT do
- It does not download research-only datasets (SROIE, INV-CDIP).
- It does not download non-commercial datasets (FATURA, MIDD).
- It does not modify the license catalog.
- It does not run the test suite.
7. The Programmatic Layer (Layer 2)
Layer 2 fixtures are generated at test time. They live under tests/fixtures/generated/ (gitignored).
7.1 The generator module
tests/fixtures/generators/ will contain (in code, not in this doc):
- invoices.py: factory_boy factories for invoice inputs.
- contracts.py: pydantic-factories factories for CanonicalContracts.
- candidates.py: factory_boy factories for Candidate sets (Reconciler tests).
- artifacts.py: factory_boy factories for ExecutionArtifacts.
- hypothesis_strategies.py: Hypothesis strategies for property tests.
7.2 The Hypothesis strategy catalog
A paxman.testing module (public) will expose:
from paxman.testing import strategies
# Generate CanonicalContracts
contract = strategies.contracts().example()
# Generate raw inputs
input_data = strategies.invoice_inputs().example()
# Generate Budgets
budget = strategies.budgets().example()
# Generate Policies
policy = strategies.policies().example()
# Generate CapabilityRegistries
registry = strategies.registries().example()
# Generate CandidateResult sets
candidates = strategies.candidate_sets().example()
# Generate ExecutionArtifacts
artifact = strategies.artifacts().example()
These strategies are the foundation of TESTING_STRATEGY.md §3 Property Tests.
7.3 Reproducibility
- All programmatic fixtures are generated with a fixed random seed by default.
- factory_boy uses factory.random.reseed_random(seed) for reproducibility.
- hypothesis uses derandomize=True for the same purpose.
- The seed is recorded in tests/fixtures/generated/SEED.txt so a failing test can be reproduced.
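The seeding rules above might be applied from a single helper, sketched below. The function name `seed_everything` and the Hypothesis profile name "deterministic" are assumptions; the factory_boy and Hypothesis imports are guarded so that the sketch also runs where those packages are not installed.

```python
import random
from pathlib import Path


def seed_everything(seed: int, seed_file: Path) -> random.Random:
    """Record the seed, seed optional third-party generators, return a seeded RNG."""
    seed_file.parent.mkdir(parents=True, exist_ok=True)
    seed_file.write_text(f"{seed}\n", encoding="utf-8")

    try:
        import factory.random  # provided by factory_boy
        factory.random.reseed_random(seed)
    except ImportError:
        pass  # factory_boy not installed in this environment

    try:
        from hypothesis import settings
        # derandomize=True makes Hypothesis derive examples deterministically.
        settings.register_profile("deterministic", derandomize=True)
        settings.load_profile("deterministic")
    except ImportError:
        pass  # hypothesis not installed in this environment

    return random.Random(seed)
```

Calling it twice with the same seed yields RNGs that produce identical sequences, and the recorded file lets a developer replay a failing run.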
8. The Curated Layer (Layer 3)
Layer 3 fixtures are hand-picked and committed. They serve as the integration-test ground truth.
8.1 Selection criteria
A fixture is added to Layer 3 when:
- It exercises a specific subsystem feature.
- It has a known, expected ExecutionArtifact (a "golden artifact").
- It would be expensive to regenerate programmatically.
8.2 The canonical V1 fixtures
These are the nine curated fixtures for V1:
| Fixture | Contract | Input | Expected status |
|---|---|---|---|
| invoice_simple | Invoice Pydantic | Plain-text invoice | SUCCESS |
| invoice_partial | Invoice Pydantic | Invoice with missing tax_amount | PARTIAL_SUCCESS |
| invoice_unresolved | Invoice Pydantic | Invoice with no supplier name | UNRESOLVED |
| quotation_simple | Quotation Pydantic | OQO-format quote | SUCCESS |
| procurement_csv | Procurement Pydantic | CSV with multi-currency lines | SUCCESS |
| invalid_contract | Invoice Pydantic | n/a (contract is bad) | INVALID_CONTRACT |
| execution_failed | Invoice Pydantic | Input that triggers a capability crash | EXECUTION_FAILED |
| money_mismatch | Invoice Pydantic (multi-currency) | Invoice with conflicting currencies | PARTIAL_SUCCESS |
| multi_page | MultiPageInvoice | Multi-page PDF | SUCCESS |
8.3 Golden artifacts
For each curated fixture, tests/fixtures/artifacts/ contains a JSON file with the expected ExecutionArtifact, including the replay_hash. These are written by hand based on the canonical contract model and the planned planner behavior. They are the ground truth for replay-equality tests.
Important: the golden artifacts are versioned. When the ExecutionArtifact schema changes, the golden artifacts are regenerated. See REPLAY_AND_DETERMINISM.md §3 The Replay Protocol for the version-compatibility rules.
The actual golden artifacts are not written in this doc. They are written in code, against the actual ExecutionArtifact schema, once that schema is implemented. This gap is deliberate: we do not commit to an artifact JSON shape until the code defines it.
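A replay-equality check against a golden artifact could be sketched as follows. The real replay_hash is defined elsewhere (see REPLAY_AND_DETERMINISM.md); the SHA-256 over canonical JSON used here is only a stand-in, and the function names and example field names are hypothetical.

```python
import hashlib
import json


def canonical_hash(payload: dict) -> str:
    """Stand-in hash: SHA-256 over key-sorted, whitespace-free JSON."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_matches_golden(actual: dict, golden: dict) -> None:
    """Raise AssertionError naming the top-level fields that differ."""
    if canonical_hash(actual) == canonical_hash(golden):
        return
    changed = sorted(
        key for key in set(actual) | set(golden)
        if actual.get(key) != golden.get(key)
    )
    raise AssertionError(f"artifact differs from golden in fields: {changed}")
```

Sorting keys before hashing means two artifacts that differ only in key order compare equal, which is the property a replay test depends on.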
9. The Adversarial Layer (Layer 1)
Layer 1 is the smallest but most valuable. It contains purpose-built edge cases that have caused bugs in the past or are likely to.
9.1 The V1 adversarial catalog
| Fixture | What it tests | Source |
|---|---|---|
| empty_input.txt | Empty input | Hand-written |
| unicode_only.txt | Unicode-only input (no ASCII) | Hand-written |
| extremely_large.txt | 10 MB input | Generated |
| truncated_pdf.bin | Truncated PDF binary | Generated |
| mismatched_currency.txt | Invoice with conflicting currencies | Hand-written |
| prompt_injection.txt | Invoice text containing prompt-injection payload | Hand-written |
| nested_500.txt | Deeply nested JSON | Generated |
| right_to_left.txt | RTL language invoice | Vendored (subset of CORD) |
| null_bytes.bin | Input with embedded null bytes | Generated |
| multiple_invoices.txt | One input, multiple invoices | Hand-written |
These are committed to tests/fixtures/inputs/adversarial/. Apart from the 10 MB extremely_large.txt, they are small (<1 MB total) but high-signal.
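The fixtures marked "Generated" in the catalog could be produced by a small script like the one below. The file contents are assumptions chosen for illustration (for instance, 500 nesting levels is inferred from the file name nested_500.txt, and 10 MB is taken as 10 × 1024 × 1024 bytes); the function name is hypothetical.

```python
from pathlib import Path


def write_generated_adversarial(out_dir: Path) -> None:
    """Write the generated adversarial fixtures into out_dir (sketch only)."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # 10 MB of filler text to exercise size limits.
    (out_dir / "extremely_large.txt").write_bytes(b"A" * (10 * 1024 * 1024))

    # A PDF header and the start of one object, cut off before the
    # cross-reference table and the %%EOF trailer.
    (out_dir / "truncated_pdf.bin").write_bytes(
        b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog"
    )

    # Text with NUL bytes embedded between tokens.
    (out_dir / "null_bytes.bin").write_bytes(b"Invoice\x00No\x00 123\x00\n")

    # 500 nested JSON arrays, built as a string so the generator itself
    # never recurses.
    depth = 500
    (out_dir / "nested_500.txt").write_text(
        "[" * depth + "]" * depth, encoding="ascii"
    )
```

Generating these files from code keeps their construction reviewable, which is harder with opaque binary blobs.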
10. Attribution and Provenance
Every vendored file is attributed in tests/fixtures/DATASET_LICENSES.md. The file lists, for each vendored dataset:
- Dataset name and version.
- Source URL.
- License.
- Citation (paper or DOI).
- Number of files vendored.
- Date vendored.
- Path to the vendored files.
Example entry:
### CORD (Consolidated Receipt Dataset)
- **Source:** https://github.com/clovaai/cord
- **Version:** v1 (HuggingFace mirror)
- **License:** CC-BY-4.0
- **Citation:** Park et al. (2019), "CORD: A Consolidated Receipt Dataset for Post-OCR Parsing", Document Intelligence Workshop at NeurIPS 2019.
- **Files vendored:** 100 (test split)
- **Path:** `tests/fixtures/inputs/invoices/cord/`
- **Vendored on:** 2026-06-22
- **Notes:** Used for receipt-parsing end-to-end tests. CC-BY-4.0 requires attribution, which is this entry.
The full file is at tests/fixtures/DATASET_LICENSES.md.
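Entries in the format shown above could be rendered from structured data so that every dataset carries the same fields. This is a sketch only: the `DatasetEntry` dataclass and `render_entry` function are hypothetical and not part of Paxman.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One attribution record, mirroring the fields listed in this section."""
    name: str
    source: str
    version: str
    license: str
    citation: str
    files_vendored: str
    path: str
    vendored_on: str
    notes: str = ""


def render_entry(entry: DatasetEntry) -> str:
    """Render one Markdown entry in the layout of the CORD example."""
    lines = [
        f"### {entry.name}",
        f"- **Source:** {entry.source}",
        f"- **Version:** {entry.version}",
        f"- **License:** {entry.license}",
        f"- **Citation:** {entry.citation}",
        f"- **Files vendored:** {entry.files_vendored}",
        f"- **Path:** `{entry.path}`",
        f"- **Vendored on:** {entry.vendored_on}",
    ]
    if entry.notes:
        lines.append(f"- **Notes:** {entry.notes}")
    return "\n".join(lines)
```

Because every field except notes is required, a dataset cannot be recorded without its license, source, and citation.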
11. The "Real Data" Layer (Layer 5)
Layer 5 is never in the repo. It is the caller's actual production data.
11.1 The contract
When Paxman is used in production, the caller is expected to:
- Run the full test suite against their own sample of production data.
- Verify that the replay_hash is stable across runs.
- Verify that the field-resolution rate meets the success-metric target (≥ 90%).
- File an issue if a real-world input is not handled well — this is how the V1 corpus grows.
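The ≥ 90% check in the list above could be computed as sketched here. The definition used (resolved fields divided by all expected fields across the sample, with None meaning unresolved) is an assumption, since this document does not define the metric; the function name and example field names are hypothetical.

```python
from typing import Dict, List, Optional

TARGET_RATE = 0.90  # success-metric target stated in this section


def field_resolution_rate(results: List[Dict[str, Optional[str]]]) -> float:
    """Fraction of expected fields that were resolved across all documents.

    Each dict maps an expected field name to its resolved value, or None
    when the field could not be resolved.
    """
    total = sum(len(doc) for doc in results)
    if total == 0:
        raise ValueError("no fields to evaluate")
    resolved = sum(1 for doc in results for value in doc.values() if value is not None)
    return resolved / total
```

For example, two documents with three expected fields each and one unresolved field give 5/6, roughly 83%, which falls short of the target.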
11.2 The contribution path
Real-world inputs that are anonymized and publicly shareable can be contributed to the V1 corpus via PR. The vendor procedure (§5) applies. Real-world inputs that contain PII or are under NDA must not be contributed.
12. The Corpus Roadmap
The V1 corpus is intentionally small. Here's how it grows:
| Phase | Corpus size | Source |
|---|---|---|
| V0.1 (initial preview) | 0 MB | No vendored data; tests use programmatic only |
| V0.3 (alpha) | ~10 MB | 50 CORD + 50 InvoiceBenchmark |
| V0.5 (beta) | ~35 MB | V1 corpus as catalogued in §2.1 |
| V1.0 (1.0 release) | ~50 MB | V1 corpus + curated golden artifacts |
| V2 | ~200 MB | Full CORD + SROIE (research-only) + MIDD + Polish Tenders |
| V3 | ~1 GB | Full HuggingFace receipt dataset + procurement data |
The corpus grows only when a dataset is needed for a specific test, and only if its license is in the allowed list.
13. See also
- TESTING_STRATEGY.md — test seams, property tests, replay tests
- DEVELOPMENT.md §5 Running Tests — running the test suite
- SECURITY.md §2 PII Handling Defaults — PII policy for test data
- EXTENDING.md — adding new adapters, capabilities, providers
- REPLAY_AND_DETERMINISM.md — replay-equality tests
- DEPENDENCIES.md — dev dependencies (faker, factory_boy, hypothesis, etc.)
- tests/fixtures/DATASET_LICENSES.md — full attribution catalog
- scripts/fetch_test_data.py — download script