Test Data Strategy

Status: Draft v1. Audience: Paxman contributors, maintainers, and downstream integrators. Related docs: TESTING_STRATEGY.md, DEVELOPMENT.md, EXTENDING.md, SECURITY.md, DEPENDENCIES.md

This document is the single source of truth for test data in Paxman. It defines:

The 5-layer test data model (synthetic → real).
The V1 dataset catalog — every dataset Paxman uses, with license, version, and intended use.
The licensing policy — what's allowed in a commercially-distributed library.
The vendor procedure — how to add or update a vendored dataset.
The attribution and provenance — how to credit every dataset.
The security and PII policy for test data.

Why this doc exists: Paxman normalizes data. The test suite must cover the full range of inputs, contracts, and edge cases. Test data shapes the canonical contract model, the planner heuristics, the reconciler logic, and the replay assertions. Designing the test data alongside the architecture (rather than retrofitting it later) is the cheaper path.

1. The 5-Layer Test Data Model

Paxman organizes test data in five layers, from synthetic to real:

Layer 5 — Real production data       (caller-supplied, NEVER committed)
            ▲
            │
Layer 4 — Public open datasets       (curated subsets, vendored in repo)
            ▲
            │
Layer 3 — Curated fixtures           (hand-picked, golden artifacts committed)
            ▲
            │
Layer 2 — Programmatic fixtures      (generated by factory_boy / faker / hypothesis)
            ▲
            │
Layer 1 — Synthetic edge cases       (purpose-built: malformed, adversarial, regression)

1.1 What each layer is for

Layer	Source	Committed?	Size	Purpose	Authored by
1. Synthetic edge cases	Hand-written or generated in-test	Yes (the small set)	Tiny (<1 MB)	Regression tests for known bugs, adversarial inputs, malformed data	Paxman team
2. Programmatic fixtures	`factory_boy` + `faker` + `hypothesis`	No (gitignored)	Generated per test run	Property tests, fuzzing, large variation coverage	Test code
3. Curated fixtures	Hand-picked from layer 4	Yes	~10–50 MB	Integration tests, replay golden artifacts	Paxman team
4. Public open datasets	Vendored from external open sources	Yes (subset only)	~50 MB for V1	End-to-end smoke tests, real-world OCR noise, multi-currency	External + curated
5. Real production data	Caller's actual data	NEVER	n/a	Production validation	Caller

1.2 The mapping to test layers

Test type (from TESTING_STRATEGY.md §1)	Data layer used
Property tests	Layer 2 (programmatic)
Unit tests	Layer 1 (edge cases) + small Layer 3 fixtures
Integration tests	Layer 3 (curated) + Layer 2 (generated)
End-to-end tests	Layer 4 (open datasets) + Layer 3 (golden artifacts)
Production validation	Layer 5 (real data) — never committed

2. V1 Dataset Catalog

This is the authoritative list of datasets Paxman uses in V1. Every entry's license has been checked against the policy in §4.1; datasets that cannot be vendored in a commercially-distributed library are marked as V2 only or development only.

2.1 Receipt and invoice documents (end-to-end, replay, evidence, MONEY)

Dataset	License	Size	What it gives Paxman	Layer	V1 use
clovaai/cord	CC-BY-4.0	1,000 receipts (1,000 OCR'd)	Receipt parsing with bbox + ground truth	4	Required
naver-clova-ix/cord-v1	CC-BY-4.0	1,000 receipts	CORD v1 mirror on HuggingFace	4	Required (cleaner mirror)
naver-clova-ix/cord-v2	CC-BY-4.0	1,000 receipts	CORD v2 with `sub_group_id`	4	Recommended
ICDAR 2019 SROIE	Research-only (per ICDAR-2019)	1,000 receipts	Real scanned receipts, key info extraction	4	Required for OCR noise (development only; see §4.2)
jngb-labs/InvoiceBenchmark	MIT	200 synthetic + 200 PDF + 200 PNG	Cent-perfect ground truth, multi-currency, deterministic	4	Required for V1 MONEY tests
alamgirqazi/invoice-ocr-synthetic	Apache-2.0	3,000 invoices	Full field-level labels, line items, multiple currencies	4	Required
AmineTibari/InvoiceJSON	CC-BY-4.0	340 invoices	Image + JSON structured annotations	4	Recommended
wizenheimer/invoices_receipts_ocr_v1	MIT	Subset of invoices-and-receipts_ocr_v1	OCR'd invoices with structured fields	4	Recommended
kaydee/wildreceipt	Apache-2.0	~1,500 wild receipts	"In-the-wild" receipts (harder than CORD/SROIE)	4	Recommended
mathieu1256/FATURA	CC-BY-NC-4.0	10,000 invoices, 50 layouts	Multi-layout invoices with bounding boxes	4	V2 only (NC license)
salesforce/inv-cdip	CC-BY-NC-4.0	350 labeled + 200k unlabeled	Real invoices with 7 fields	4	V2 only (NC license)
Voxel51/high-quality-invoice-images-for-ocr	Research	8,181 synthetic (1,489 labeled)	Synthetic invoices, tabular layouts	4	Optional V2

V1 corpus plan (files committed to the repo):

Source	Files vendored	Estimated size
CORD test split (CC-BY-4.0)	100 receipts	~5 MB
InvoiceBenchmark (MIT)	200 invoices (full)	~10 MB
alamgirqazi (Apache-2.0)	500 invoices	~15 MB
wildreceipt (Apache-2.0)	200 receipts	~5 MB
Total V1	~1,000 documents	~35 MB

SROIE is not vendored in the repo (research-only license) but is referenced in the test suite via a documented download path. See §4.2 and §6.

2.2 Contracts (Pydantic, JSON Schema, OpenAPI, Dict DSL)

Source	License	What it gives Paxman	V1 use
JSON-Schema-Test-Suite	BSD-3-Clause / Apache-2.0 (per file)	Every JSON Schema feature tested across drafts	Required for the JSON Schema adapter
OpenAPI Petstore v3.0	MIT	Canonical OpenAPI 3.0 example	Required for the OpenAPI adapter smoke test
OpenAPI Petstore v3.1	MIT	OpenAPI 3.1 with `$ref`, `oneOf`, etc.	Required
openapi-json-schema-tools test specs	Apache-2.0	3.0/3.1 unit test specs	Recommended
OQO (APH123614/oqo)	MIT (code) / CC-BY-4.0 (data)	Open-Quote-Object JSON Schema, 72 anonymized quotes	Highly recommended for the quotation use case
OpenAPI-Specification examples	Apache-2.0	~20 official example specs	Required — covers all common OAS features

2.3 Procurement data (use case C)

Source	License	Size	V1 use
TED (Tenders Electronic Daily)	Commission Decision 2011/833/EU	200k+ notices/yr	Required for the procurement use case
atlasprzetargow/polish-tenders-dataset	CC-BY-4.0 (data) / MIT (code)	1.4M Polish + EU notices	Recommended
OpenMLDatasets/ted_2025_07_sample	CC0 (sample) / commercial (full)	100 sample / 260k+ full	Required (sample) — easy entry point
Urwashanza/europrocure-10-public-procurement	(research)	1.5M records, 2016–2025	V2 only (research license)
GTI Global Public Procurement Dataset (GPPD)	CC BY-NC 3.0	42 countries, 2006–2021	V2 only (NC license)
Valan procurement sample	MIT (code) / free for eval (data)	1,000-row sample	Optional V2

2.4 Multi-page and long-form documents

Source	License	What it gives
lmms-lab/MP-DocVQA	MIT	Multi-page VQA — 5,000+ pages
theatticusproject/cuad	CC-BY-4.0	Contract Understanding Atticus Dataset — 510 contracts, 13k+ annotations
AgamiAI/Indian-Bank-Statements (via Thoughtworks benchmark)	Apache-2.0	Bank statements, multi-page tables

2.5 Synthetic data generators (programmatic, in-test)

These are libraries, not datasets. They generate Layer 2 fixtures at test time.

Library	License	Purpose	V1 use
Faker	MIT	Names, addresses, emails, currencies, dates	Required — core generator
factory_boy	MIT	Build Pydantic/dataclass factories using Faker	Required for V1 test code
pydantic-factories	MIT	Auto-derive factory from Pydantic model (the project has since been renamed polyfactory)	Recommended for Pydantic-heavy tests
Hypothesis	MPL-2.0	Property-based test data generation	Required — already in dev deps
hypothesis-jsonschema	MPL-2.0	Generate instances from JSON Schema	Required for `contract/` adapter tests
dirty-equals	MIT	Fuzzy comparison for tests	Required for replay-equality tests

3. Directory layout

The full V1 test data directory tree, mirroring the 5-layer model:

tests/fixtures/
├── README.md                          # quick orientation
├── DATASET_LICENSES.md                # attribution for every vendored file
│
├── contracts/                         # LAYER 3: curated contracts
│   ├── pydantic/
│   │   ├── invoice.py
│   │   ├── quotation.py
│   │   ├── procurement.py
│   │   ├── receipt.py
│   │   ├── multi_page.py
│   │   └── edge_cases/
│   │       ├── empty_model.py
│   │       ├── deeply_nested.py
│   │       ├── with_money.py          # for MONEY type tests
│   │       ├── all_v1_types.py        # exercises every field type
│   │       └── invalid_*.py           # validator tests
│   ├── json_schema/
│   │   ├── invoice.json
│   │   ├── quotation.json
│   │   ├── procurement.json
│   │   ├── receipt.json
│   │   ├── multi_page.json
│   │   └── drafts/                    # covers all JSON Schema drafts
│   │       ├── draft-04.json
│   │       ├── draft-06.json
│   │       ├── draft-07.json
│   │       ├── draft-2019-09.json
│   │       └── draft-2020-12.json
│   ├── dict_dsl/
│   │   ├── invoice.py
│   │   ├── quotation.py
│   │   ├── procurement.py
│   │   └── edge_cases/
│   └── openapi/
│       ├── petstore_3_0.yaml
│       ├── petstore_3_1.yaml
│       ├── procurement_api.yaml
│       └── edge_cases/
│
├── inputs/                            # LAYER 4: open dataset samples
│   ├── invoices/
│   │   ├── synthetic/                 # small synthetic inputs for smoke tests
│   │   │   ├── invoice_plain.txt
│   │   │   ├── invoice_email.txt
│   │   │   └── invoice_csv.csv
│   │   ├── cord/                      # vendored CORD samples (CC-BY-4.0)
│   │   │   ├── cord_sample_001.png
│   │   │   ├── cord_sample_001.json
│   │   │   └── ...                    # ~100 receipts (PNG + JSON pairs)
│   │   ├── invoicebench/              # vendored InvoiceBenchmark (MIT)
│   │   │   ├── invoicebench_001.md
│   │   │   ├── invoicebench_001.pdf
│   │   │   ├── invoicebench_001.png
│   │   │   └── ...                    # 200 invoices (.md, .pdf, .png each)
│   │   └── alamgirqazi/               # vendored (Apache-2.0)
│   │       └── ...                    # ~500 files
│   ├── receipts/
│   │   ├── wildreceipt/               # vendored (Apache-2.0)
│   │   └── synthetic/
│   ├── quotations/
│   │   ├── synthetic/
│   │   │   ├── quotation_simple.txt
│   │   │   └── quotation_with_footnotes.txt
│   │   └── oqo/                       # vendored OQO (CC-BY-4.0)
│   │       └── ...                    # 72 files
│   ├── procurement/
│   │   ├── ted_sample/                # vendored TED (Commission Decision)
│   │   ├── polish_tenders/
│   │   └── synthetic/
│   ├── multi_page/
│   │   ├── mp_docvqa/                 # vendored (MIT)
│   │   └── cuad/                      # vendored (CC-BY-4.0)
│   └── adversarial/                   # LAYER 1: edge cases
│       ├── empty_input.txt
│       ├── unicode_only.txt
│       ├── extremely_large.txt        # 10MB
│       ├── truncated_pdf.bin
│       ├── mismatched_currency.txt
│       └── prompt_injection.txt       # for inference security tests
│
├── artifacts/                         # LAYER 3: golden artifacts
│   ├── README.md                      # placeholder for golden artifacts
│   ├── invoice_success.json
│   ├── invoice_partial.json
│   ├── invoice_unresolved.json
│   ├── quotation_success.json
│   ├── procurement_success.json
│   └── invalid_contract.json
│
└── generated/                         # LAYER 2: programmatic (gitignored)
    ├── .gitignore
    └── README.md

Total size estimate: ~50 MB of vendored data, all open-licensed.
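The size estimate can be checked mechanically. The following is a minimal sketch, not part of the planned tooling: the helper names, the 50 MB default, and the choice to skip the gitignored generated/ directory are assumptions made for illustration.

```python
from pathlib import Path


def total_size_mb(root: Path, exclude: tuple = ("generated",)) -> float:
    """Sum the size of all files under root, skipping excluded top-level dirs."""
    total_bytes = 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # The first component relative to root names the top-level entry.
        if path.relative_to(root).parts[0] in exclude:
            continue
        total_bytes += path.stat().st_size
    return total_bytes / (1024 * 1024)


def within_budget(root: Path, budget_mb: float = 50.0) -> bool:
    """Return True if the committed fixtures fit the (assumed) size budget."""
    return total_size_mb(root) <= budget_mb
```

A check like `within_budget(Path("tests/fixtures"))` could run in CI so that an oversized vendoring PR fails early.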

4. Licensing Policy

4.1 Allowed licenses for vendored data

Paxman is intended to be a commercially-distributed library. Vendored test data must be compatible with that.

License	OK to vendor?	Notes
MIT	✅ Yes	No restrictions
Apache-2.0	✅ Yes	Include license + notice in `DATASET_LICENSES.md`
BSD-2 / BSD-3	✅ Yes	Include license
CC0	✅ Yes	No attribution required, but good practice to credit
CC-BY-4.0	✅ Yes	Attribution required; must be documented in `DATASET_LICENSES.md`
CC-BY-3.0	✅ Yes	Attribution required
CC-BY-SA-4.0	⚠️ Conditional	Share-alike applies to derivatives. Avoid unless we accept the SA clause.
CC-BY-NC-4.0	❌ No for V1 vendor	Non-commercial only. May be used in development by individual developers but not vendored in the repo.
CC-BY-NC-SA-4.0	❌ No	Non-commercial + share-alike
Research-only	❌ No	Cannot redistribute. Use individual dev accounts for development.
Commercial	❌ No	Costs money; budget approval required.
Unknown	❌ No	Do not use.

Policy: V1 vendors only MIT, Apache-2.0, BSD, CC0, and CC-BY (non-SA) datasets. See DATASET_LICENSES.md for the full list.
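The policy reduces to an allow-list lookup. A minimal sketch follows; the function name and the SPDX-style identifiers (for example `CC0-1.0`, `BSD-3-Clause`) are assumptions, since the table above uses shorter labels.

```python
# Hypothetical allow-list mirroring the "Yes" rows of the table in §4.1.
ALLOWED_LICENSES = frozenset({
    "MIT",
    "Apache-2.0",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "CC0-1.0",
    "CC-BY-3.0",
    "CC-BY-4.0",
})


def is_vendorable(license_id: str) -> bool:
    """Return True only for licenses the V1 policy allows in the repo.

    Anything not explicitly listed (NC, SA, research-only, commercial,
    unknown) is rejected, which matches the "Unknown: do not use" row.
    """
    return license_id.strip() in ALLOWED_LICENSES
```

Rejecting by default means a new or misspelled license label fails closed rather than slipping into the repo.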

4.2 Research-only datasets (development only)

Some important datasets (SROIE, INV-CDIP, MP-DocVQA subsets) are released under "research-only" terms. These cannot be vendored but can be used by individual developers for local development. The CI pipeline must not download or distribute them.

The strategy:

The team documents the download path in scripts/fetch_test_data.py (see §6).
CI runs only against vendored data.
A developer who wants to run with SROIE can run python scripts/fetch_test_data.py --dataset sroie locally.
CI does not exercise the SROIE code path; it's only for local exploration.
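One way to keep the SROIE code path out of CI is a small guard that tests consult before running. This is a sketch under stated assumptions: the local directory `tests/fixtures/local/sroie` and the use of a `CI` environment variable are hypothetical choices, not something this document specifies.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical, gitignored location for locally fetched research-only data.
SROIE_DIR = Path("tests/fixtures/local/sroie")


def sroie_tests_enabled(
    data_dir: Path = SROIE_DIR,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Enable SROIE tests only outside CI and only if data was fetched locally."""
    env = os.environ if env is None else env
    # Assumption: the CI system exports CI=true (or CI=1).
    running_in_ci = env.get("CI", "").lower() in {"1", "true"}
    has_data = data_dir.is_dir() and any(data_dir.iterdir())
    return (not running_in_ci) and has_data
```

With pytest, a test module could then use `pytest.mark.skipif(not sroie_tests_enabled(), reason="SROIE not fetched")` so the tests are skipped rather than failed when the data is absent.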

4.3 PII and sensitive data

Paxman normalizes potentially-sensitive data (invoices, receipts, procurement). Test data must not contain real PII.

Source	PII status	Action
CORD	Already-anonymized Indonesian receipts	✅ OK to vendor
SROIE	Desensitized by ICDAR (PII blurred)	✅ OK (research-only license is the constraint, not PII)
InvoiceBenchmark	Synthetic	✅ OK
alamgirqazi	Synthetic	✅ OK
wildreceipt	Anonymized	✅ OK
OQO	Anonymized	✅ OK
TED	Real but public	⚠️ OK; contains real company names, but these come from public procurement records and are not private PII.
MIDD	Real	⚠️ OK for local use; manually verified but not anonymized. Not vendored in V1 (see §6.3).
FATURA	Synthetic	✅ OK
Polish Tenders	Anonymized	✅ OK

Rule: if a dataset contains real PII that has not been explicitly anonymized, do not vendor it.

See SECURITY.md §2 PII Handling Defaults for the broader policy.
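A first-pass PII check can be partly automated before a human review. The sketch below is illustrative only: the two patterns (email addresses and IBAN-like strings) and the function name are assumptions, and a regex scan cannot replace the manual check required by the rule above.

```python
import re

# Illustrative patterns only; real PII review needs locale-aware rules
# and a human reviewer. These are not Paxman's actual checks.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def find_pii_candidates(text: str) -> list:
    """Return (kind, match) pairs that a reviewer should inspect by hand."""
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(0)))
    return hits
```

An empty result does not prove a file is clean; it only means these two patterns found nothing.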

5. The Vendor Procedure

Adding a new vendored dataset follows this procedure. It is intentionally heavy because vendored data stays in the git history permanently.

5.1 Step-by-step

Propose. Open an issue or PR with the dataset name, URL, license, intended use, and the layer it belongs to.
License check. Confirm the license is in the "allowed" list (§4.1). If unsure, ask the maintainers.
PII check. Confirm the dataset does not contain real PII (§4.3).
Sample first. Vendor only a sample (≤ 200 files) for V1. The full dataset can be vendored in V2 if needed.
Add to catalog. Add an entry to docs/TEST_DATA.md §2 and tests/fixtures/DATASET_LICENSES.md.
Add to download script. Add the URL and extraction logic to scripts/fetch_test_data.py.
Add to .gitignore. Add patterns to ensure non-vendored data (e.g., SROIE) is never committed.
CI gate. Add a CI check that asserts no file under tests/fixtures/ violates the license catalog.
Update tests. Reference the new dataset in the appropriate test file.
Update docs. Update TESTING_STRATEGY.md if relevant.
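The CI gate step can be sketched as a directory walk against the catalog. This is a minimal illustration: the catalog shape (relative directory mapped to license) and the function name are assumptions. A real gate would also need to exempt first-party files such as README.md.

```python
from pathlib import Path


def files_without_allowed_license(root: Path, catalog: dict, allowed: set) -> list:
    """List files under root not covered by a catalog entry with an allowed license.

    catalog maps a directory path relative to root (e.g. "inputs/invoices/cord")
    to its license label; this shape is hypothetical.
    """
    covered_dirs = [root / rel for rel, lic in catalog.items() if lic in allowed]
    offenders = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # A file is covered if any catalogued directory is one of its ancestors.
        if not any(covered in path.parents for covered in covered_dirs):
            offenders.append(path.relative_to(root).as_posix())
    return offenders
```

The CI job would fail when the returned list is non-empty and print the offending paths.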

5.2 What "vendor" means

Vendored = checked into git, downloaded by scripts/fetch_test_data.py, available to CI.
Not vendored = downloaded manually by individual developers, not in git, not in CI.

The boundary is sharp. If a dataset is "research-only" or "non-commercial", it is not vendored even if it would be useful.

5.3 Updating an existing dataset

When a new version of a vendored dataset is released:

Verify the new version is still under an allowed license.
Verify the new version does not introduce PII.
Update the download URL and version in scripts/fetch_test_data.py.
Run python scripts/fetch_test_data.py --update to re-vendor.
Re-run the full test suite to confirm no regression.
Update docs/TEST_DATA.md and tests/fixtures/DATASET_LICENSES.md.

6. The Download Script

scripts/fetch_test_data.py is the single entry point for vendoring datasets. It:

Downloads datasets from their canonical sources.
Extracts them to the right location under tests/fixtures/.
Verifies checksums.
Validates the license (refuses to vendor disallowed licenses).
Logs every action to tests/fixtures/DOWNLOAD_LOG.md.

The script is the source of truth for what's vendored. See scripts/fetch_test_data.py for the full implementation spec.
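The checksum step might look like the following. It is a sketch, not the script's actual code: this document does not name a hash algorithm, so SHA-256 and both function names are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the pinned value (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Pinning the expected digest next to the download URL means a silently changed upstream file is caught before it reaches tests/fixtures/.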

6.1 Usage

# Vendor everything (the V1 corpus)
python scripts/fetch_test_data.py

# Vendor a specific dataset
python scripts/fetch_test_data.py --dataset cord

# Update to the latest version of all vendored datasets
python scripts/fetch_test_data.py --update

# List datasets and their licenses
python scripts/fetch_test_data.py --list

# Validate the vendored data against the license catalog
python scripts/fetch_test_data.py --validate-licenses

6.2 CI integration

CI runs:

# Verify the vendored data is present and intact
python scripts/fetch_test_data.py --verify

# Verify the license catalog
python scripts/fetch_test_data.py --validate-licenses

CI does not download anything; it only verifies.
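The flags shown in §6.1 and §6.2 could be wired up with argparse roughly as follows. This is a sketch of the command-line surface only: the help strings and the `list_datasets` attribute name are assumptions, and the real script may organize its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags this document mentions."""
    parser = argparse.ArgumentParser(
        prog="fetch_test_data.py",
        description="Vendor and verify Paxman test datasets.",
    )
    parser.add_argument("--dataset", help="vendor only this dataset, e.g. cord")
    parser.add_argument("--update", action="store_true",
                        help="re-vendor the latest version of vendored datasets")
    parser.add_argument("--list", action="store_true", dest="list_datasets",
                        help="list datasets and their licenses")
    parser.add_argument("--verify", action="store_true",
                        help="check vendored data is present and intact (no download)")
    parser.add_argument("--validate-licenses", action="store_true",
                        help="check vendored data against the license catalog")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Keeping `--verify` and `--validate-licenses` free of network access is what lets CI run them without downloading anything.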

6.3 What the script does NOT do

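It does not download research-only datasets (SROIE, INV-CDIP) by default; a developer must request one explicitly on a local machine, for example with --dataset sroie (see §4.2).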
It does not download non-commercial datasets (FATURA, MIDD).
It does not modify the license catalog.
It does not run the test suite.

7. The Programmatic Layer (Layer 2)

Layer 2 fixtures are generated at test time. They live under tests/fixtures/generated/ (gitignored).

7.1 The generator module

tests/fixtures/generators/ will contain (in code, not in this doc):

invoices.py — factory_boy factories for invoice inputs.
contracts.py — pydantic-factories factories for CanonicalContracts.
candidates.py — factory_boy factories for Candidate sets (Reconciler tests).
artifacts.py — factory_boy factories for ExecutionArtifacts.
hypothesis_strategies.py — Hypothesis strategies for property tests.

7.2 The Hypothesis strategy catalog

A paxman.testing module (public) will expose:

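from paxman.testing import strategies

# Note: .example() is for interactive exploration only; inside a test,
# draw values with @given(strategies.contracts()) instead.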

# Generate CanonicalContracts
contract = strategies.contracts().example()

# Generate raw inputs
input_data = strategies.invoice_inputs().example()

# Generate Budgets
budget = strategies.budgets().example()

# Generate Policies
policy = strategies.policies().example()

# Generate CapabilityRegistries
registry = strategies.registries().example()

# Generate CandidateResult sets
candidates = strategies.candidate_sets().example()

# Generate ExecutionArtifacts
artifact = strategies.artifacts().example()

These strategies are the foundation of TESTING_STRATEGY.md §3 Property Tests.

7.3 Reproducibility

All programmatic fixtures are generated with a fixed random seed by default.
factory_boy uses factory.random.reseed_random(seed) for reproducibility.
Hypothesis runs with the derandomize=True setting, which seeds its random number generator from a hash of each test function, so runs are repeatable.
The seed is recorded in tests/fixtures/generated/SEED.txt so a failing test can be reproduced.
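Recording and reusing the seed can be sketched with the standard library alone. The default seed value and the helper names below are hypothetical. In the real suite the same integer would also be passed to `factory.random.reseed_random(seed)`.

```python
import random
from pathlib import Path

DEFAULT_SEED = 1234  # hypothetical fixed default


def load_or_record_seed(seed_file: Path, default: int = DEFAULT_SEED) -> int:
    """Read the seed from SEED.txt, or write the default there on first use."""
    if seed_file.exists():
        return int(seed_file.read_text().strip())
    seed_file.parent.mkdir(parents=True, exist_ok=True)
    seed_file.write_text(f"{default}\n")
    return default


def seeded_rng(seed_file: Path) -> random.Random:
    """Return a private RNG seeded from the recorded value.

    A dedicated random.Random instance avoids disturbing the global RNG
    that other test code may rely on.
    """
    return random.Random(load_or_record_seed(seed_file))
```

To reproduce a failing run, a developer copies the recorded SEED.txt into place before re-running the tests.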

8. The Curated Layer (Layer 3)

Layer 3 fixtures are hand-picked and committed. They serve as the integration-test ground truth.

8.1 Selection criteria

A fixture is added to Layer 3 when:

It exercises a specific subsystem feature.
It has a known, expected ExecutionArtifact (a "golden artifact").
It would be expensive to regenerate programmatically.

8.2 The canonical V1 fixtures

These are the nine curated fixtures for V1:

Fixture	Contract	Input	Expected status
`invoice_simple`	`Invoice` Pydantic	Plain-text invoice	`SUCCESS`
`invoice_partial`	`Invoice` Pydantic	Invoice with missing `tax_amount`	`PARTIAL_SUCCESS`
`invoice_unresolved`	`Invoice` Pydantic	Invoice with no supplier name	`UNRESOLVED`
`quotation_simple`	`Quotation` Pydantic	OQO-format quote	`SUCCESS`
`procurement_csv`	`Procurement` Pydantic	CSV with multi-currency lines	`SUCCESS`
`invalid_contract`	`Invoice` Pydantic	n/a (contract is bad)	`INVALID_CONTRACT`
`execution_failed`	`Invoice` Pydantic	Input that triggers a capability crash	`EXECUTION_FAILED`
`money_mismatch`	`Invoice` Pydantic (multi-currency)	Invoice with conflicting currencies	`PARTIAL_SUCCESS`
`multi_page`	`MultiPageInvoice`	Multi-page PDF	`SUCCESS`

8.3 Golden artifacts

For each curated fixture, tests/fixtures/artifacts/ contains a JSON file with the expected ExecutionArtifact, including the replay_hash. These are written by hand based on the canonical contract model and the intended planner behavior. They are the ground truth for replay-equality tests.

Important: the golden artifacts are versioned. When the ExecutionArtifact schema changes, the golden artifacts are regenerated. See REPLAY_AND_DETERMINISM.md §3 The Replay Protocol for the version-compatibility rules.

The actual golden artifacts are not written in this doc; they are written in code, against the actual ExecutionArtifact schema, once that schema is implemented. This gap is deliberate: we do not commit to an artifact JSON shape until the code defines it.
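The comparison step of a replay-equality test can still be sketched without fixing any artifact shape. The code below treats an artifact as an opaque JSON-compatible dict. Hashing canonical JSON with SHA-256 is an assumption made for illustration and is not necessarily how Paxman computes replay_hash; REPLAY_AND_DETERMINISM.md defines the real rules.

```python
import hashlib
import json


def canonical_hash(artifact: dict) -> str:
    """Hash a JSON-compatible dict independently of key order and whitespace."""
    payload = json.dumps(
        artifact,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def matches_golden(actual: dict, golden: dict) -> bool:
    """Return True when the produced artifact equals the golden one."""
    return canonical_hash(actual) == canonical_hash(golden)
```

Comparing hashes rather than raw files keeps the check stable if a golden file is reformatted or its keys are reordered.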

9. The Adversarial Layer (Layer 1)

Layer 1 is the smallest but most valuable. It contains purpose-built edge cases that have caused bugs in the past or are likely to.

9.1 The V1 adversarial catalog

Fixture	What it tests	Source
`empty_input.txt`	Empty input	Hand-written
`unicode_only.txt`	Unicode-only input (no ASCII)	Hand-written
`extremely_large.txt`	10 MB input	Generated
`truncated_pdf.bin`	Truncated PDF binary	Generated
`mismatched_currency.txt`	Invoice with conflicting currencies	Hand-written
`prompt_injection.txt`	Invoice text containing prompt-injection payload	Hand-written
`nested_500.txt`	Deeply nested JSON	Generated
`right_to_left.txt`	RTL language invoice	Vendored (subset of CORD)
`null_bytes.bin`	Input with embedded null bytes	Generated
`multiple_invoices.txt`	One input, multiple invoices	Hand-written

These are committed to tests/fixtures/inputs/adversarial/. Apart from the 10 MB extremely_large.txt, they are small (<1 MB total) but high-signal.
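The fixtures marked "Generated" can be produced by a few small helpers. The functions below are a sketch with hypothetical names. The default depth of 500 follows the `nested_500.txt` file name; the other parameters are assumptions.

```python
def nested_json(depth: int = 500) -> str:
    """Return a JSON document made of `depth` nested empty arrays."""
    return "[" * depth + "]" * depth


def with_null_bytes(text: str, every: int = 8) -> bytes:
    """Insert a NUL byte between every `every` bytes of the UTF-8 encoding."""
    raw = text.encode("utf-8")
    chunks = [raw[i:i + every] for i in range(0, len(raw), every)]
    return b"\x00".join(chunks)


def truncate(data: bytes, keep_fraction: float = 0.5) -> bytes:
    """Keep only the leading part of a binary payload, e.g. a cut-off PDF."""
    return data[: int(len(data) * keep_fraction)]
```

Generating these files from code keeps their construction documented and makes it easy to add variants such as a deeper nesting level.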

10. Attribution and Provenance

Every vendored file is attributed in tests/fixtures/DATASET_LICENSES.md. The file lists, for each vendored dataset:

Dataset name and version.
Source URL.
License.
Citation (paper or DOI).
Number of files vendored.
Date vendored.
Path to the vendored files.

Example entry:

### CORD (Consolidated Receipt Dataset)

- **Source:** https://github.com/clovaai/cord
- **Version:** v1 (HuggingFace mirror)
- **License:** CC-BY-4.0
- **Citation:** Park et al. (2019), "CORD: A Consolidated Receipt Dataset for Post-OCR Parsing", Document Intelligence Workshop at NeurIPS 2019.
- **Files vendored:** 100 (test split)
- **Path:** `tests/fixtures/inputs/invoices/cord/`
- **Vendored on:** 2026-06-22
- **Notes:** Used for receipt-parsing end-to-end tests. CC-BY-4.0 requires attribution, which this entry provides.

The full file is at tests/fixtures/DATASET_LICENSES.md.
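Entries in this layout can be rendered from structured data so that every dataset lists the same fields. The dataclass and function below are a hypothetical sketch; the field set is taken from the list above and the Markdown layout from the example entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One vendored dataset, with the fields required for attribution."""
    name: str
    source: str
    version: str
    license: str
    citation: str
    files_vendored: str
    path: str
    vendored_on: str  # ISO date string


def render_entry(entry: DatasetEntry) -> str:
    """Render one entry in the Markdown layout of the example above."""
    lines = [
        f"### {entry.name}",
        "",
        f"- **Source:** {entry.source}",
        f"- **Version:** {entry.version}",
        f"- **License:** {entry.license}",
        f"- **Citation:** {entry.citation}",
        f"- **Files vendored:** {entry.files_vendored}",
        f"- **Path:** `{entry.path}`",
        f"- **Vendored on:** {entry.vendored_on}",
    ]
    return "\n".join(lines) + "\n"
```

Because every field is required by the dataclass, an entry with a missing license or citation fails at construction time instead of producing an incomplete attribution.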

11. The "Real Data" Layer (Layer 5)

Layer 5 is never in the repo. It is the caller's actual production data.

11.1 The contract

When Paxman is used in production, the caller is expected to:

Run the full test suite against their own sample of production data.
Verify that the replay_hash is stable across runs.
Verify that the field-resolution rate meets the success-metric target (≥ 90%).
File an issue if a real-world input is not handled well — this is how the V1 corpus grows.
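The field-resolution check can be expressed as simple arithmetic. The sketch below assumes the rate is resolved fields divided by total contract fields across the caller's sample; this document does not define the metric precisely, so treat that definition and the function names as assumptions.

```python
def field_resolution_rate(resolved_fields: int, total_fields: int) -> float:
    """Fraction of contract fields that were resolved across a sample."""
    if total_fields <= 0:
        raise ValueError("total_fields must be positive")
    if not 0 <= resolved_fields <= total_fields:
        raise ValueError("resolved_fields must be between 0 and total_fields")
    return resolved_fields / total_fields


def meets_target(resolved_fields: int, total_fields: int, target: float = 0.90) -> bool:
    """Return True when the rate reaches the success-metric target (>= 90%)."""
    return field_resolution_rate(resolved_fields, total_fields) >= target
```

For example, a sample with 1,000 contract fields needs at least 900 resolved fields to meet the target.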

11.2 The contribution path

Real-world inputs that are anonymized and publicly shareable can be contributed to the V1 corpus via PR. The vendor procedure (§5) applies. Real-world inputs that contain PII or are under NDA must not be contributed.

12. The Corpus Roadmap

The V1 corpus is intentionally small. Here's how it grows:

Phase	Corpus size	Source
V0.1 (initial preview)	0 MB	No vendored data; tests use programmatic only
V0.3 (alpha)	~10 MB	50 CORD + 50 InvoiceBenchmark
V0.5 (beta)	~35 MB	V1 corpus as catalogued in §2.1
V1.0 (1.0 release)	~50 MB	V1 corpus + curated golden artifacts
V2	~200 MB	Full CORD + Polish Tenders, plus SROIE (research-only) and MIDD only if their licenses are on the allowed list by then (§4.1)
V3	~1 GB	Full HuggingFace receipt dataset + procurement data

The corpus grows only when a dataset is needed for a specific test, and only if its license is in the allowed list.

13. See also

TESTING_STRATEGY.md — test seams, property tests, replay tests
DEVELOPMENT.md §5 Running Tests — running the test suite
SECURITY.md §2 PII Handling Defaults — PII policy for test data
EXTENDING.md — adding new adapters, capabilities, providers
REPLAY_AND_DETERMINISM.md — replay-equality tests
DEPENDENCIES.md — dev dependencies (faker, factory_boy, hypothesis, etc.)
tests/fixtures/DATASET_LICENSES.md — full attribution catalog
scripts/fetch_test_data.py — download script