InputProfile Module Specification¶
Status: Accepted Date: 2026-06-22 Deciders: Paxman core team Related docs: ARCHITECTURE.md §4.2, PACKAGE_STRUCTURE.md §4.2, ADR-0002, REPLAY_AND_DETERMINISM.md
1. Overview¶
InputProfile is a lightweight metadata structure derived from raw input before any capability invocation. It sits between input reception and planner.plan(...), providing the Planner with the minimal signal it needs to make heuristic decisions without re-scanning the input.
Per ARCHITECTURE.md §4.2, the Planner is a pure function over (CanonicalContract, InputProfile, Budget, Policy, CapabilityRegistry). The InputProfile is the Planner's view of the input — not the input itself. It is derived once, passed into the Planner, and never mutated.
Why it exists: The Planner's heuristic chain needs to answer questions like "is this input empty?", "is this JSON or plain text?", "how dense is the content?" — without invoking capabilities (which is the Executor's job). InputProfile answers these questions in O(input_size) time with zero side effects.
Where it lives: planner/input_profile.py per PACKAGE_STRUCTURE.md §4.2. It is an internal module; not re-exported in api/.
2. Design Goals¶
- Lightweight. O(input_size) once, cached via
content_hash. No multi-pass scanning, no external calls. - Deterministic. Same input bytes → same InputProfile, byte-for-byte. No clock, no I/O, no random state.
- Pure. No capability invocation. The profile is derived from raw bytes alone. Per PACKAGE_STRUCTURE.md §4.2: "lightweight input classifier (no capability invocation)."
- Side-effect-free constructor. The
make_profile()function is a pure function. Per ADR-0002, the Planner is a pure function; its inputs must be equally pure. - Five-field V1 surface. Cap at 5 fields. Defer content classification (e.g., "is this a financial document?") to V2.
- Constructed at most once per
paxman.normalize()call. The Planner receives the already-constructed profile. No lazy construction, no re-derivation.
3. Data Model¶
from attrs import frozen
from typing import Literal
InputType = Literal[
"text",
"html",
"csv",
"json",
"pdf_text",
"email",
"empty",
"unknown",
]
@frozen
class InputProfile:
input_type: InputType
size: int
content_hash: str # 64-char lowercase hex (SHA-256)
density: float # 0.0–1.0
is_empty: bool
3.1 Field Definitions¶
| Field | Type | Range | Semantics | Readers |
|---|---|---|---|---|
input_type |
InputType |
8 enum values | Coarse classification of the raw input format. Drives heuristic selection in the Planner (e.g., "JSON input → try regex before inference"). | Planner (heuristics.py) |
size |
int |
>= 0 |
Byte count of the normalized input (UTF-8 encoded if str was provided). Used by Planner to estimate cost and by Budget to reject oversized inputs. |
Planner, Budget |
content_hash |
str |
64-char lowercase hex | SHA-256 hex digest of the normalized bytes. Used for replay hash composition (REPLAY_AND_DETERMINISM.md §2.1) and as a cache key. | Artifact (replay_hash), Planner (cache key) |
density |
float |
0.0 <= density <= 1.0 |
Ratio of non-whitespace characters to total bytes. Heuristic signal for "is this a token-heavy LLM candidate?" Used by Planner to short-circuit capability selection. | Planner (scoring.py) |
is_empty |
bool |
True / False |
True if size == 0 or the input is whitespace-only. The Planner short-circuits to UNRESOLVED for all fields when is_empty is True. |
Planner (heuristics.py) |
3.2 Validation Rules¶
The @frozen decorator enforces immutability. Additional invariants are validated at construction time:
| Rule | Enforcement |
|---|---|
size >= 0 |
attrs validator; raises ValueError |
0.0 <= density <= 1.0 |
attrs validator; raises ValueError |
content_hash is 64-char lowercase hex |
attrs validator; raises ValueError |
input_type is one of the 8 Literal values |
typing.Literal check at type-check time; runtime guard in make_profile() |
4. Construction Algorithm¶
InputProfile is constructed by the make_profile() function. It is a pure function: no I/O, no clock, no random state, no capability invocation.
4.1 Signature¶
def make_profile(input_data: str | bytes) -> InputProfile:
"""Derive an InputProfile from raw input.
Args:
input_data: Raw input as str or bytes. If str, encoded as UTF-8
with errors="replace".
Returns:
A frozen InputProfile with 5 fields.
"""
4.2 Algorithm¶
function make_profile(input_data):
1. Normalize to bytes:
- If input_data is str → encode as UTF-8 with errors="replace"
- If input_data is bytes → use as-is
2. size = len(normalized_bytes)
3. content_hash = sha256(normalized_bytes).hexdigest()
4. is_empty = (size == 0) OR (normalized_bytes.strip() == b"")
5. input_type = classify(normalized_bytes) // see §5
6. density = compute_density(normalized_bytes, input_type) // see §6
7. return InputProfile(
input_type=input_type,
size=size,
content_hash=content_hash,
density=density,
is_empty=is_empty,
)
4.3 Determinism Guarantee¶
Every step is deterministic:
- UTF-8 encoding with
errors="replace"is deterministic (replacement characterU+FFFDencodes asb'\xef\xbf\xbd'). len()is deterministic.sha256is deterministic over the same bytes.classify()is a pure function over bytes (§5).compute_density()is a pure function over bytes and input_type (§6).
No step reads the clock, generates random numbers, or performs I/O.
5. Input Type Classification (classify)¶
The classify() function determines the coarse format of the input. Rules are evaluated in order; the first match wins.
5.1 Rules Table¶
| Priority | Condition | Result | Rationale |
|---|---|---|---|
| 1 | size == 0 |
"empty" |
No bytes to classify. |
| 2 | First 16 bytes start with %PDF- |
"pdf_text" |
Text extracted from a PDF (not raw PDF binary; raw PDF detection is V2). |
| 3 | First non-whitespace byte is { or [ AND json.loads() succeeds |
"json" |
Valid JSON object or array. |
| 4 | First 5 bytes are From (with trailing space) AND Subject: header present in first 4096 bytes |
"email" |
RFC 822 email format. |
| 5 | Contains <html or <!DOCTYPE html (case-insensitive, first 4096 bytes) |
"html" |
HTML document. |
| 6 | First non-whitespace line contains , more than 2 times |
"csv" |
Heuristic CSV detection (comma-separated values). |
| 7 | None of the above match | "text" |
Default: treat as plain text. |
| 8 | Classification fails (e.g., binary garbage causes decode errors in rule evaluation) | "unknown" |
Fallback for unclassifiable input. |
5.2 Notes¶
- Rule 2 (PDF): This detects text-extracted PDF content (e.g., output from
pdftotext). Raw PDF binary (%PDF-1.4followed by binary streams) is V2. In V1, thepdf_texttype signals that the input is already human-readable text extracted from a PDF. - Rule 3 (JSON):
json.loads()is called on the first 8192 bytes (or the full input if smaller) to avoid parsing a 100 MB JSON blob during classification. If parsing fails, the rule does not match. - Rule 4 (Email): The
Subject:header must appear in the first 4096 bytes to avoid scanning large inputs for email headers that may not exist. - Rule 5 (HTML): Case-insensitive match on the first 4096 bytes. This catches
<!DOCTYPE HTML>,<HTML>, and variations. - Rule 6 (CSV): A simple heuristic. A line with 3+ commas is likely CSV. This is intentionally coarse; false positives for comma-heavy text are acceptable (the Planner can still use other heuristics).
- Rule 8 (Unknown): Reached only when the input contains binary data that causes errors in the classification rules (e.g.,
UnicodeDecodeErrorduring HTML/JSON matching). Plain binary that passes all rules without matching falls through to"text".
6. Density Computation (compute_density)¶
Density is a heuristic signal for the Planner, not a hard constraint. It answers: "what fraction of the input is non-whitespace content?"
6.1 Formula¶
function compute_density(normalized_bytes, input_type):
if input_type in ("empty", "unknown"):
return 0.0
// Decode to text (UTF-8 with replacement)
text = normalized_bytes.decode("utf-8", errors="replace")
// Count non-whitespace characters
non_ws = len(text.replace(" ", "").replace("\t", "").replace("\n", "").replace("\r", ""))
// Density = non-whitespace chars / total bytes
return non_ws / len(normalized_bytes) if len(normalized_bytes) > 0 else 0.0
6.2 Per-Input-Type Behavior¶
input_type |
Density computation |
|---|---|
"text", "email", "html", "csv", "json", "pdf_text" |
len(non_whitespace_text) / size |
"empty" |
0.0 (hardcoded) |
"unknown" |
0.0 (hardcoded) |
6.3 Semantics¶
- Range:
0.0(all whitespace or binary) to1.0(zero whitespace). Typical text inputs have density0.7–0.9. - Heuristic purpose: The Planner uses density to short-circuit decisions like "is this a token-heavy LLM candidate?" A density near
0.0suggests the input is mostly whitespace or binary; a density near1.0suggests dense text. - Not a hard signal: Density does not determine capability selection directly. It is one input among many (alongside
input_type,size,is_empty) that the Planner's heuristic chain considers. - V2 note: Content-aware density (e.g., "how much of this HTML is actual content vs. tags?") is deferred to V2.
7. Worked Example¶
Input: b"Hello, World!\n"
Step-by-step trace¶
| Step | Operation | Result |
|---|---|---|
| 1 | Normalize to bytes | b"Hello, World!\n" (already bytes) |
| 2 | size = len(b"Hello, World!\n") |
14 |
| 3 | content_hash = sha256(b"Hello, World!\n").hexdigest() |
"d9014c4624844aa5bac314773d6b689ad467fa4e1d1a50a1b8a99d5a95f72ff5" (example; actual hash computed at impl time) |
| 4 | is_empty = (14 == 0) or (b"Hello, World!\n".strip() == b"") |
False |
| 5 | classify(b"Hello, World!\n") |
"text" (no JSON markers, no HTML, no CSV commas, no email headers, no PDF prefix) |
| 6 | compute_density(b"Hello, World!\n", "text") |
Non-whitespace: Hello,World! = 12 chars. 12 / 14 = 0.857 |
Result¶
InputProfile(
input_type="text",
size=14,
content_hash="d9014c4624844aa5bac314773d6b689ad467fa4e1d1a50a1b8a99d5a95f72ff5",
density=0.857,
is_empty=False,
)
8. Edge Cases¶
EC1: Empty input¶
Input: b""
| Field | Value | Reasoning |
|---|---|---|
input_type |
"empty" |
Rule 1: size == 0 |
size |
0 |
len(b"") |
content_hash |
"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" |
SHA-256 of empty bytes |
density |
0.0 |
Hardcoded for "empty" type |
is_empty |
True |
size == 0 |
Planner behavior: Short-circuits to UNRESOLVED for all fields.
EC2: Whitespace-only input¶
Input: b" \n\t "
| Field | Value | Reasoning |
|---|---|---|
input_type |
"text" |
No classification rule matches (not empty, no JSON/HTML/CSV/email/PDF markers) |
size |
7 |
len(b" \n\t ") |
content_hash |
SHA-256 of b" \n\t " |
Deterministic |
density |
0.0 |
0 non-whitespace chars / 7 = 0.0 |
is_empty |
True |
normalized_bytes.strip() == b"" |
Planner behavior: Same as EC1 — short-circuits to UNRESOLVED.
EC3: Invalid UTF-8¶
Input: b"\xff\xfe\x00\x01" (raw binary, not valid UTF-8)
| Field | Value | Reasoning |
|---|---|---|
input_type |
"unknown" |
Classification rules fail on binary input; fallback to "unknown" |
size |
4 |
len(b"\xff\xfe\x00\x01") |
content_hash |
SHA-256 of b"\xff\xfe\x00\x01" |
Deterministic |
density |
0.0 |
Hardcoded for "unknown" type |
is_empty |
False |
size > 0 and strip() != b"" |
Planner behavior: Treats as unclassifiable. May skip capabilities that require text input.
EC4: Very large input¶
Input: 10 MB of text, or 100 MB of text.
| Aspect | Behavior |
|---|---|
| Algorithm complexity | O(input_size). No special handling. |
| 10 MB text | ~50 ms in CPython (SHA-256 dominates). |
| 100 MB text | ~500 ms in CPython. |
| Budget interaction | The Planner may reject large inputs via Budget.max_input_size_bytes (if configured). InputProfile construction itself has no size limit. |
| Memory | The full input is held in memory during construction. V2 may add streaming/chunked profiling. |
No special handling in V1. The caller is responsible for size gating before calling paxman.normalize().
EC5: str input with non-UTF-8 characters¶
Input: "Hello \ud800 World" (lone surrogate, invalid in UTF-8)
| Field | Value | Reasoning |
|---|---|---|
| Encoding | b"Hello \xef\xbf\xbd World" |
UTF-8 with errors="replace" replaces lone surrogate with U+FFFD (encoded as \xef\xbf\xbd) |
content_hash |
SHA-256 of the replacement-encoded bytes | Deterministic per the replacement policy |
| All other fields | Computed normally from the replacement-encoded bytes |
Important: The content_hash of a str input may differ from the hash of the "same" string encoded differently. This is by design — the hash reflects the actual bytes that were profiled.
EC6: Mismatched str/bytes input¶
Input: A dict or list (already parsed JSON).
Behavior: make_profile() accepts str | bytes only. If the caller passes a dict or list, it is the caller's responsibility to serialize it first (e.g., json.dumps(data).encode("utf-8")). The api.normalize() function handles this serialization before calling make_profile().
Rationale: InputProfile operates on bytes. It does not know about structured data. The API layer is the serialization boundary.
9. Out of Scope (V1)¶
The following features are explicitly deferred to V2:
| Feature | Reason | V2 Ticket |
|---|---|---|
| Content classification (financial doc, legal doc, invoice, etc.) | Requires capability invocation or ML model. Violates "no capability invocation" invariant. | — |
| Language detection | Requires a language model or heuristic dictionary. Out of scope for a lightweight classifier. | — |
| OCR confidence / image density | V1 supports text/plain and text/html only. Image/PDF binary inputs are V2. |
— |
| Charset detection (chardet/charset-normalizer) | V1 assumes UTF-8 with replacement fallback. Charset detection adds a dependency and latency. | — |
Caching across paxman.normalize() calls |
The Planner builds InputProfile once per call. Cross-call caching is the caller's job via content_hash. |
— |
| Streaming / chunked input profiling | V1 requires the full input in memory. Chunked profiling is V2 for large-file use cases. | — |
10. See Also¶
- ARCHITECTURE.md §4.2 — Planner subsystem design, heuristic ordering, InputProfile's role in the planning pipeline.
- PACKAGE_STRUCTURE.md §4.2 — Module layout for
planner/, includinginput_profile.py. - ADR-0002: Rule-Based Planner for V1 — Why the Planner is a pure function; InputProfile's purity requirement.
- REPLAY_AND_DETERMINISM.md — How
content_hashfeeds into the replay hash and cache key.