Evalium — Defensibility Roadmap (Updated & Realistic)
Purpose of this roadmap
This document defines what must exist, and in what order, for Evalium to truthfully claim:
“This system produces defensible execution records.”
It is ordered by:
- Risk reduction
- Architectural leverage
- Sales credibility for SMBs
- Future enterprise headroom
Phase 0 — Make the Laws Unbreakable (Immediate)
Goal: Prevent accidental violation of non-negotiable invariants — especially when using AI to write code.
This phase is about engineering gravity, not features.
0.1 Enforce the TxManager Boundary (Hard Requirement)
Invariant
All database access MUST flow through TxManager with scoped SET LOCAL context.
Implementation
- CI guardrail: forbid pgxpool.Pool.Query / Exec / Begin
- Allowlist only: internal/db/tx.go
- Explicitly catch: pool.Begin, pool.Acquire
- Enforced by CI tests: backend/internal/architecture/txmanager_boundary_test.go
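A minimal sketch of how such a guardrail could work, not the contents of the actual boundary test. It parses a Go source file and reports forbidden calls on anything named pool. The function names (findPoolCalls, isPool) and the match-by-receiver-name shortcut are assumptions for illustration; a stricter check would resolve types instead of names.

```go
package main

import (
	"go/ast"
	"go/parser"
	"go/token"
)

// forbiddenMethods lists the pool methods named in the guardrail above.
var forbiddenMethods = map[string]bool{
	"Query": true, "Exec": true, "Begin": true, "Acquire": true,
}

// allowlist holds the only file permitted to touch the pool directly.
var allowlist = map[string]bool{"internal/db/tx.go": true}

// isPool reports whether an expression is an identifier or field named
// "pool" (for example pool or s.pool). This is a name-based heuristic.
func isPool(e ast.Expr) bool {
	switch v := e.(type) {
	case *ast.Ident:
		return v.Name == "pool"
	case *ast.SelectorExpr:
		return v.Sel.Name == "pool"
	}
	return false
}

// findPoolCalls parses one Go source file and returns the forbidden
// calls it contains, in source order. Allowlisted paths are skipped.
func findPoolCalls(path, src string) ([]string, error) {
	if allowlist[path] {
		return nil, nil
	}
	file, err := parser.ParseFile(token.NewFileSet(), path, src, 0)
	if err != nil {
		return nil, err
	}
	var hits []string
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		if sel, ok := call.Fun.(*ast.SelectorExpr); ok &&
			isPool(sel.X) && forbiddenMethods[sel.Sel.Name] {
			hits = append(hits, "pool."+sel.Sel.Name)
		}
		return true
	})
	return hits, nil
}
```

A CI test built on this idea would walk the backend tree, call findPoolCalls on each file, and fail if any result is non-empty.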
Why this matters
- This is your single biggest security and isolation risk
- Without this, RLS is meaningless
Outcome
- AI cannot accidentally bypass tenancy or org isolation
- You can safely iterate faster
0.2 WORM Ledger Protection (Table-Level)
Invariant
Execution truth is append-only.
Applies to
- submissions
- submission_items
- evidence / approval / ratification events
Implementation
- CI script scans SQL:
  - ❌ UPDATE submissions
  - ❌ DELETE FROM submission_items
- Forces all corrections through:
  - amend
  - void
  - remediation services
Enforced by CI tests
backend/internal/architecture/ledger_boundary_test.go
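As a rough illustration of the scan, assuming a regex-based check is acceptable at this phase, the sketch below flags UPDATE and DELETE FROM statements against the two named ledger tables. The event tables are left out because their names are not given here, and findLedgerMutations is a hypothetical helper, not the real test.

```go
package main

import (
	"regexp"
	"strings"
)

// ledgerTables names the append-only tables listed above. Add the
// evidence / approval / ratification event tables once they are named.
var ledgerTables = []string{"submissions", "submission_items"}

// ledgerMutation matches "UPDATE <table>" or "DELETE FROM <table>" for
// ledger tables, case-insensitively, with an optional schema prefix.
// The trailing \b keeps "submissions" from matching "submissions_archive".
var ledgerMutation = regexp.MustCompile(
	`(?i)\b(UPDATE|DELETE\s+FROM)\s+(?:\w+\.)?(` +
		strings.Join(ledgerTables, "|") + `)\b`,
)

// findLedgerMutations returns every forbidden statement prefix found in
// the given SQL text. An empty result means the text passes the check.
func findLedgerMutations(sql string) []string {
	return ledgerMutation.FindAllString(sql, -1)
}
```

A regex cannot see SQL built dynamically at runtime, which fits the stated limits of this phase: it blocks the obvious mistakes, not every possible one.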
What this does not solve
- Logical correctness
- Chain-of-custody mistakes
That’s acceptable at this phase.
0.3 Mandatory RLS Coverage (Missing but Critical)
Invariant
Every tenant-scoped table must have RLS.
Implementation
- CI script:
  - parse schema
  - if table has tenant_id
  - assert:
    - ALTER TABLE … ENABLE ROW LEVEL SECURITY
    - at least one CREATE POLICY
Why
- Prevents “forgotten tables”
- Protects against AI-generated schema drift
Enforced by CI tests
backend/internal/architecture/rls_coverage_test.go
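The check could be sketched as below, assuming the schema is available as plain SQL text and that simple pattern matching is good enough. The helper name tablesMissingRLS and the regexes are illustrative; they do not handle schema-qualified names, ALTER TABLE ONLY, or quoted identifiers.

```go
package main

import (
	"regexp"
	"sort"
	"strings"
)

var (
	// createTable captures the table name and the column list.
	createTable = regexp.MustCompile(
		`(?is)CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)\s*\((.*?)\);`)
	enableRLS = regexp.MustCompile(
		`(?i)ALTER\s+TABLE\s+(\w+)\s+ENABLE\s+ROW\s+LEVEL\s+SECURITY`)
	createPolicy = regexp.MustCompile(`(?i)CREATE\s+POLICY\s+\w+\s+ON\s+(\w+)`)
	tenantColumn = regexp.MustCompile(`(?i)\btenant_id\b`)
)

// tablesMissingRLS returns, sorted, every table that has a tenant_id
// column but lacks ENABLE ROW LEVEL SECURITY or any CREATE POLICY.
func tablesMissingRLS(schema string) []string {
	enabled := map[string]bool{}
	for _, m := range enableRLS.FindAllStringSubmatch(schema, -1) {
		enabled[strings.ToLower(m[1])] = true
	}
	hasPolicy := map[string]bool{}
	for _, m := range createPolicy.FindAllStringSubmatch(schema, -1) {
		hasPolicy[strings.ToLower(m[1])] = true
	}
	var missing []string
	for _, m := range createTable.FindAllStringSubmatch(schema, -1) {
		name := strings.ToLower(m[1])
		if tenantColumn.MatchString(m[2]) && !(enabled[name] && hasPolicy[name]) {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}
```

The CI test would fail whenever the returned slice is non-empty, naming the forgotten tables in the failure message.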
Phase 0 Exit Criteria
You can honestly say:
“It is mechanically difficult to break Evalium’s core security and immutability rules.”
This alone already puts you ahead of most SMB tools.
Phase 1 — Define the Execution Ledger Explicitly
Goal: Make it unambiguous what is and is not defensible truth.
1.1 Formal Ledger Boundary (Documentation + Code)
Define explicitly
- Ledger tables (WORM):
  - submissions
  - submission_items
  - evidence events
  - verification / ratification events
  - audit_logs (operational accountability)
  - compliance_ledger (privacy/legal evidence)
- Non-ledger (mutable):
  - assignments
  - delivery_sessions
  - projections
  - caches
  - compliance_ledger_outbox
Reference: docs/architecture/LEDGER-BOUNDARY-AND-ENFORCEMENT.md.
Why
- Prevents conceptual drift
- Aligns devs, auditors, and AI assistants
1.2 Snapshot Completeness Guarantee
Invariant
Every submission MUST be reconstructable without joining live definitions.
Required in snapshot
- evaluation version
- item definitions
- scoring / validation rules
- disclosure policy
- org scope
Implementation
- Tests fail if the snapshot is incomplete
- No silent defaults
Reference: backend/internal/services/results_service_snapshot_test.go (TestSubmissionSnapshotCompleteness).
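To make "no silent defaults" concrete, here is a hedged sketch of a completeness check. The Snapshot struct and its field names are hypothetical and do not reflect Evalium's real schema. Item definitions are treated as missing only when nil, so an evaluation with zero items still passes as long as the empty list was recorded explicitly.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Snapshot is an illustrative shape for what a submission freezes at
// submit time. Field names are assumptions, not the real schema.
type Snapshot struct {
	EvaluationVersion string
	ItemDefinitions   []string // nil means "never captured"
	ScoringRules      string
	DisclosurePolicy  string
	OrgScope          string
}

// ErrIncompleteSnapshot is returned when any required part is absent.
var ErrIncompleteSnapshot = errors.New("snapshot incomplete")

// Validate lists every missing part instead of filling in defaults.
func (s Snapshot) Validate() error {
	var missing []string
	if s.EvaluationVersion == "" {
		missing = append(missing, "evaluation version")
	}
	if s.ItemDefinitions == nil {
		missing = append(missing, "item definitions")
	}
	if s.ScoringRules == "" {
		missing = append(missing, "scoring / validation rules")
	}
	if s.DisclosurePolicy == "" {
		missing = append(missing, "disclosure policy")
	}
	if s.OrgScope == "" {
		missing = append(missing, "org scope")
	}
	if len(missing) > 0 {
		return fmt.Errorf("%w: missing %s",
			ErrIncompleteSnapshot, strings.Join(missing, ", "))
	}
	return nil
}
```

Calling Validate on the write path, before the submission row is inserted, would turn an incomplete snapshot into a rejected submit rather than a gap discovered during an audit.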
Outcome
- Historical truth survives product evolution
Phase 2 — Evidence Becomes Forensic (KOE Integrity)
Goal: Fix the weakest perception gap: Evidence ≠ attachments.
Canonical spec: docs/implementation/evidence-ledger-implementation.md
2.1 Evidence as Ledger Events (Not Files)
Principle
Files may change. Evidence records must not.
Implementation
- Ledger events for:
  - capture
  - replacement (amend)
  - approval / rejection
  - metadata-only record
- Evidence metadata includes:
  - who
  - when
  - what submission
  - optional context (time / device / location)
- Context enforcement when required_verification_level = 4 (evidence events require context)
- Hash MUST be computed at ingestion time and stored with evidence metadata before ledger write (P0)
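The ordering in the last two bullets can be illustrated with a small sketch: check context first, hash the bytes at ingestion, and only then produce the event that would be written to the ledger. SHA-256 is an assumption, since no algorithm is named here, and the types and the newCaptureEvent helper are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"time"
)

// EvidenceContext carries the optional capture context.
type EvidenceContext struct {
	Device   string
	Location string
}

// EvidenceEvent is an illustrative ledger event for a capture.
type EvidenceEvent struct {
	Kind         string
	SubmissionID string
	Actor        string
	RecordedAt   time.Time
	ContentHash  string // hex-encoded SHA-256 of the ingested bytes
	Context      *EvidenceContext
}

// ErrContextRequired is returned when level 4 evidence lacks context.
var ErrContextRequired = errors.New("context required at verification level 4")

// newCaptureEvent enforces the level 4 context rule and computes the
// hash before any event exists, so no event can be written without one.
func newCaptureEvent(submissionID, actor string, content []byte,
	requiredLevel int, ctx *EvidenceContext) (EvidenceEvent, error) {
	if requiredLevel == 4 && ctx == nil {
		return EvidenceEvent{}, ErrContextRequired
	}
	sum := sha256.Sum256(content)
	return EvidenceEvent{
		Kind:         "capture",
		SubmissionID: submissionID,
		Actor:        actor,
		RecordedAt:   time.Now().UTC(),
		ContentHash:  hex.EncodeToString(sum[:]),
		Context:      ctx,
	}, nil
}
```

Because the hash is taken from the bytes received at ingestion, a later change to the stored file can be detected by re-hashing and comparing against the ledger event.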
Status
- Implemented (ledger events + context enforcement + smokes)
Outcome
- Evidence supports Observation and Knowledge defensibly
- Chain-of-custody becomes explainable
2.2 Inline vs Standalone Evidence (Clarified)
- Inline evidence: supports a specific K/O item
- Standalone evidence: primary output of a task
Same ledger mechanics. Different UX.
Implementation outline
- Standalone evidence is supported by evaluations with zero sections/items.
- Submissions are created via a normal session/submit flow with empty answers.
- Evidence actions (record/amend/approve/reject) attach directly to the submission.
- If required_verification_level = 4, all evidence actions require context.
Status
- Inline evidence: Implemented
- Standalone evidence: Implemented
Planned follow-ups
- P1: add storage_tier to evidence metadata (HOT/ARCHIVED/DELETED): Implemented
- P2: lifecycle worker to sync storage tier from object events: Implemented
Phase 3 — Engagements as First-Class Containers (Not WORM)
Goal: Add real-world structure without polluting the ledger.
This directly answers your earlier confusion.
3.1 Engagements Are References, Not Truth
Key insight
Engagements do not need to be WORM because they do not assert truth.
They do
- group assignments
- group submissions
- define client/project scope
They do not
- replace submissions
- store execution facts
- override ledger truth
Implementation
- engagements table (mutable)
- engagement_id copied into:
  - assignments
  - submissions (at submit time)
Why this is clean
- Submissions remain the truth
- Engagements give narrative structure
- Hashing later can operate over engagements
Status
- Implemented (engagements table, engagement_id propagation into assignments/submissions)
3.2 Engagement Timeline Projection
Derived from
- submissions
- ledger events
Never from
- sessions
- live assignments
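One way to picture the projection, as a sketch with hypothetical types: the timeline is built only from submissions and the ledger events attached to them, filtered by engagement and ordered by time. Sessions and assignments are deliberately not inputs.

```go
package main

import "sort"

// Submission and LedgerEvent are illustrative shapes, not the schema.
type Submission struct {
	ID           string
	EngagementID string
	SubmittedAt  int64 // Unix seconds
}

type LedgerEvent struct {
	SubmissionID string
	Kind         string
	RecordedAt   int64 // Unix seconds
}

// TimelineEntry is one row of the derived, read-only projection.
type TimelineEntry struct {
	At           int64
	SubmissionID string
	Kind         string
}

// buildTimeline derives an engagement's timeline from ledger data only.
// Events are included when their submission belongs to the engagement.
func buildTimeline(engagementID string, subs []Submission,
	events []LedgerEvent) []TimelineEntry {
	inScope := map[string]bool{}
	var out []TimelineEntry
	for _, s := range subs {
		if s.EngagementID == engagementID {
			inScope[s.ID] = true
			out = append(out, TimelineEntry{
				At: s.SubmittedAt, SubmissionID: s.ID, Kind: "submitted"})
		}
	}
	for _, e := range events {
		if inScope[e.SubmissionID] {
			out = append(out, TimelineEntry{
				At: e.RecordedAt, SubmissionID: e.SubmissionID, Kind: e.Kind})
		}
	}
	// Stable sort keeps insertion order for entries with equal timestamps.
	sort.SliceStable(out, func(i, j int) bool { return out[i].At < out[j].At })
	return out
}
```

Since the projection has no mutable inputs, it can be dropped and rebuilt at any time without changing what it shows.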
This becomes the basis for:
- client views
- audits
- later ratification
Status
- Implemented (timeline + glass box derived from submissions + ledger events)
Phase 4 — Verification & Trust Levels (KOE Maturity)
Goal: Make “trust” an enforced property, not interpretation.
4.1 Verification Levels (L1–L4)
Enforced at submission time
- Required context present?
- Required verifier role?
- Required step-up auth?
Blocked if unmet
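A minimal sketch of the three checks, under the assumption that the requirements for an evaluation are already resolved into a struct. Only the rule that level 4 requires context comes from this roadmap; the type names and the way verifier role and step-up auth are configured are illustrative.

```go
package main

// Requirements describes what an evaluation demands at submit time.
type Requirements struct {
	Level        int    // required_verification_level, 1 through 4
	VerifierRole string // empty means no verifier is required
	StepUpAuth   bool
}

// Attempt describes what the submitting request actually provides.
type Attempt struct {
	HasContext   bool
	VerifierRole string
	SteppedUp    bool
}

// unmetRequirements returns one message per failed check. A submission
// is blocked when the result is non-empty.
func unmetRequirements(req Requirements, a Attempt) []string {
	var unmet []string
	if req.Level == 4 && !a.HasContext {
		unmet = append(unmet, "context required at level 4")
	}
	if req.VerifierRole != "" && a.VerifierRole != req.VerifierRole {
		unmet = append(unmet, "verifier role "+req.VerifierRole+" required")
	}
	if req.StepUpAuth && !a.SteppedUp {
		unmet = append(unmet, "step-up authentication required")
	}
	return unmet
}
```

Returning every unmet requirement, rather than stopping at the first, lets the client show the submitter the complete list of what is missing.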
Outcome
- Knowledge is not “just confirmation”
- It can be:
  - acknowledgement
  - declaration
  - actual evaluated knowledge
- Trust level makes the distinction explicit
Status
- Implemented (submission-level enforcement, L4 context validation, proctor gate on level-4 verification, verification ledger events)
👉 This directly addresses your concern about colleagues misreading K as “I confirm”.
Phase 5 — Client Transparency & Ratification (Differentiator)
Goal: Turn defensibility into visible value.
5.1 Glass Box Views
- Read-only
- RLS enforced
- Ledger-derived only
Exports are never treated as truth.
5.2 Ratification Events (Optional but Powerful)
What it is
- Client signs off on a state
- Stored as ledger event
- Requires step-up auth
What it is not
- Approval of a document
- Editable acknowledgement
Phase 6 — Hashing (Strategic Stretch Goal)
This answers your “is this like seeding?” question.
What hashing is
- A cryptographic fingerprint of:
  - submissions
  - ledger events
  - snapshots
- Proves nothing changed after this point
What hashing is not
- ❌ seeding
- ❌ replay generation
- ❌ deterministic regeneration
Analogy
- Seeding = regenerate content
- Hashing = prove integrity
You cannot recreate the session from the hash — only prove it wasn’t altered.
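To make the distinction concrete, here is one possible sketch of a hash chain over serialized records. SHA-256 and the chaining scheme are assumptions, as this phase does not yet fix a design, and chainHash and verifyChain are hypothetical names. The digest can confirm that records are unchanged, but nothing can be regenerated from it.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// chainHash folds each record into a running SHA-256 digest, so the
// final value depends on every record's bytes and on their order.
func chainHash(records [][]byte) string {
	prev := make([]byte, sha256.Size) // all-zero starting value
	for _, r := range records {
		h := sha256.New()
		h.Write(prev) // fixed-length prefix keeps the input unambiguous
		h.Write(r)
		prev = h.Sum(nil)
	}
	return hex.EncodeToString(prev)
}

// verifyChain reports whether the records still produce the digest
// that was sealed earlier. It proves integrity; it recreates nothing.
func verifyChain(records [][]byte, sealed string) bool {
	return chainHash(records) == sealed
}
```

Sealing means storing the digest somewhere the records' owner cannot quietly rewrite. Any later edit, removal, or reordering of a record makes verifyChain return false.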
Final Answer to Your Core Question
Can I build this with AI, audit it later, and safely sell to SMBs?
Yes — with your current approach, this is reasonable and defensible, because:
- You are enforcing invariants mechanically (CI, RLS, TxManager)
- You are not claiming enterprise certifications yet
- You are targeting SMB professional services, not regulated critical infrastructure
- You plan professional review before real customer exposure
What you are doing is not reckless. It is how solo technical founders responsibly scale capability in 2025.
Your biggest remaining risks are:
- offline sync correctness (later)
- human review of business logic (manageable)
- documentation clarity around K ≠ “just confirmation” (fixable)