
Evalium — Defensibility Roadmap (Updated & Realistic)

Purpose of this roadmap

This document defines what must exist, and in what order, for Evalium to truthfully claim:

“This system produces defensible execution records.”

It is ordered by:

  1. Risk reduction
  2. Architectural leverage
  3. Sales credibility for SMBs
  4. Future enterprise headroom

Phase 0 — Make the Laws Unbreakable (Immediate)

Goal: Prevent accidental violation of non-negotiable invariants — especially when using AI to write code.

This phase is about engineering gravity, not features.


0.1 Enforce the TxManager Boundary (Hard Requirement)

Invariant

All database access MUST flow through TxManager with scoped SET LOCAL context.

Implementation

  • CI guardrail: forbid pgxpool.Pool.Query / Exec / Begin

  • Allowlist only:

    • internal/db/tx.go
  • Explicitly catch:

    • pool.Begin
    • pool.Acquire
  • Enforced by CI tests:

    • backend/internal/architecture/txmanager_boundary_test.go
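
A minimal sketch of how such a guardrail could work, assuming the check walks Go source with the standard go/ast package. This is not the content of txmanager_boundary_test.go; the function names (findViolations, receiverName) are hypothetical, and matching on a receiver named "pool" is a name-based heuristic because no type information is used.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// forbidden lists the pool methods named in this section.
var forbidden = map[string]bool{"Query": true, "Exec": true, "Begin": true, "Acquire": true}

// receiverName returns the trailing identifier of a receiver expression,
// e.g. "pool" for both `pool` and `s.pool`.
func receiverName(e ast.Expr) string {
	switch x := e.(type) {
	case *ast.Ident:
		return x.Name
	case *ast.SelectorExpr:
		return x.Sel.Name
	}
	return ""
}

// findViolations parses one Go file and reports forbidden pool calls.
// Only internal/db/tx.go is allowlisted, as stated above.
func findViolations(filename, src string) ([]string, error) {
	if strings.HasSuffix(filename, "internal/db/tx.go") {
		return nil, nil
	}
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}
	var out []string
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		// Heuristic (assumption): the receiver is a variable or field called "pool".
		if forbidden[sel.Sel.Name] && strings.EqualFold(receiverName(sel.X), "pool") {
			pos := fset.Position(call.Pos())
			out = append(out, fmt.Sprintf("%s:%d: pool.%s", pos.Filename, pos.Line, sel.Sel.Name))
		}
		return true
	})
	return out, nil
}

func main() {
	src := "package svc\nfunc f(s *S) { s.pool.Exec(ctx, q); tx.Exec(ctx, q) }\n"
	v, err := findViolations("internal/services/svc.go", src)
	fmt.Println(v, err)
}
```

A type-aware check (for example via go/types) would be stricter, but a name-based scan is cheap and usually enough to catch accidental bypasses in review-sized diffs.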

Why this matters

  • This is your single biggest security and isolation risk
  • Without this, RLS is meaningless: a query issued outside TxManager runs without the scoped SET LOCAL context the invariant requires

Outcome

  • AI cannot accidentally bypass tenancy or org isolation
  • You can safely iterate faster

0.2 WORM Ledger Protection (Table-Level)

Invariant

Execution truth is append-only.

Applies to

  • submissions
  • submission_items
  • evidence / approval / ratification events

Implementation

  • CI script scans SQL and fails on:

    • UPDATE submissions
    • DELETE FROM submission_items
  • Forces all corrections through:

    • amend
    • void
    • remediation services

Enforced by CI tests

  • backend/internal/architecture/ledger_boundary_test.go
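
The scan described above could be sketched roughly as follows, assuming a regular-expression pass over SQL text. The function name findLedgerMutations is hypothetical, only the two table names given in this section are covered (event table names are not specified here), and the pattern does not skip SQL comments or string literals.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// mutation matches UPDATE / DELETE FROM against the two ledger tables named
// in this section, with an optional ONLY keyword and schema prefix.
var mutation = regexp.MustCompile(
	`(?i)\b(UPDATE|DELETE\s+FROM)\s+(?:ONLY\s+)?(?:\w+\.)?(submissions|submission_items)\b`)

// findLedgerMutations returns one normalized entry per forbidden statement,
// e.g. "UPDATE submissions".
func findLedgerMutations(sql string) []string {
	var out []string
	for _, m := range mutation.FindAllStringSubmatch(sql, -1) {
		verb := strings.ToUpper(strings.Join(strings.Fields(m[1]), " "))
		out = append(out, verb+" "+strings.ToLower(m[2]))
	}
	return out
}

func main() {
	sql := "INSERT INTO submissions VALUES (1); update public.submissions set score = 2;"
	fmt.Println(findLedgerMutations(sql))
}
```

INSERT statements pass untouched, which is the point: appends are allowed, and corrections have to arrive as new amend or void records rather than in-place edits.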

What this does not solve

  • Logical correctness
  • Chain-of-custody mistakes

That’s acceptable at this phase.


0.3 Mandatory RLS Coverage (Missing but Critical)

Invariant

Every tenant-scoped table must have RLS.

Implementation

  • CI script:

    • parse schema
    • for each table that has tenant_id, assert:

      • ALTER TABLE … ENABLE ROW LEVEL SECURITY
      • at least one CREATE POLICY

Why

  • Prevents “forgotten tables”
  • Protects against AI-generated schema drift

Enforced by CI tests

  • backend/internal/architecture/rls_coverage_test.go
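
As an illustration of the coverage check, here is a minimal sketch that assumes the schema is available as plain SQL text with unqualified table names and statements ending in ");". The function name missingRLS and the regular expressions are hypothetical; a real implementation might query pg_catalog instead of parsing text.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

var (
	createTable  = regexp.MustCompile(`(?is)CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)\s*\((.*?)\);`)
	tenantCol    = regexp.MustCompile(`(?i)\btenant_id\b`)
	enableRLS    = regexp.MustCompile(`(?i)ALTER\s+TABLE\s+(\w+)\s+ENABLE\s+ROW\s+LEVEL\s+SECURITY`)
	createPolicy = regexp.MustCompile(`(?i)CREATE\s+POLICY\s+\w+\s+ON\s+(\w+)`)
)

// tableSet collects the lower-cased table names captured by re.
func tableSet(re *regexp.Regexp, schema string) map[string]bool {
	set := map[string]bool{}
	for _, m := range re.FindAllStringSubmatch(schema, -1) {
		set[strings.ToLower(m[1])] = true
	}
	return set
}

// missingRLS returns tenant-scoped tables that lack ENABLE ROW LEVEL SECURITY
// or have no CREATE POLICY, sorted by name.
func missingRLS(schema string) []string {
	enabled := tableSet(enableRLS, schema)
	policies := tableSet(createPolicy, schema)
	var missing []string
	for _, m := range createTable.FindAllStringSubmatch(schema, -1) {
		name := strings.ToLower(m[1])
		if !tenantCol.MatchString(m[2]) {
			continue // not tenant-scoped
		}
		if !enabled[name] || !policies[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	schema := "CREATE TABLE notes (id uuid, tenant_id uuid);\n" +
		"ALTER TABLE notes ENABLE ROW LEVEL SECURITY;\n"
	fmt.Println(missingRLS(schema))
}
```

In this example the notes table has RLS enabled but no policy, so it would still be reported; both conditions must hold for a table to pass.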

Phase 0 Exit Criteria

You can honestly say:

“It is mechanically difficult to break Evalium’s core security and immutability rules.”

This alone already puts you ahead of most SMB tools.


Phase 1 — Define the Execution Ledger Explicitly

Goal: Make it unambiguous what is and is not defensible truth.


1.1 Formal Ledger Boundary (Documentation + Code)

Define explicitly

  • Ledger tables (WORM):

    • submissions
    • submission_items
    • evidence events
    • verification / ratification events
    • audit_logs (operational accountability)
    • compliance_ledger (privacy/legal evidence)
  • Non-ledger (mutable):

    • assignments
    • delivery_sessions
    • projections
    • caches
    • compliance_ledger_outbox

Reference: docs/architecture/LEDGER-BOUNDARY-AND-ENFORCEMENT.md

Why

  • Prevents conceptual drift
  • Aligns devs, auditors, and AI assistants

1.2 Snapshot Completeness Guarantee

Invariant

Every submission MUST be reconstructable without joining live definitions.

Required in snapshot

  • evaluation version
  • item definitions
  • scoring / validation rules
  • disclosure policy
  • org scope

Implementation

  • Failing tests if snapshot incomplete
  • No silent defaults

Reference: backend/internal/services/results_service_snapshot_test.go (TestSubmissionSnapshotCompleteness)
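
To make "no silent defaults" concrete, here is a minimal sketch of a completeness check. The Snapshot struct and its field names are hypothetical and are not taken from the referenced test. One assumption worth noting: nil means "not captured", while an empty slice or map means "captured, and there were none", so evaluations with zero items would still pass.

```go
package main

import "fmt"

// Snapshot is an illustrative shape covering the required fields listed above.
type Snapshot struct {
	EvaluationVersion string
	ItemDefinitions   []string          // nil = not captured; empty = captured, no items
	ScoringRules      map[string]string // nil = not captured
	DisclosurePolicy  string
	OrgScope          string
}

// Missing lists every required field that was not captured. It never fills
// in a default: the caller is expected to fail the write if the result is non-empty.
func (s Snapshot) Missing() []string {
	var missing []string
	if s.EvaluationVersion == "" {
		missing = append(missing, "evaluation version")
	}
	if s.ItemDefinitions == nil {
		missing = append(missing, "item definitions")
	}
	if s.ScoringRules == nil {
		missing = append(missing, "scoring / validation rules")
	}
	if s.DisclosurePolicy == "" {
		missing = append(missing, "disclosure policy")
	}
	if s.OrgScope == "" {
		missing = append(missing, "org scope")
	}
	return missing
}

func main() {
	fmt.Println(Snapshot{EvaluationVersion: "v3"}.Missing())
}
```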

Outcome

  • Historical truth survives product evolution

Phase 2 — Evidence Becomes Forensic (KOE Integrity)

Goal: Fix the weakest perception gap: Evidence ≠ attachments.

Canonical spec: docs/implementation/evidence-ledger-implementation.md


2.1 Evidence as Ledger Events (Not Files)

Principle

Files may change. Evidence records must not.

Implementation

  • Ledger events for:

    • capture
    • replacement (amend)
    • approval / rejection
    • metadata-only record
  • Evidence metadata includes:

    • who
    • when
    • what submission
    • optional context (time / device / location)
  • Context enforcement when required_verification_level = 4 (evidence events require context)

  • Hash MUST be computed at ingestion time and stored with evidence metadata before ledger write (P0)
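
The hash-before-write rule and the level-4 context rule could be sketched together like this. The EvidenceEvent struct, the newCaptureEvent function, and the choice of SHA-256 are assumptions for illustration; the canonical spec may define a different event shape or digest.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"time"
)

// EvidenceEvent is a hypothetical ledger record: who, when, what submission,
// the content hash, and optional context.
type EvidenceEvent struct {
	Kind         string // e.g. "capture", "amend", "approve", "reject"
	SubmissionID string
	ActorID      string
	RecordedAt   time.Time
	SHA256       string
	Context      map[string]string // time / device / location, when supplied
}

// newCaptureEvent hashes the bytes at ingestion and only then builds the
// event, so a ledger write can never happen without a stored hash.
func newCaptureEvent(submissionID, actorID string, body []byte, requiredLevel int, evCtx map[string]string) (EvidenceEvent, error) {
	if requiredLevel == 4 && len(evCtx) == 0 {
		return EvidenceEvent{}, errors.New("verification level 4 requires evidence context")
	}
	sum := sha256.Sum256(body) // SHA-256 is an assumption; the document names no algorithm
	return EvidenceEvent{
		Kind:         "capture",
		SubmissionID: submissionID,
		ActorID:      actorID,
		RecordedAt:   time.Now().UTC(),
		SHA256:       hex.EncodeToString(sum[:]),
		Context:      evCtx,
	}, nil
}

func main() {
	ev, err := newCaptureEvent("sub-1", "user-1", []byte("photo bytes"), 4,
		map[string]string{"device": "tablet-7"})
	fmt.Println(ev.SHA256, err)
}
```

Because the hash lives in the event rather than on the file, the stored object can later be moved or archived while the record of what was captured stays fixed.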

Status

  • Implemented (ledger events + context enforcement + smokes)

Outcome

  • Evidence supports Observation and Knowledge defensibly
  • Chain-of-custody becomes explainable

2.2 Inline vs Standalone Evidence (Clarified)

  • Inline evidence: supports a specific K/O item
  • Standalone evidence: primary output of a task

Same ledger mechanics. Different UX.

Implementation outline

  • Standalone evidence is supported by evaluations with zero sections/items.
  • Submissions are created via a normal session/submit flow with empty answers.
  • Evidence actions (record/amend/approve/reject) attach directly to the submission.
  • If required_verification_level = 4, all evidence actions require context.

Status

  • Inline evidence: Implemented
  • Standalone evidence: Implemented

Planned follow-ups

  • P1: add storage_tier to evidence metadata (HOT/ARCHIVED/DELETED) — Implemented
  • P2: lifecycle worker to sync storage tier from object events — Implemented

Phase 3 — Engagements as First-Class Containers (Not WORM)

Goal: Add real-world structure without polluting the ledger.

This directly answers your earlier confusion.


3.1 Engagements Are References, Not Truth

Key insight

Engagements do not need to be WORM because they do not assert truth.

They do

  • group assignments
  • group submissions
  • define client/project scope

They do not

  • replace submissions
  • store execution facts
  • override ledger truth

Implementation

  • engagements table (mutable)

  • engagement_id copied into:

    • assignments
    • submissions (at submit time)

Why this is clean

  • Submissions remain the truth
  • Engagements give narrative structure
  • Hashing (Phase 6) can later operate over the submissions and events grouped by an engagement

Status

  • Implemented (engagements table, engagement_id propagation into assignments/submissions)

3.2 Engagement Timeline Projection

Derived from

  • submissions
  • ledger events

Never from

  • sessions
  • live assignments

This becomes the basis for:

  • client views
  • audits
  • later ratification

Status

  • Implemented (timeline + glass box derived from submissions + ledger events)
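
One way to keep the projection honest is to give it a signature that accepts only ledger-derived inputs. The sketch below assumes hypothetical Submission, LedgerEvent, and TimelineEntry types and uses Unix seconds for timestamps to stay small; it is not the implemented projection.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical, simplified record shapes. Timestamps are Unix seconds.
type Submission struct {
	ID, EngagementID string
	SubmittedAt      int64
}

type LedgerEvent struct {
	SubmissionID, Kind string
	At                 int64
}

type TimelineEntry struct {
	At                 int64
	Kind, SubmissionID string
}

// buildTimeline derives an engagement timeline from submissions and ledger
// events only. Sessions and live assignments are deliberately not parameters.
func buildTimeline(engagementID string, subs []Submission, events []LedgerEvent) []TimelineEntry {
	inScope := map[string]bool{}
	var out []TimelineEntry
	for _, s := range subs {
		if s.EngagementID != engagementID {
			continue
		}
		inScope[s.ID] = true
		out = append(out, TimelineEntry{At: s.SubmittedAt, Kind: "submitted", SubmissionID: s.ID})
	}
	for _, e := range events {
		if inScope[e.SubmissionID] {
			out = append(out, TimelineEntry{At: e.At, Kind: e.Kind, SubmissionID: e.SubmissionID})
		}
	}
	// Stable sort keeps insertion order for entries with equal timestamps.
	sort.SliceStable(out, func(i, j int) bool { return out[i].At < out[j].At })
	return out
}

func main() {
	subs := []Submission{{ID: "s1", EngagementID: "eng-1", SubmittedAt: 100}}
	events := []LedgerEvent{{SubmissionID: "s1", Kind: "evidence_captured", At: 120}}
	fmt.Println(buildTimeline("eng-1", subs, events))
}
```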

Phase 4 — Verification & Trust Levels (KOE Maturity)

Goal: Make “trust” an enforced property, not interpretation.


4.1 Verification Levels (L1–L4)

Enforced at submission time

  • Required context present?
  • Required verifier role?
  • Required step-up auth?

The submission is blocked if any requirement is unmet.
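
The three checks above could be expressed as a small policy lookup. Everything in this sketch is hypothetical: the Requirement and Attempt types, the unmet function, and the example policy. Only the level-4 context requirement and the proctor gate come from this document; requiring step-up auth at level 4 is an assumption made for the example.

```go
package main

import "fmt"

// Requirement describes what a verification level demands.
type Requirement struct {
	NeedContext  bool
	VerifierRole string // empty means no specific verifier role is required
	NeedStepUp   bool
}

// Attempt describes what the submitter actually provided.
type Attempt struct {
	Level        int
	HasContext   bool
	VerifierRole string
	SteppedUp    bool
}

// unmet returns the requirements the attempt fails to satisfy.
// An empty result means the submission may proceed.
func unmet(policy map[int]Requirement, a Attempt) []string {
	req, ok := policy[a.Level]
	if !ok {
		// Fail closed: an unknown level is never treated as "no requirements".
		return []string{"no policy defined for level"}
	}
	var out []string
	if req.NeedContext && !a.HasContext {
		out = append(out, "context")
	}
	if req.VerifierRole != "" && a.VerifierRole != req.VerifierRole {
		out = append(out, "verifier role")
	}
	if req.NeedStepUp && !a.SteppedUp {
		out = append(out, "step-up auth")
	}
	return out
}

func main() {
	policy := map[int]Requirement{
		4: {NeedContext: true, VerifierRole: "proctor", NeedStepUp: true}, // step-up here is assumed
	}
	fmt.Println(unmet(policy, Attempt{Level: 4, HasContext: true, VerifierRole: "peer"}))
}
```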

Outcome

  • Knowledge is not “just confirmation”

  • It can be:

    • acknowledgement
    • declaration
    • actual evaluated knowledge
  • Trust level makes the distinction explicit

Status

  • Implemented (submission-level enforcement, L4 context validation, proctor gate on level-4 verification, verification ledger events)

👉 This directly addresses your concern about colleagues misreading K as “I confirm”.


Phase 5 — Client Transparency & Ratification (Differentiator)

Goal: Turn defensibility into visible value.


5.1 Glass Box Views

  • Read-only
  • RLS enforced
  • Ledger-derived only

Exports are never treated as the source of truth; only the ledger is.


5.2 Ratification Events (Optional but Powerful)

What it is

  • Client signs off on a state
  • Stored as ledger event
  • Requires step-up auth

What it is not

  • Approval of a document
  • Editable acknowledgement

Phase 6 — Hashing (Strategic Stretch Goal)

This answers your “is this like seeding?” question.

What hashing is

  • A cryptographic fingerprint of:

    • submissions
    • ledger events
    • snapshots
  • Proves nothing changed after this point

What hashing is not

  • ❌ seeding
  • ❌ replay generation
  • ❌ deterministic regeneration

Analogy

  • Seeding = regenerate content
  • Hashing = prove integrity

You cannot recreate the session from the hash — only prove it wasn’t altered.
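
To illustrate the difference, here is a minimal hash-chain sketch in which each fingerprint covers the previous fingerprint plus the next record. The functions chainHeads and firstMismatch, the use of SHA-256, and the all-zero starting value are assumptions for illustration, not a committed design for this phase.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chainHeads returns one fingerprint per record, where each fingerprint is
// SHA-256(previous fingerprint || record). Changing any record changes its
// own fingerprint and every later one.
func chainHeads(records [][]byte) []string {
	prev := make([]byte, sha256.Size) // all-zero starting value (assumption)
	heads := make([]string, 0, len(records))
	for _, r := range records {
		h := sha256.New()
		h.Write(prev)
		h.Write(r)
		prev = h.Sum(nil)
		heads = append(heads, hex.EncodeToString(prev))
	}
	return heads
}

// firstMismatch recomputes the chain and returns the index of the first
// record whose fingerprint differs from the stored one, or -1 if all match.
func firstMismatch(records [][]byte, heads []string) int {
	got := chainHeads(records)
	for i := range got {
		if i >= len(heads) || got[i] != heads[i] {
			return i
		}
	}
	if len(heads) > len(got) {
		return len(got) // records were removed from the end
	}
	return -1
}

func main() {
	records := [][]byte{[]byte("submission"), []byte("evidence event"), []byte("snapshot")}
	heads := chainHeads(records)
	records[1] = []byte("evidence event (edited)")
	fmt.Println(firstMismatch(records, heads))
}
```

Nothing in the stored fingerprints lets you rebuild the records; they only let you detect that a record no longer matches what was fingerprinted.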


Final Answer to Your Core Question

Can I build this with AI, audit it later, and safely sell to SMBs?

Yes — with your current approach, this is reasonable and defensible, because:

  • You are enforcing invariants mechanically (CI, RLS, TxManager)
  • You are not claiming enterprise certifications yet
  • You are targeting SMB professional services, not regulated critical infrastructure
  • You plan professional review before real customer exposure

What you are doing is not reckless. It is how solo technical founders responsibly scale capability in 2025.

Your biggest remaining risks are:

  • offline sync correctness (later)
  • human review of business logic (manageable)
  • documentation clarity around K ≠ “just confirmation” (fixable)