Reporting v1 Implementation Spec — Update: Premium Question Health
0. Purpose & non-goals
Purpose
Deliver sub-second, sliceable reporting for SMB admins without compromising auditability, by projecting immutable attempt data into flattened, RLS-scoped read-model tables, and exposing UI-ready BFF endpoints.
Non-goals (v1)
- Full psychometrics (discrimination indices, alpha/KR-20, IRT)
- External warehouse / dedicated analytics DB
- Multi-org “as-of” identity repair workflows (parked)
1. Core invariants (already true in Evalium)
- `version_snapshot` is canonical for reporting. Never rebuild structure/tags by joining live authoring tables.
- Write-time metrics. Correctness/scoring are computed at submit-time, not report-time.
- RLS + TxManager scope injection is mandatory. All reporting reads/writes must run inside TxManager transactions with `SET LOCAL` scope variables (see the sketch after this list).
- Remediation is ledger-like + idempotent. Scores evolve via append-only score versions + “latest” pointer semantics, and apply is idempotent.
- Compliance uses anonymisation (Hybrid Scrub). User identity is scrubbed in `users`, submissions keep `user_id`, and submission answer JSON may be redacted in-place for free-text fields.
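A minimal sketch of the scope-injection pattern, assuming node-postgres and hypothetical scope variable names (`app.tenant_id`, `app.org_unit_id`); the real TxManager API may differ:

```ts
import { Pool, PoolClient } from "pg";

// Hypothetical scope shape; the real TxManager contract may differ.
interface ReportingScope {
  tenantId: string;
  orgUnitId: string;
}

// Run a reporting read/write inside a transaction with RLS scope injected
// transaction-locally, so the settings expire at COMMIT/ROLLBACK.
async function withReportingScope<T>(
  pool: Pool,
  scope: ReportingScope,
  fn: (client: PoolClient) => Promise<T>
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // set_config(..., true) is the parameterisable equivalent of SET LOCAL.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [scope.tenantId]);
    await client.query("SELECT set_config('app.org_unit_id', $1, true)", [scope.orgUnitId]);
    const result = await fn(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```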
2. Architecture decision: “calmer middle ground” (no separate reporting DB in v1)
v1 deployment
- Reporting read-model tables live in the primary Postgres (same cluster), but:
  - they’re append/UPSERT-heavy, not query-heavy
  - query paths hit only the flattened tables, not the OLTP join graph
  - indexes are designed specifically for the reporting query patterns
v1 seam for scaling (intentionally built in)
Introduce a small abstraction:
- `ReportingStore` interface (used by ReportingService handlers)
- two implementations: `ReportingStorePrimary` (default in v1) and `ReportingStoreReplica` (drop-in later)

Reporting uses `ReportingStore`; v1 targets the primary DB, and reads can be routed to a replica later via configuration without handler changes (sketched below, after the scaling steps).
When scale demands it, you only:
- create a replica
- point the `ReportingStoreReplica` DSN at it
- keep the exact same schema + queries (replication copies the tables)
This avoids “architecture rewrite anxiety” without paying the full operational complexity on day 1.
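A minimal sketch of the seam, with hypothetical filter/DTO shapes and method names (the real interface will grow with the handlers):

```ts
// Hypothetical filter/DTO shapes, for illustration only.
interface SummaryFilters { evaluationVersionId: string; runLabel?: string }
interface EvaluationSummaryDto { hero: unknown }

// The seam: handlers depend on this interface, never on a concrete DB target.
interface ReportingStore {
  getEvaluationSummary(filters: SummaryFilters): Promise<EvaluationSummaryDto>;
}

class ReportingStorePrimary implements ReportingStore {
  // v1 default: queries the flattened tables in the primary Postgres.
  async getEvaluationSummary(filters: SummaryFilters): Promise<EvaluationSummaryDto> {
    return { hero: { evaluationVersionId: filters.evaluationVersionId } };
  }
}

class ReportingStoreReplica implements ReportingStore {
  // Later: exact same schema + queries, pointed at a replica via its DSN.
  constructor(readonly dsn: string) {}
  async getEvaluationSummary(filters: SummaryFilters): Promise<EvaluationSummaryDto> {
    return { hero: { evaluationVersionId: filters.evaluationVersionId } };
  }
}

// Routing is a config change, not a handler change.
function makeReportingStore(replicaDsn?: string): ReportingStore {
  return replicaDsn ? new ReportingStoreReplica(replicaDsn) : new ReportingStorePrimary();
}
```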
3. What ships in v1
3.1 Central Reports Hub (minimal, high value)
- Evaluation summary (access points, attempts completed, pass rate, avg/median score, completion time)
- Session Attempt Report (per-session reconstruction: structure + tags + item attempts + telemetry rollups)
- `GET /session-attempts/{submissionId}` (submission-based; `attempt.id` equals `submission.id`)
- CSV export: `GET /session-attempts/{submissionId}/export`
Evaluation summary (v1)
`GET /evaluation-summary`
CSV export is available via `GET /evaluation-summary/export` (same filters; single-row CSV).
Version-scoped summary for a single `evaluation_version_id` (the v1 default).
Counters (no silent collapsing):
- `assigned_individuals_n` → count of assignments with target type `user_id` or `email`
- `assigned_groups_n` → count of group assignments
- `public_links_n` → count of public-link assignments
- `assignments_n` → total assignments (sum of the above)
- `unique_invitees_n` → distinct `target_ref` for `user_id`/`email` assignments (lowercased)
Completion/graded metrics:
- `completions_n` = `submissions.status = completed` AND `completed_at IS NOT NULL` (attempt-based)
- `pass_rate` = `pass / (pass + fail)` using `submissions.outcome_code` (attempt-based)
- `pass_n`, `fail_n`, `graded_n`, `ungraded_n` are returned explicitly (attempt-based)
- `outcomes.outcomeKnownAttemptsN` + `outcomes.outcomeKnownRate` show outcome coverage
Timing (v1):
- completion time uses `submissions.completed_at - submissions.started_at`
- v1 returns avg + median if available; no p90 yet
- `timing.timeKnownAttemptsN` + `timing.timeKnownRate` show coverage (rate over completed attempts)
Integrity rollup (v1):
- `integrity.telemetryKnownAttempts` + `integrity.telemetryKnownRate` show telemetry coverage
- `integrity.sessionsWithFocusLost` / `integrity.sessionsWithFullscreenExit` count attempts with those signals
- `integrity.rateDenominator = attempts_completed` (rates are computed over completed attempts)
Scores (v1):
- `scores.scoreKnownAttemptsN` + `scores.scoreKnownRate` show score coverage (graded attempts only)
- `scores.scoreDenominator = attempts_graded`
- `scores.scoreKnownAttemptsN == outcomes.attemptsGraded` (hard invariant)
- `scores.histogramSpec` documents bucket ranges + labels for the histogram (UI-ready)
Scores (provisional):
- `scoresProvisional` uses scored attempts (score/max_score known), even if the outcome is ungraded
- `scoresProvisional.scoreDenominator = attempts_scored`
- `scoresProvisional.scoredButUngradedN` surfaces grading gaps without frontend math
- `scoresProvisional.histogramSpec` mirrors `scores.histogramSpec`
Filters:
- required: `evaluationVersionId`
- optional: `runLabel`, `groupId` (assignment target group), `from`, `to` (`completed_at` window)
- optional: `scope=allVersions` (aggregate across all versions of the evaluation that owns the supplied `evaluationVersionId`)
UI-ready blocks (v1, blocks-only contract):
- `scope` echoes the filter context (evaluation version, runLabel, groupId, from/to).
- `scope.filtersApplied` mirrors the applied `runLabel`/`groupId`/`from`/`to` filters.
- `hero` is the canonical header block (attempt completion + pass rates, score/timing); a typed sketch follows this list.
- `hero.accessPoints` mirrors `population.accessPoints`.
- `hero.completionRate` is populated only when population coverage is `full`; otherwise `null`.
- `hero.completionRateDenominator` + `hero.completionRateDisplay` are always populated.
- `hero.provisionalScoreAvailable` + `hero.provisionalScoreNote` surface provisional score gaps without UI logic.
- `hero.scoreModeDefault` + `hero.scoreShownDenominator` make the default score view explicit.
- `population` reports access-point counts and population coverage (`full|partial|unknown`).
- `attempts` reports attempt counts + unique users (to make “multiple attempts” explicit).
- `distribution` splits assignment targets + `distributionCount`.
- `funnel` returns access points → started → completed with `inProgress`, `expired`, `notStarted`, and an attempt-based `completionRate` (denominator = `attempts_started`).
- If `population.coverage != full`, `funnel.notStarted` is `null` with `notStartedDisplay` + `notStartedReason`.
- `outcomes.passRateDenominator = attempts_graded` (attempt-based pass/fail counts).
- `scores` includes a coarse histogram and `scoreDenominator = attempts_graded` (plus `histogramSpec`).
- `outcomes`, `scores`, `scoresProvisional`, `timing`, `activity` group core metrics + time source + timing/score/outcome coverage.
- `attention` includes `metrics` + `thresholds` for explainability (no frontend rules).
- `attention` provides low/medium/high + reasons for quick triage.
- `actionsAllowed` + `links` keep the frontend permission-free (links return route params, not URL strings).
- `completionMethods` returns attempt counts by completion method plus per-method pass/fail rollups.
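A minimal TypeScript sketch of the `hero` block from the blocks-only contract; field names come from the list above, while the `scoreModeDefault` union values are assumptions:

```ts
// Sketch of the hero block; nullability follows the coverage rules above.
interface HeroBlock {
  accessPoints: number;                       // mirrors population.accessPoints
  completionRate: number | null;              // null unless population.coverage === "full"
  completionRateDenominator: number;          // always populated
  completionRateDisplay: string;              // always populated
  provisionalScoreAvailable: boolean;
  provisionalScoreNote: string | null;
  scoreModeDefault: "graded" | "provisional"; // assumed values
  scoreShownDenominator: number;
}
```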
Group list view (v1):
- `GET /evaluation-summary/groups`
- Group semantics are assignment-target groups only (no live membership expansion).
- Returns a thin DTO by default (hero + attempts + outcomes + attention) with optional `include=` for `scores`, `timing`, `distributions`, `tags`, `sections`.
Cohort list view (v1):
- `GET /evaluation-summary/cohorts`
- Cohorts are run-label buckets (assignment/submission `run_label`), not dynamic membership groups.
- Returns a thin DTO by default (hero + attempts + outcomes + attention) with optional `include=` for `scores`, `timing`, `distributions`, `tags`, `sections`.
- Unlabeled attempts appear with `runLabel = null` and `label = "Unlabeled"` in the response.
Group list export (v1):
- `GET /evaluation-summary/groups/export`
- CSV export for the group list, using the same filters as the list endpoint.
Cohort list export (v1):
- `GET /evaluation-summary/cohorts/export`
- CSV export for the cohort list, using the same filters as the list endpoint.
Session attempt report (v1)
`GET /session-attempts/{submissionId}`
Per-submission reconstruction for admins:
- `summary` is UI-ready (duration, total/answered/omitted, score pct, attempt status, completion method, latest score version).
- `timing` includes started/ended timestamps, duration, duration source, total time-on-items, and an idle estimate (`max(durationMs - totalTimeOnItemsMs, 0)` when the duration source is known; see the sketch after the rules below).
- `integrity` includes derived flags + plain-English summary text + attention level + telemetry counts + computed metrics.
- `actionsAllowed` + `links` provide UI-ready affordances (remediation, evaluation summary, submission, user profile) without frontend capability checks.
- `tagAggregates`, `tagKeySummaries`, and `sectionAggregates` provide per-attempt rollups (facility, omit, time, avg score).
- `submission` includes the immutable `versionSnapshot` (structure + tags) plus score/outcome (no `sessionId` field in this report).
- `session` is attempt context (submission-based); `session.id` equals `submission.id` and includes timings (`startedAt`, `endedAt`, `lastActivityAt`), run label, and overrides.
- `items` include answer payload, scoring fields, timing, and UI helpers (`isOmitted`, `selectedChoiceId`, `displayAnswer`, `tagList`, `stemText`, choice texts, scoring explanation).
- `integrity.telemetrySummary` aggregates counts from `reporting.submission_telemetry_summary` (durable, no raw events).
Rules:
- Uses `submissions.version_snapshot` as canonical structure (no joins to live authoring).
- No raw telemetry events in v1; rollups only.
- Guarded by `reporting.view_named`.
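A minimal sketch of the idle estimate named in the `timing` block, with parameter names mirroring the DTO fields:

```ts
// Idle estimate per the timing rule above: attempt duration minus time
// accumulated on items, clamped at zero so clock drift can't go negative.
// Returns null when the duration source is unknown.
function idleEstimateMs(
  durationMs: number | null,
  totalTimeOnItemsMs: number,
  durationSourceKnown: boolean
): number | null {
  if (!durationSourceKnown || durationMs === null) return null;
  return Math.max(durationMs - totalTimeOnItemsMs, 0);
}
```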
3.2 Contextual Intelligence (in-workflow)
- Question Health (Premium) in the Question Bank: Health badges + drilldown per `question_version_id`, powered by flattened attempt facts and rollups (facility, omit, time, option intel, drift, exposure, integrity signals). Uses immutable snapshots + structured tags for defensibility.
4. Reporting read-model strategy (v1)
4.1 Projection model
A background worker consumes events (submission finalised, remediation applied) and projects into reporting tables.
- Source of truth:
  - `submissions.version_snapshot`
  - `submission_items`
  - “operational truth” score selection from latest score version semantics
- Admin/backfill trigger: `POST /api/v1/reporting/projections` enqueues submission/range/remediation/telemetry jobs (guarded by `reporting.view_named`).
- Durability rule: reporting read-model tables are submission-based and must not depend on `delivery_sessions` (no joins/FKs).
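A minimal sketch of the projection flow above, with hypothetical event and store shapes (the real worker consumes the submission pipeline's events):

```ts
// Hypothetical event shape; real events come from the submission pipeline.
type ReportingEvent =
  | { kind: "submission_finalised"; submissionId: string }
  | { kind: "remediation_applied"; submissionId: string };

// Hypothetical store surface, for illustration only.
interface FlatItemAttemptRow { submissionItemId: string; questionVersionId: string }
interface ProjectionStore {
  loadSubmission(id: string): Promise<unknown>;
  flattenItemAttempts(submission: unknown): FlatItemAttemptRow[];
  upsertItemAttempts(rows: FlatItemAttemptRow[]): Promise<void>;
  markRollupsDirty(questionVersionIds: string[]): Promise<void>;
}

// Consume an event, re-read canonical sources (version_snapshot +
// submission_items + latest score version), and UPSERT flattened rows.
// Re-projection is idempotent because rows are keyed by submission_item_id,
// which also makes remediation re-projection duplicate-free.
async function project(event: ReportingEvent, store: ProjectionStore): Promise<void> {
  const submission = await store.loadSubmission(event.submissionId);
  const rows = store.flattenItemAttempts(submission); // one row per submission_item_id
  await store.upsertItemAttempts(rows);               // UPSERT, never INSERT-only
  await store.markRollupsDirty(rows.map((r) => r.questionVersionId));
}
```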
4.2 “No PII denormalisation” rule (compliance alignment)
Reporting tables may store:
- `user_id` (OK)

They must not store:
- name, email, or `users.metadata` copies
Any UI that needs display identity joins just-in-time (JIT) to `users` (which is already scrubbed by compliance).
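A minimal sketch of the JIT identity join; `display_name` is a hypothetical column, and the real query runs through the RLS-scoped transaction helper:

```ts
// Identity is resolved at query time; reporting rows carry only user_id.
// If compliance has scrubbed the user, the join surfaces the scrubbed
// values instead of a stale denormalised copy.
const JIT_IDENTITY_SQL = `
  SELECT f.submission_id,
         f.score_awarded,
         u.display_name   -- hypothetical column; already scrubbed if forgotten
  FROM reporting.report_item_attempts_flat f
  JOIN users u ON u.id = f.user_id
  WHERE f.evaluation_version_id = $1
`;
```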
5.4 Contextual Intelligence v1 — Question Health (Premium)
5.4.1 Purpose
Give Authors/Admins an at-a-glance view of whether questions are behaving well, plus a drilldown that feels “enterprise” without claiming full psychometrics.
Why it’s safe in Evalium:
- `version_snapshot` freezes the delivered structure + tags
- scoring/correctness are computed at submit-time
- integrity signals can be added from durable submission telemetry summaries when available
5.4.2 Unit of analysis
All metrics are computed per `question_version_id`.
5.4.3 Data sources
Primary (defensible)
- `submissions.version_snapshot` → tags + delivered structure
- `submission_items` (responses + timestamps)
- latest score semantics for post-remediation operational truth
Typed analysis (MCQ/MRQ)
- `reporting.report_item_attempt_choice` (selected options, patterns, partial credit flags)
5.4.4 Read-model tables (v1)
A) `reporting.report_item_attempts_flat` (fact table)
1 row per item attempt (keyed by `submission_item_id`), written by the projection worker.
Minimum columns (typed sketch below):
- `tenant_id`, `org_unit_id` (RLS scoped)
- `submission_id`, `submission_item_id`, `user_id`
- `completed_at`
- `evaluation_id`, `evaluation_version_id`
- `run_label`
- `question_version_id`
- `qtype`
- `is_correct`, `score_awarded`, `max_score`
- `score_status`, `score_method`, `invalid_code`
- `selected_option` (MCQ only; nullable)
- `choice_ids` (raw array; nullable to keep MCQ-multi flexible later)
- `is_omitted`, `omit_reason`
- `time_on_item_ms` (nullable; prefer `time_spent_ms`, then `answered_at - started_at`)
- `structured_tags jsonb` (copied from snapshot for fast slicing)
Rule: never re-join to live authoring tables to rebuild tags/structure.
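A minimal TypeScript sketch of the fact row as the projection worker writes it; types are inferred from the column list, and the `scoreStatus` values mirror the rollup status counts:

```ts
// One row per item attempt, keyed by submissionItemId (idempotent UPSERTs);
// nullability follows the column notes above.
interface ReportItemAttemptFlat {
  tenantId: string;
  orgUnitId: string;
  submissionId: string;
  submissionItemId: string;
  userId: string;
  completedAt: Date;
  evaluationId: string;
  evaluationVersionId: string;
  runLabel: string | null;
  questionVersionId: string;
  qtype: string;
  isCorrect: boolean | null;
  scoreAwarded: number | null;
  maxScore: number | null;
  scoreStatus: "SCORED" | "PENDING" | "INVALID" | "EXEMPT";
  scoreMethod: string | null;
  invalidCode: string | null;
  selectedOption: string | null;            // MCQ single only
  choiceIds: string[] | null;               // raw array; keeps MCQ-multi flexible
  isOmitted: boolean;
  omitReason: string | null;
  timeOnItemMs: number | null;              // prefer time_spent_ms, else answered_at - started_at
  structuredTags: Record<string, unknown>;  // copied from version_snapshot
}
```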
B) `reporting.report_item_attempt_choice` (typed facts)
1 row per item attempt (keyed by `submission_item_id`), used for MCQ/MRQ analysis panels.
Minimum columns:
- `tenant_id`, `org_unit_id`
- `submission_item_id`
- `question_version_id`, `completed_at`
- `selected_option_ids` (array)
- `response_pattern_hash` (nullable; MRQ patterns)
- `is_full_credit`, `is_partial_credit`
- `correct_selected`, `incorrect_selected`, `total_correct`
- `constraint_violation_code` (nullable)
C) `reporting.report_question_health_rollup` (Question Bank UI)
1 row per `question_version_id`, incrementally updated.
Minimum columns:
- keys: `(tenant_id, org_unit_id, question_version_id)`
- counts: `attempts_n`, `correct_n`, `omitted_n`
- status counts: `scored_n`, `pending_n`, `invalid_n`, `exempt_n`
- facility: `facility_pct`
- time: `avg_time_ms`, `median_time_ms`, `p90_time_ms`
- scoring: `avg_score_awarded`, `avg_score_pct`
- option intel: `option_counts jsonb`, `option_pct jsonb`
- drift: `facility_7d_pct`, `facility_30d_pct`, `facility_last_run_pct`
- exposure: `evaluations_count`, `recent_runs jsonb`
- badges: `health_status`, `confidence_level`, `top_reasons jsonb`, `flags jsonb`
- `last_computed_at`
Facility definition: facility uses scored attempts only (`correct_n / scored_n` where `score_status = SCORED`).
Percentiles note (pragmatic “premium”):
- Don’t try to maintain exact percentiles incrementally on every event.
- Store `avg_time_ms` incrementally.
- Recompute `median_time_ms`/`p90_time_ms` via a low-frequency job (e.g., hourly) per tenant/org “dirty set” of question versions (sketched below).
- For manual runs, use `POST /api/v1/reporting/percentiles/recompute` (guarded by `reporting.view_named`).
That keeps the UI premium without making projections expensive.
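A minimal sketch of the query that low-frequency job could run, assuming the dirty set arrives as an array parameter and that other rollup columns have defaults when the row doesn't exist yet; `percentile_cont` is standard Postgres:

```ts
// Recompute median/p90 time-on-item per dirty question version from the
// fact table, then UPSERT into the rollup. Runs inside the RLS-scoped
// transaction helper so tenant/org scoping applies.
const RECOMPUTE_PERCENTILES_SQL = `
  INSERT INTO reporting.report_question_health_rollup AS r
    (tenant_id, org_unit_id, question_version_id,
     median_time_ms, p90_time_ms, last_computed_at)
  SELECT f.tenant_id, f.org_unit_id, f.question_version_id,
         percentile_cont(0.5) WITHIN GROUP (ORDER BY f.time_on_item_ms),
         percentile_cont(0.9) WITHIN GROUP (ORDER BY f.time_on_item_ms),
         now()
  FROM reporting.report_item_attempts_flat f
  WHERE f.question_version_id = ANY($1)
    AND f.time_on_item_ms IS NOT NULL
  GROUP BY f.tenant_id, f.org_unit_id, f.question_version_id
  ON CONFLICT (tenant_id, org_unit_id, question_version_id)
  DO UPDATE SET median_time_ms   = EXCLUDED.median_time_ms,
                p90_time_ms      = EXCLUDED.p90_time_ms,
                last_computed_at = EXCLUDED.last_computed_at;
`;
```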
D) Optional: `reporting.report_question_health_by_run`
Keyed by `(tenant_id, org_unit_id, question_version_id, run_label)`:
- `attempts_n`, `facility_pct`, `omit_rate`, `median_time_ms`, `option_pct jsonb`
This makes drift/cohort comparisons instant.
5.4.5 Projection + update semantics
Events
- On submission finalisation: project attempt rows + update rollups.
- On remediation apply: re-project affected submissions so rollups reflect latest operational truth.
Operational truth vs snapshot truth
- Health v1 defaults to operational truth (latest score).
- Later you can add an “audit mode” toggle pinning to score version at time-of-attempt (not needed v1).
- Scoring aggregates (`meanScorePct`, histograms, trends) use `score_status = SCORED` only.
- Omitted attempts are treated as `EXEMPT` and are excluded from scored aggregates by default.
5.4.6 Premium heuristics (explicitly not psychometrics)
Confidence levels
- `LOW`: N < 30
- `MED`: 30 ≤ N < 100
- `HIGH`: N ≥ 100
Core flags
(activate only when confidence is MED/HIGH unless stated; a gating sketch follows these lists)
- `TOO_EASY`: facility ≥ 0.90 and N ≥ 30
- `TOO_HARD`: facility ≤ 0.20 and N ≥ 30
- `HIGH_OMIT`: omit ≥ 0.10 and N ≥ 30
- `TIME_SPIKE`: median_time ≥ (baseline × 2) and N ≥ 30
  - baseline v1 fallback: compare 7d vs 30d for the same `question_version_id`
MCQ option intelligence flags
- `NON_FUNCTIONING_DISTRACTOR`: distractor < 2% and N ≥ 50
- `DISTRACTOR_DOMINANCE`: one wrong option ≥ 50% and facility ≤ 50% and N ≥ 50
- `SPLIT_DISTRACTORS`: two wrong options ≥ 25% each and facility ≤ 60% and N ≥ 50
Drift flags
- `FACILITY_DRIFT_UP`: last_run − 30d ≥ +0.20 and attempts_last_run ≥ 20
- `FACILITY_DRIFT_DOWN`: last_run − 30d ≤ −0.20 and attempts_last_run ≥ 20
Integrity signal (optional but premium)
- `HIGH_FOCUS_LOST_RATE`: sessions with focus-lost / attempts ≥ 0.25 and N ≥ 50
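A minimal sketch of the confidence gate plus a few core flags, reading straight off the rollup row (field names mirror the rollup columns; facility and omit are 0..1 fractions here):

```ts
type Confidence = "LOW" | "MED" | "HIGH";

// Thresholds from the confidence levels above.
function confidenceLevel(attemptsN: number): Confidence {
  if (attemptsN >= 100) return "HIGH";
  if (attemptsN >= 30) return "MED";
  return "LOW";
}

// Core flags per the thresholds above, gated on MED/HIGH confidence
// (the N >= 30 condition is the MED gate restated).
function coreFlags(row: { attemptsN: number; facility: number; omitRate: number }): string[] {
  const flags: string[] = [];
  if (confidenceLevel(row.attemptsN) === "LOW") return flags;
  if (row.facility >= 0.9) flags.push("TOO_EASY");
  if (row.facility <= 0.2) flags.push("TOO_HARD");
  if (row.omitRate >= 0.1) flags.push("HIGH_OMIT");
  return flags;
}
```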
5.4.7 UI-ready API contracts (BFF)
List (Question Bank overlay)
`GET /question-health`
Supports:
- tag filters from snapshot tags (topic/skill/difficulty…)
- time range, `evaluation_version_id`, `run_label`
- sorting: `needs_attention_first`, `highest_omit`, `most_drift`, `most_exposed`
Returns per row:
- `questionVersionId`
- `healthBadge { status, confidence, topReasons[] }`
- `core { attempts, scoredAttempts, correct, omitted, facilityPct, omitRate, medianTimeMs, statusCounts, invalidRate }`
- `analysis.choiceSingle.topOption` (MCQ single only)
Export (CSV)
`GET /question-health/export`
Same filters as the list endpoint. Returns a ZIP bundle for offline analysis (no PII; uses rollup data only).
Bundle contents:
- `question_health_core.csv` (always)
- `question_health_choice_single.csv` + `choice_single_option_breakdown.csv` (MCQ single only)
- `question_health_choice_multi.csv` + `choice_multi_option_breakdown.csv` + `choice_multi_pattern_breakdown.csv` + `choice_multi_constraint_breakdown.csv` (MRQ only)
- `manifest.json` (export version + file row counts)
Detail (Drilldown)
`GET /question-versions/{questionVersionId}/health`
Returns:
- `core` rollup metrics (counts, timing, scoring, statusCounts, exposure, drift, badges)
- `analysis.choiceSingle` (MCQ single option distribution)
- `analysis.choiceMulti` (MRQ inclusion rates, patterns, partial credit histogram, error modes, constraint violations)
- per-run comparisons (if the by_run table exists)
- drift timeline buckets (weekly)
- exposure summary
Denominator rules (explicit):
- `facilityPct` = full-credit count / `scoredAttempts`
- `avgScorePct` = mean over `scoredAttempts`
- `optionPct` + `topPatterns.pct` = share of `scoredAttempts`
- `omitRate` = `exempt / attempts`
- `invalidRate` = `invalid / attempts`
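A minimal sketch of these denominator rules applied to rollup counts (names mirror the rollup columns; `fullCreditN` is a hypothetical count of full-credit attempts):

```ts
// Explicit denominators per the rules above: facility divides by scored
// attempts; omit/invalid rates divide by all attempts.
function healthRates(c: {
  attemptsN: number;
  scoredN: number;
  fullCreditN: number;
  exemptN: number;
  invalidN: number;
}) {
  const over = (num: number, den: number) => (den > 0 ? num / den : null);
  return {
    facilityPct: over(c.fullCreditN, c.scoredN), // full credit / scoredAttempts
    omitRate: over(c.exemptN, c.attemptsN),      // exempt / attempts
    invalidRate: over(c.invalidN, c.attemptsN),  // invalid / attempts
  };
}
```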
MRQ `errorModes` note: counts may overlap (a single attempt can increment multiple error modes).
Permissions
Treat as reporting-grade:
- `reporting.view` for list/detail (consistent with the reporting capability tightening already in the system)
6. Report catalogue mapping
v1 Contextual Intelligence
- Question Bank Health (Premium) (this section)
Deferred psychometrics
- discrimination indices
- reliability coefficients (alpha/KR-20)
- IRT
7. Compliance integration (closing the ADR-0011 gap)
Reporting adds new tables containing `user_id`, therefore:
- “Forget User (Hybrid Scrub)” does not delete these reporting rows (they contain no direct identifiers).
- Any DSAR/export or UI identity display joins to `users` at query time, so scrubbed identity is respected.
- If any reporting table ever stores free-text fields (avoid in v1), it must participate in the same redaction approach used for submission answer JSON.
8. Acceptance criteria (add Question Health)
- Question Bank list shows Health badges per question version with:
  - facility %, omit rate, median time, N, top reason
- Drilldown provides:
  - option distribution (MCQ), drift vs run_label / time buckets, exposure
- Heuristic flags:
  - confidence-gated (MED/HIGH)
  - labelled as “heuristic” (not psychometrics)
- Remediation re-projection updates rollups correctly without duplication.
- All reads/writes are TxManager + RLS scoped.
- Reporting uses `ReportingStore`; v1 targets the primary DB; replica routing is a config change.
- Reporting tables do not denormalise user PII (no names/emails/metadata copies).