Reporting v1 Implementation Spec — Update: Premium Question Health

0. Purpose & non-goals

Purpose

Deliver sub-second, sliceable reporting for SMB admins without compromising auditability, by projecting immutable attempt data into flattened, RLS-scoped read-model tables, and exposing UI-ready BFF endpoints.

Non-goals (v1)

  • Full psychometrics (discrimination indices, alpha/KR-20, IRT)
  • External warehouse / dedicated analytics DB
  • Multi-org “as-of” identity repair workflows (parked)

1. Core invariants (already true in Evalium)

  1. version_snapshot is canonical for reporting. Never rebuild structure/tags by joining live authoring tables.

  2. Write-time metrics. Correctness/scoring are computed at submit-time, not report-time.

  3. RLS + TxManager scope injection is mandatory. All reporting reads/writes must run inside TxManager transactions with SET LOCAL scope variables (see the sketch after this list).

  4. Remediation is ledger-like + idempotent. Scores evolve via append-only score versions + “latest” pointer semantics, and apply is idempotent.

  5. Compliance uses anonymisation (Hybrid Scrub). User identity is scrubbed in users, submissions keep user_id, and submission answer JSON may be redacted in-place for free-text fields.
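
A minimal sketch of invariant 3, assuming node-postgres; the real TxManager API and the scope variable names (app.tenant_id etc.) are assumptions and may differ:

```typescript
import { Pool, PoolClient } from "pg";

// Hypothetical scope shape; field names illustrate the pattern, not the real TxManager contract.
interface ReportingScope {
  tenantId: string;
  orgUnitId: string;
}

const pool = new Pool();

// Every reporting read/write runs inside a transaction that first injects RLS scope.
// set_config(..., true) is transaction-local, i.e. equivalent to SET LOCAL.
async function withReportingScope<T>(
  scope: ReportingScope,
  fn: (client: PoolClient) => Promise<T>
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [scope.tenantId]);
    await client.query("SELECT set_config('app.org_unit_id', $1, true)", [scope.orgUnitId]);
    const result = await fn(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```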


2. Architecture decision: “calmer middle ground” (no separate reporting DB in v1)

v1 deployment

  • Reporting read-model tables live in the primary Postgres (same cluster), but:

    • they’re append/UPSERT-heavy, not query-heavy
    • query paths hit only the flattened tables, not the OLTP join graph
    • indexes are designed specifically for the reporting query patterns

v1 seam for scaling (intentionally built in)

Introduce a small abstraction:

  • ReportingStore interface (used by ReportingService handlers)

  • two implementations:

    • ReportingStorePrimary (default in v1)
    • ReportingStoreReplica (drop-in later)

Reporting uses ReportingStore; v1 targets the primary DB, and reads can be routed to a replica later via configuration without handler changes.
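
A sketch of that seam; the interface and implementation names come from this section, while the method, flattened-table name, and config fields are illustrative:

```typescript
import { Pool } from "pg";

interface EvaluationSummaryRow {
  evaluationVersionId: string;
  attemptsCompleted: number;
}

// ReportingService handlers depend only on this interface.
interface ReportingStore {
  getEvaluationSummary(evaluationVersionId: string): Promise<EvaluationSummaryRow>;
}

// Shared query logic: schema and SQL are identical, only the DSN differs.
abstract class PgReportingStore implements ReportingStore {
  constructor(protected readonly pool: Pool) {}
  async getEvaluationSummary(evaluationVersionId: string): Promise<EvaluationSummaryRow> {
    // Hypothetical flattened table name, for illustration only.
    const { rows } = await this.pool.query(
      `SELECT attempts_completed FROM reporting.report_evaluation_summary
        WHERE evaluation_version_id = $1`,
      [evaluationVersionId]
    );
    return { evaluationVersionId, attemptsCompleted: rows[0]?.attempts_completed ?? 0 };
  }
}

class ReportingStorePrimary extends PgReportingStore {} // v1 default
class ReportingStoreReplica extends PgReportingStore {} // drop-in later

// Replica routing becomes a config change, with no handler changes.
function makeReportingStore(cfg: { primaryDsn: string; replicaDsn?: string }): ReportingStore {
  return cfg.replicaDsn
    ? new ReportingStoreReplica(new Pool({ connectionString: cfg.replicaDsn }))
    : new ReportingStorePrimary(new Pool({ connectionString: cfg.primaryDsn }));
}
```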

When scale demands it, you only:

  • create a replica
  • point ReportingStoreReplica DSN at it
  • keep exact same schema + queries (replication copies the tables)

This avoids “architecture rewrite anxiety” without paying the full operational complexity on day 1.


3. What ships in v1

3.1 Central Reports Hub (minimal, high value)

  • Evaluation summary (access points, attempts completed, pass rate, avg/median score, completion time)
  • Session Attempt Report (per-session reconstruction: structure + tags + item attempts + telemetry rollups)
    • GET /session-attempts/{submissionId} (submission-based; attempt.id equals submission.id)
    • CSV export: GET /session-attempts/{submissionId}/export

Evaluation summary (v1)

GET /evaluation-summary

CSV export is available via GET /evaluation-summary/export (same filters; single-row CSV).

Version-scoped summary for a single evaluation_version_id (the v1 default).

Counters (no silent collapsing):

  • assigned_individuals_n → count of assignments with target type user_id or email
  • assigned_groups_n → count of group assignments
  • public_links_n → count of public-link assignments
  • assignments_n → total assignments (sum of the above)
  • unique_invitees_n → distinct target_ref for user_id/email assignments (lowercased)

Completion/graded metrics:

  • completions_n counts submissions with status = completed AND completed_at IS NOT NULL (attempt-based)
  • pass_rate = pass_n / (pass_n + fail_n), using submissions.outcome_code (attempt-based)
  • pass_n, fail_n, graded_n, ungraded_n are returned explicitly (attempt-based)
  • outcomes.outcomeKnownAttemptsN + outcomes.outcomeKnownRate show outcome coverage

Timing (v1):

  • completion time uses submissions.completed_at - submissions.started_at
  • v1 returns avg + median if available; no p90 yet
  • timing.timeKnownAttemptsN + timing.timeKnownRate show coverage (rate over completed attempts)

Integrity rollup (v1):

  • integrity.telemetryKnownAttempts + integrity.telemetryKnownRate show telemetry coverage
  • integrity.sessionsWithFocusLost / integrity.sessionsWithFullscreenExit count attempts with those signals
  • integrity.rateDenominator = attempts_completed (rates are computed over completed attempts)

Scores (v1):

  • scores.scoreKnownAttemptsN + scores.scoreKnownRate show score coverage (graded attempts only)
  • scores.scoreDenominator = attempts_graded
  • scores.scoreKnownAttemptsN == outcomes.attemptsGraded (hard invariant)
  • scores.histogramSpec documents bucket ranges + labels for the histogram (UI-ready)

Scores (provisional):

  • scoresProvisional uses scored attempts (score/max_score known), even if outcome is ungraded
  • scoresProvisional.scoreDenominator = attempts_scored
  • scoresProvisional.scoredButUngradedN surfaces graded gaps without frontend math
  • scoresProvisional.histogramSpec mirrors scores.histogramSpec
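
An illustrative TypeScript view of histogramSpec; the actual bucket ranges and labels are whatever the endpoint documents, the values below are placeholders:

```typescript
// Placeholder shape: the real bucket boundaries/labels come from the API response.
interface HistogramBucketSpec {
  label: string;   // UI-ready label, e.g. "0-49%"
  minPct: number;  // inclusive lower bound
  maxPct: number;  // exclusive upper bound
}

interface HistogramSpec {
  buckets: HistogramBucketSpec[];
  // scores.histogramSpec uses attempts_graded; scoresProvisional uses attempts_scored.
  denominator: "attempts_graded" | "attempts_scored";
}
```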

Filters:

  • required: evaluationVersionId
  • optional: runLabel, groupId (assignment target group), from, to (completed_at window)
  • optional: scope=allVersions (aggregate across all versions of the evaluation that owns the supplied evaluationVersionId)

UI-ready blocks (v1, blocks-only contract):

  • scope echoes the filter context (evaluation version, runLabel, groupId, from/to).
    • scope.filtersApplied mirrors the applied runLabel/groupId/from/to filters.
  • hero is the canonical header block (attempt completion + pass rates, score/timing).
    • hero.accessPoints mirrors population.accessPoints.
    • hero.completionRate only when population coverage is full; otherwise null.
    • hero.completionRateDenominator + hero.completionRateDisplay are always populated.
    • hero.provisionalScoreAvailable + hero.provisionalScoreNote surface provisional score gaps without UI logic.
    • hero.scoreModeDefault + hero.scoreShownDenominator make the default score view explicit.
  • population reports access-point counts and population coverage (full|partial|unknown).
  • attempts reports attempt counts + unique users (to make “multiple attempts” explicit).
  • distribution splits assignment targets + distributionCount.
  • funnel returns access points → started → completed with inProgress, expired, notStarted, and an attempt-based completionRate (denominator = attempts_started).
    • If population.coverage != full, funnel.notStarted is null with notStartedDisplay + notStartedReason.
  • outcomes.passRateDenominator = attempts_graded (attempt-based pass/fail counts).
  • scores includes a coarse histogram and scoreDenominator = attempts_graded (plus histogramSpec).
  • outcomes, scores, scoresProvisional, timing, activity group core metrics + time source + timing/score/outcome coverage.
  • attention includes metrics + thresholds for explainability (no frontend rules).
  • attention provides low/medium/high + reasons for quick triage.
  • actionsAllowed + links keep the frontend permission-free (links return route params, not URL strings).
  • completionMethods returns attempt counts by completion method plus per-method pass/fail rollups.
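
A trimmed, illustrative TypeScript rendering of the blocks-only contract above; nested shapes are sketches, not the exact wire format:

```typescript
interface EvaluationSummaryBlocks {
  scope: { evaluationVersionId: string; filtersApplied: Record<string, string | null> };
  hero: {
    completionRate: number | null;  // null unless population coverage is full
    completionRateDenominator: number;
    completionRateDisplay: string;  // always populated
    accessPoints: number;           // mirrors population.accessPoints
  };
  population: { accessPoints: number; coverage: "full" | "partial" | "unknown" };
  attempts: { attemptsN: number; uniqueUsersN: number };
  funnel: {
    accessPoints: number;
    started: number;
    completed: number;
    inProgress: number;
    expired: number;
    notStarted: number | null;      // null when population.coverage != full
    completionRate: number;         // denominator = attempts_started
  };
  outcomes: { passRateDenominator: number /* attempts_graded */ };
  // scores, scoresProvisional, timing, activity, attention, distribution,
  // completionMethods, actionsAllowed, links omitted for brevity
}
```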

Group list view (v1):

  • GET /evaluation-summary/groups
  • Cohort semantics are assignment-target groups only (no live membership expansion).
  • Returns a thin DTO by default (hero + attempts + outcomes + attention) with optional include= for scores, timing, distributions, tags, sections.

Cohort list view (v1):

  • GET /evaluation-summary/cohorts
  • Cohorts are run-label buckets (assignment/submission run_label), not dynamic membership groups.
  • Returns a thin DTO by default (hero + attempts + outcomes + attention) with optional include= for scores, timing, distributions, tags, sections.
  • Unlabeled attempts appear with runLabel = null and label = "Unlabeled" in the response.

Group list export (v1):

  • GET /evaluation-summary/groups/export
  • CSV export for the group list, using the same filters as the list endpoint.

Cohort list export (v1):

  • GET /evaluation-summary/cohorts/export
  • CSV export for the cohort list, using the same filters as the list endpoint.

Session attempt report (v1)

GET /session-attempts/{submissionId}

Per-submission reconstruction for admins:

  • summary is UI-ready (duration, total/answered/omitted, score pct, attempt status, completion method, latest score version).
  • timing includes started/ended timestamps, duration, duration source, total time-on-items, and an idle estimate (max(durationMs - totalTimeOnItemsMs, 0) when the duration source is known; sketched after this list).
  • integrity includes derived flags + plain-English summary text + attention level + telemetry counts + computed metrics.
  • actionsAllowed + links provide UI-ready affordances (remediation, evaluation summary, submission, user profile) without frontend capability checks.
  • tagAggregates, tagKeySummaries, and sectionAggregates provide per-attempt rollups (facility, omit, time, avg score).
  • submission includes immutable versionSnapshot (structure + tags) plus score/outcome (no sessionId field in this report).
  • session is attempt context (submission-based); session.id equals submission.id and includes timings (startedAt, endedAt, lastActivityAt), run label, and overrides.
  • items include answer payload, scoring fields, timing, and UI helpers (isOmitted, selectedChoiceId, displayAnswer, tagList, stemText, choice texts, scoring explanation).
  • integrity.telemetrySummary aggregates counts from reporting.submission_telemetry_summary (durable, no raw events).
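
A minimal sketch of the idle-estimate rule from the timing bullet above (the helper name is ours):

```typescript
// Idle estimate: only meaningful when the duration source is known.
function idleEstimateMs(
  durationMs: number | null,
  totalTimeOnItemsMs: number,
  durationSourceKnown: boolean
): number | null {
  if (durationMs === null || !durationSourceKnown) return null;
  return Math.max(durationMs - totalTimeOnItemsMs, 0);
}
```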

Rules:

  • Uses submissions.version_snapshot as canonical structure (no joins to live authoring).
  • No raw telemetry events in v1; rollups only.
  • Guarded by reporting.view_named.

3.2 Contextual Intelligence (in-workflow)

  • Question Health (Premium) in the Question Bank: Health badges + drilldown per question_version_id, powered by flattened attempt facts and rollups (facility, omit, time, option intel, drift, exposure, integrity signals). Uses immutable snapshots + structured tags for defensibility.

4. Reporting read-model strategy (v1)

4.1 Projection model

A background worker consumes events (submission finalised, remediation applied) and projects into reporting tables; a write-path sketch follows the list below.

  • Source of truth:

    • submissions.version_snapshot
    • submission_items
    • “operational truth” score selection from latest score version semantics
  • Admin/backfill trigger:

    • POST /api/v1/reporting/projections enqueues submission/range/remediation/telemetry jobs (guarded by reporting.view_named).
  • Durability rule:

    • Reporting read-model tables are submission-based and must not depend on delivery_sessions (no joins/FKs).
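
A sketch of the projection write path, assuming node-postgres; the column subset follows section 5.4.4 A, and idempotency comes from upserting on the natural key:

```typescript
import { PoolClient } from "pg";

// One row per item attempt, keyed by submission_item_id (see 5.4.4 A).
interface ItemAttemptFact {
  tenantId: string;
  orgUnitId: string;
  submissionId: string;
  submissionItemId: string;
  questionVersionId: string;
  isCorrect: boolean | null;
  scoreAwarded: number | null;
}

// Idempotent projection: re-projecting after remediation overwrites, never duplicates.
// Must run inside a TxManager-scoped transaction (SET LOCAL scope vars).
async function projectItemAttempt(client: PoolClient, f: ItemAttemptFact): Promise<void> {
  await client.query(
    `INSERT INTO reporting.report_item_attempts_flat
       (tenant_id, org_unit_id, submission_id, submission_item_id,
        question_version_id, is_correct, score_awarded)
     VALUES ($1, $2, $3, $4, $5, $6, $7)
     ON CONFLICT (submission_item_id) DO UPDATE SET
       is_correct    = EXCLUDED.is_correct,
       score_awarded = EXCLUDED.score_awarded`,
    [f.tenantId, f.orgUnitId, f.submissionId, f.submissionItemId,
     f.questionVersionId, f.isCorrect, f.scoreAwarded]
  );
}
```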

4.2 “No PII denormalisation” rule (compliance alignment)

Reporting tables may store:

  • user_id (OK)

They must not store:

  • name, email, or users.metadata copies

Any UI that needs display identity joins JIT to users (which is already scrubbed by compliance).
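
For example, a display-identity read might look like the following (illustrative SQL embedded in TypeScript; the users column names are assumptions):

```typescript
// Identity is joined JIT, never denormalised, so a Hybrid Scrub of users
// is reflected in reporting output automatically.
const attemptDisplayIdentitySql = `
  SELECT f.submission_id,
         f.score_awarded,
         u.name,   -- already scrubbed if the user was forgotten
         u.email
    FROM reporting.report_item_attempts_flat AS f
    JOIN users AS u ON u.id = f.user_id
   WHERE f.submission_id = $1
`;
```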


5.4 Contextual Intelligence v1 — Question Health (Premium)

5.4.1 Purpose

Give Authors/Admins an at-a-glance view of whether questions are behaving well, plus a drilldown that feels “enterprise” without claiming full psychometrics.

Why it’s safe in Evalium:

  • version_snapshot freezes the delivered structure + tags
  • scoring/correctness are computed at submit-time
  • integrity signals can be added from durable submission telemetry summaries when available

5.4.2 Unit of analysis

All metrics computed per question_version_id.

5.4.3 Data sources

Primary (defensible)

  • submissions.version_snapshot → tags + delivered structure
  • submission_items (responses + timestamps)
  • latest score semantics for post-remediation operational truth

Typed analysis (MCQ/MRQ)

  • reporting.report_item_attempt_choice (selected options, patterns, partial credit flags)

5.4.4 Read-model tables (v1)

A) reporting.report_item_attempts_flat (fact table)

1 row per item attempt (keyed by submission_item_id), written by projection worker.

Minimum columns:

  • tenant_id, org_unit_id (RLS scoped)
  • submission_id, submission_item_id, user_id
  • completed_at
  • evaluation_id, evaluation_version_id
  • run_label
  • question_version_id
  • qtype
  • is_correct, score_awarded, max_score
  • score_status, score_method, invalid_code
  • selected_option (MCQ only; nullable)
  • choice_ids (raw array; nullable to keep MCQ-multi flexible later)
  • is_omitted, omit_reason
  • time_on_item_ms (nullable; prefer time_spent_ms, then answered_at - started_at)
  • structured_tags jsonb (copied from snapshot for fast slicing)

Rule: never re-join to live authoring tables to rebuild tags/structure.
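
A possible migration sketch for the fact table; column types and the index are assumptions, while the column inventory mirrors the list above:

```typescript
// Assumed types throughout; adjust to the real OLTP schema.
export const createReportItemAttemptsFlatSql = `
CREATE TABLE IF NOT EXISTS reporting.report_item_attempts_flat (
  tenant_id             uuid NOT NULL,
  org_unit_id           uuid NOT NULL,
  submission_id         uuid NOT NULL,
  submission_item_id    uuid PRIMARY KEY,  -- 1 row per item attempt
  user_id               uuid NOT NULL,
  completed_at          timestamptz,
  evaluation_id         uuid NOT NULL,
  evaluation_version_id uuid NOT NULL,
  run_label             text,
  question_version_id   uuid NOT NULL,
  qtype                 text NOT NULL,
  is_correct            boolean,
  score_awarded         numeric,
  max_score             numeric,
  score_status          text,
  score_method          text,
  invalid_code          text,
  selected_option       text,    -- MCQ only
  choice_ids            text[],  -- raw array; keeps MCQ-multi flexible
  is_omitted            boolean NOT NULL DEFAULT false,
  omit_reason           text,
  time_on_item_ms       integer,
  structured_tags       jsonb    -- copied from version_snapshot, never re-joined
);

-- Illustrative index for the Question Health query pattern.
CREATE INDEX IF NOT EXISTS report_item_attempts_flat_qv_idx
  ON reporting.report_item_attempts_flat
     (tenant_id, org_unit_id, question_version_id, completed_at);
`;
```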

B) reporting.report_item_attempt_choice (typed facts)

1 row per item attempt (keyed by submission_item_id), used for MCQ/MRQ analysis panels.

Minimum columns:

  • tenant_id, org_unit_id
  • submission_item_id
  • question_version_id, completed_at
  • selected_option_ids (array)
  • response_pattern_hash (nullable; MRQ patterns)
  • is_full_credit, is_partial_credit
  • correct_selected, incorrect_selected, total_correct
  • constraint_violation_code (nullable)

C) reporting.report_question_health_rollup (Question Bank UI)

1 row per question_version_id, incrementally updated.

Minimum columns:

  • keys: (tenant_id, org_unit_id, question_version_id)
  • counts: attempts_n, correct_n, omitted_n
  • status counts: scored_n, pending_n, invalid_n, exempt_n
  • facility: facility_pct
  • time: avg_time_ms, median_time_ms, p90_time_ms
  • scoring: avg_score_awarded, avg_score_pct
  • option intel: option_counts jsonb, option_pct jsonb
  • drift: facility_7d_pct, facility_30d_pct, facility_last_run_pct
  • exposure: evaluations_count, recent_runs jsonb
  • badges: health_status, confidence_level, top_reasons jsonb, flags jsonb
  • last_computed_at

Facility definition: facility uses scored attempts only (correct / scored where score_status = SCORED).

Percentiles note (pragmatic “premium”):

  • Don’t try to maintain exact percentiles incrementally on every event.
  • Store avg_time_ms incrementally.
  • Recompute median_time_ms / p90_time_ms via a low-frequency job (e.g., hourly) per tenant/org “dirty set” of question versions.
  • For manual runs, use POST /api/v1/reporting/percentiles/recompute (guarded by reporting.view_named).

That keeps the UI premium without making projections expensive.
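
A sketch of that low-frequency job, assuming node-postgres; percentile_cont is stock Postgres, and the dirty-set plumbing is elided:

```typescript
import { PoolClient } from "pg";

// Recompute exact median/p90 time for the dirty question versions of one
// tenant/org, rather than maintaining percentiles on every event.
async function recomputeTimePercentiles(
  client: PoolClient,            // must be TxManager/RLS scoped
  dirtyQuestionVersionIds: string[]
): Promise<void> {
  await client.query(
    `UPDATE reporting.report_question_health_rollup r
        SET median_time_ms   = s.median_ms::bigint,
            p90_time_ms      = s.p90_ms::bigint,
            last_computed_at = now()
       FROM (
         SELECT question_version_id,
                percentile_cont(0.5) WITHIN GROUP (ORDER BY time_on_item_ms) AS median_ms,
                percentile_cont(0.9) WITHIN GROUP (ORDER BY time_on_item_ms) AS p90_ms
           FROM reporting.report_item_attempts_flat
          WHERE question_version_id = ANY($1)
            AND time_on_item_ms IS NOT NULL
          GROUP BY question_version_id
       ) s
      WHERE r.question_version_id = s.question_version_id`,
    [dirtyQuestionVersionIds]
  );
}
```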

D) Optional: reporting.report_question_health_by_run

Keyed by (tenant_id, org_unit_id, question_version_id, run_label):

  • attempts_n, facility_pct, omit_rate, median_time_ms, option_pct jsonb

This makes drift/cohort comparisons instant.

5.4.5 Projection + update semantics

Events

  • On submission finalisation: project attempt rows + update rollups.
  • On remediation apply: re-project affected submissions so rollups reflect latest operational truth.

Operational truth vs snapshot truth

  • Health v1 defaults to operational truth (latest score).
  • Later you can add an “audit mode” toggle that pins to the score version at time-of-attempt (not needed in v1).
  • Scoring aggregates (meanScorePct, histograms, trends) use score_status = SCORED only.
    • Omitted attempts are treated as EXEMPT and are excluded from scored aggregates by default.

5.4.6 Premium heuristics (explicitly not psychometrics)

Confidence levels

  • LOW: N < 30
  • MED: 30 ≤ N < 100
  • HIGH: N ≥ 100
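
Equivalently, in code (thresholds exactly as above):

```typescript
type Confidence = "LOW" | "MED" | "HIGH";

function confidenceLevel(n: number): Confidence {
  if (n >= 100) return "HIGH";
  if (n >= 30) return "MED";
  return "LOW";
}
```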

Core flags

(activate only when confidence is MED/HIGH unless stated otherwise)

  • TOO_EASY: facility ≥ 0.90 and N ≥ 30

  • TOO_HARD: facility ≤ 0.20 and N ≥ 30

  • HIGH_OMIT: omit ≥ 0.10 and N ≥ 30

  • TIME_SPIKE: median_time ≥ (baseline × 2) and N ≥ 30

    • baseline v1 fallback: compare 7d vs 30d for the same question_version_id

MCQ option intelligence flags

  • NON_FUNCTIONING_DISTRACTOR: distractor < 2% and N ≥ 50
  • DISTRACTOR_DOMINANCE: one wrong option ≥ 50% and facility ≤ 50% and N ≥ 50
  • SPLIT_DISTRACTORS: two wrong options ≥ 25% each and facility ≤ 60% and N ≥ 50

Drift flags

  • FACILITY_DRIFT_UP: last_run − 30d ≥ +0.20 and attempts_last_run ≥ 20
  • FACILITY_DRIFT_DOWN: last_run − 30d ≤ −0.20 and attempts_last_run ≥ 20

Integrity signal (optional but premium)

  • HIGH_FOCUS_LOST_RATE: sessions with focus-lost / attempts ≥ 0.25 and N ≥ 50
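
A sketch of confidence-gated evaluation for the core flags; field names are illustrative (facility and rates as 0..1 fractions), and the TIME_SPIKE baseline is simplified to the 30d median per the v1 fallback:

```typescript
interface RollupSlice {
  attemptsN: number;                   // N
  facility: number;                    // 0..1 fraction of scored attempts
  omitRate: number;                    // 0..1
  medianTimeMs: number | null;         // current (e.g. 7d) median
  baselineMedianTimeMs: number | null; // 30d median used as baseline
}

// Core flags only; option-intelligence, drift, and integrity flags follow the
// same shape with their own thresholds and N >= 50 / N >= 20 gates.
function coreFlags(s: RollupSlice): string[] {
  const flags: string[] = [];
  if (s.attemptsN < 30) return flags; // confidence gate
  if (s.facility >= 0.9) flags.push("TOO_EASY");
  if (s.facility <= 0.2) flags.push("TOO_HARD");
  if (s.omitRate >= 0.1) flags.push("HIGH_OMIT");
  if (
    s.medianTimeMs !== null &&
    s.baselineMedianTimeMs !== null &&
    s.medianTimeMs >= s.baselineMedianTimeMs * 2
  ) {
    flags.push("TIME_SPIKE");
  }
  return flags;
}
```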

5.4.7 UI-ready API contracts (BFF)

List (Question Bank overlay)

GET /question-health

Supports:

  • tag filters from snapshot tags (topic/skill/difficulty…)
  • time range, evaluation_version_id, run_label
  • sorting: needs_attention_first, highest_omit, most_drift, most_exposed

Returns per row:

  • questionVersionId
  • healthBadge { status, confidence, topReasons[] }
  • core { attempts, scoredAttempts, correct, omitted, facilityPct, omitRate, medianTimeMs, statusCounts, invalidRate }
  • analysis.choiceSingle.topOption (MCQ single only)

Export (CSV)

GET /question-health/export

Same filters as the list endpoint. Returns a ZIP bundle for offline analysis (no PII; uses rollup data only).

Bundle contents:

  • question_health_core.csv (always)
  • question_health_choice_single.csv + choice_single_option_breakdown.csv (MCQ single only)
  • question_health_choice_multi.csv + choice_multi_option_breakdown.csv + choice_multi_pattern_breakdown.csv + choice_multi_constraint_breakdown.csv (MRQ only)
  • manifest.json (export version + file row counts)

Detail (Drilldown)

GET /question-versions/{questionVersionId}/health

Returns:

  • core rollup metrics (counts, timing, scoring, statusCounts, exposure, drift, badges)
  • analysis.choiceSingle (MCQ single option distribution)
  • analysis.choiceMulti (MRQ inclusion rates, patterns, partial credit histogram, error modes, constraint violations)
  • per-run comparisons (if by_run table exists)
  • drift timeline buckets (weekly)
  • exposure summary

Denominator rules (explicit):

  • facilityPct = full-credit count / scoredAttempts
  • avgScorePct = mean over scoredAttempts
  • optionPct + topPatterns.pct = share of scoredAttempts
  • omitRate = exempt / attempts
  • invalidRate = invalid / attempts

MRQ errorModes note: counts may overlap (a single attempt can increment multiple error modes).

Permissions

Treat as reporting-grade:

  • reporting.view for list/detail (consistent with reporting capability tightening already in the system)

6. Report catalogue mapping

v1 Contextual Intelligence

  • Question Bank Health (Premium) (this section)

Deferred psychometrics

  • discrimination indices
  • reliability coefficients (alpha/KR-20)
  • IRT

7. Compliance integration (closing the ADR-0011 gap)

Reporting adds new tables containing user_id, therefore:

  • “Forget User (Hybrid Scrub)” does not delete these reporting rows (they contain no direct identifiers).
  • Any DSAR/export or UI identity display joins to users at query time, so scrubbed identity is respected.
  • If any reporting table ever stores free-text fields (avoid in v1), it must participate in the same redaction approach used for submission answer JSON.

8. Acceptance criteria (add Question Health)

  • Question Bank list shows Health badges per question version with:

    • facility %, omit rate, median time, N, top reason
  • Drilldown provides:

    • option distribution (MCQ), drift vs run_label / time buckets, exposure
  • Heuristic flags:

    • confidence-gated (MED/HIGH)
    • labelled as “heuristic” (not psychometrics)
  • Remediation re-projection updates rollups correctly without duplication.

  • All reads/writes are TxManager + RLS scoped.

  • Reporting uses ReportingStore; v1 targets primary DB, replica routing is a config change.

  • Reporting tables do not denormalise user PII (no names/emails/metadata copies).