Reporting v1 Implementation Spec — Update: Premium Question Health

0. Purpose & non-goals

Purpose

Deliver sub-second, sliceable reporting for SMB admins without compromising auditability, by projecting immutable attempt data into flattened, RLS-scoped read-model tables, and exposing UI-ready BFF endpoints.

Non-goals (v1)

  • Full psychometrics (discrimination indices, alpha/KR-20, IRT)
  • External warehouse / dedicated analytics DB
  • Multi-org “as-of” identity repair workflows (parked)

1. Core invariants (already true in Evalium)

  1. version_snapshot is canonical for reporting. Never rebuild structure/tags by joining live authoring tables.

  2. Write-time metrics. Correctness/scoring are computed at submit-time, not report-time.

  3. RLS + TxManager scope injection is mandatory. All reporting reads/writes must run inside TxManager transactions with SET LOCAL scope variables (see the sketch after this list).

  4. Remediation is ledger-like + idempotent. Scores evolve via append-only score versions + “latest” pointer semantics, and apply is idempotent.

  5. Compliance uses anonymisation (Hybrid Scrub). User identity is scrubbed in users, submissions keep user_id, and submission answer JSON may be redacted in-place for free-text fields.
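
A minimal sketch of invariant 3, assuming node-postgres; the real TxManager API and the scope variable names (app.tenant_id etc.) are assumptions and may differ:

```typescript
import { Pool, PoolClient } from "pg";

// Hypothetical scope shape; field names illustrate the pattern, not the real TxManager contract.
interface ReportingScope {
  tenantId: string;
  orgUnitId: string;
}

const pool = new Pool();

// Every reporting read/write runs inside a transaction that first injects RLS scope.
// set_config(..., true) is transaction-local, i.e. equivalent to SET LOCAL.
async function withReportingScope<T>(
  scope: ReportingScope,
  fn: (client: PoolClient) => Promise<T>
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [scope.tenantId]);
    await client.query("SELECT set_config('app.org_unit_id', $1, true)", [scope.orgUnitId]);
    const result = await fn(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```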


2. Architecture decision: “calmer middle ground” (no separate reporting DB in v1)

v1 deployment

  • Reporting read-model tables live in the primary Postgres (same cluster), but:

    • they’re append/UPSERT-heavy, not query-heavy
    • query paths hit only the flattened tables, not the OLTP join graph
    • indexes are designed specifically for the reporting query patterns

v1 seam for scaling (intentionally built in)

Introduce a small abstraction:

  • ReportingStore interface (used by ReportingService handlers)

  • two implementations:

    • ReportingStorePrimary (default in v1)
    • ReportingStoreReplica (drop-in later)

Reporting uses ReportingStore; v1 targets the primary DB, and reads can be routed to a replica later via configuration without handler changes.
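
A sketch of that seam; the interface and implementation names come from this section, while the method, flattened-table name, and config fields are illustrative:

```typescript
import { Pool } from "pg";

interface EvaluationSummaryRow {
  evaluationVersionId: string;
  attemptsCompleted: number;
}

// ReportingService handlers depend only on this interface.
interface ReportingStore {
  getEvaluationSummary(evaluationVersionId: string): Promise<EvaluationSummaryRow>;
}

// Shared query logic: schema and SQL are identical, only the DSN differs.
abstract class PgReportingStore implements ReportingStore {
  constructor(protected readonly pool: Pool) {}
  async getEvaluationSummary(evaluationVersionId: string): Promise<EvaluationSummaryRow> {
    // Hypothetical flattened table name, for illustration only.
    const { rows } = await this.pool.query(
      `SELECT attempts_completed FROM reporting.report_evaluation_summary
        WHERE evaluation_version_id = $1`,
      [evaluationVersionId]
    );
    return { evaluationVersionId, attemptsCompleted: rows[0]?.attempts_completed ?? 0 };
  }
}

class ReportingStorePrimary extends PgReportingStore {} // v1 default
class ReportingStoreReplica extends PgReportingStore {} // drop-in later

// Replica routing becomes a config change, with no handler changes.
function makeReportingStore(cfg: { primaryDsn: string; replicaDsn?: string }): ReportingStore {
  return cfg.replicaDsn
    ? new ReportingStoreReplica(new Pool({ connectionString: cfg.replicaDsn }))
    : new ReportingStorePrimary(new Pool({ connectionString: cfg.primaryDsn }));
}
```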

When scale demands it, you only:

  • create a replica
  • point ReportingStoreReplica DSN at it
  • keep exact same schema + queries (replication copies the tables)

This avoids “architecture rewrite anxiety” without paying the full operational complexity on day 1.


3. What ships in v1

3.1 Central Reports Hub (minimal, high value)

  • Evaluation summary (access points, attempts completed, pass rate, avg/median score, completion time)
  • Session Attempt Report (per-session reconstruction: structure + tags + item attempts + telemetry rollups)
    • GET /session-attempts/{submissionId} (submission-based; attempt.id equals submission.id)
    • CSV export: GET /session-attempts/{submissionId}/export

Evaluation summary (v1)

GET /evaluation-summary

CSV export is available via GET /evaluation-summary/export (same filters; single-row CSV).

Version-scoped summary for a single evaluation_version_id (the v1 default).

Counters (no silent collapsing):

  • assigned_individuals_n → count of assignments with target type user_id or email
  • assigned_groups_n → count of group assignments
  • public_links_n → count of public-link assignments
  • assignments_n → total assignments (sum of the above)
  • unique_invitees_n → distinct target_ref for user_id/email assignments (lowercased)

Completion/graded metrics:

  • completions_n counts submissions with status = completed AND completed_at IS NOT NULL (attempt-based)
  • pass_rate = pass_n / (pass_n + fail_n), using submissions.outcome_code (attempt-based)
  • pass_n, fail_n, graded_n, ungraded_n are returned explicitly (attempt-based)
  • outcomes.outcomeKnownAttemptsN + outcomes.outcomeKnownRate show outcome coverage

Timing (v1):

  • completion time uses submissions.completed_at - submissions.started_at
  • v1 returns avg + median if available; no p90 yet
  • timing.timeKnownAttemptsN + timing.timeKnownRate show coverage (rate over completed attempts)

Integrity rollup (v1):

  • integrity.telemetryKnownAttempts + integrity.telemetryKnownRate show telemetry coverage
  • integrity.sessionsWithFocusLost / integrity.sessionsWithFullscreenExit count attempts with those signals
  • integrity.rateDenominator = attempts_completed (rates are computed over completed attempts)

Scores (v1):

  • scores.scoreKnownAttemptsN + scores.scoreKnownRate show score coverage (graded attempts only)
  • scores.scoreDenominator = attempts_graded
  • scores.scoreKnownAttemptsN == outcomes.attemptsGraded (hard invariant)
  • scores.histogramSpec documents bucket ranges + labels for the histogram (UI-ready)

Scores (provisional):

  • scoresProvisional uses scored attempts (score/max_score known), even if outcome is ungraded
  • scoresProvisional.scoreDenominator = attempts_scored
  • scoresProvisional.scoredButUngradedN surfaces graded gaps without frontend math
  • scoresProvisional.histogramSpec mirrors scores.histogramSpec
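
An illustrative TypeScript view of histogramSpec; the actual bucket ranges and labels are whatever the endpoint documents, the values below are placeholders:

```typescript
// Placeholder shape: the real bucket boundaries/labels come from the API response.
interface HistogramBucketSpec {
  label: string;   // UI-ready label, e.g. "0-49%"
  minPct: number;  // inclusive lower bound
  maxPct: number;  // exclusive upper bound
}

interface HistogramSpec {
  buckets: HistogramBucketSpec[];
  // scores.histogramSpec uses attempts_graded; scoresProvisional uses attempts_scored.
  denominator: "attempts_graded" | "attempts_scored";
}
```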

Filters:

  • required: evaluationVersionId
  • optional: runLabel, groupId (assignment target group), from, to (completed_at window)
  • optional: scope=allVersions (aggregate across all versions of the evaluation that owns the supplied evaluationVersionId)

UI-ready blocks (v1, blocks-only contract):

  • scope echoes the filter context (evaluation version, runLabel, groupId, from/to).
    • scope.filtersApplied mirrors the applied runLabel/groupId/from/to filters.
  • hero is the canonical header block (attempt completion + pass rates, score/timing).
    • hero.accessPoints mirrors population.accessPoints.
    • hero.completionRate only when population coverage is full; otherwise null.
    • hero.completionRateDenominator + hero.completionRateDisplay are always populated.
    • hero.provisionalScoreAvailable + hero.provisionalScoreNote surface provisional score gaps without UI logic.
    • hero.scoreModeDefault + hero.scoreShownDenominator make the default score view explicit.
  • population reports access-point counts and population coverage (full|partial|unknown).
  • attempts reports attempt counts + unique users (to make “multiple attempts” explicit).
  • distribution splits assignment targets + distributionCount.
  • funnel returns access points → started → completed with inProgress, expired, notStarted, and an attempt-based completionRate (denominator = attempts_started).
    • If population.coverage != full, funnel.notStarted is null with notStartedDisplay + notStartedReason.
  • outcomes.passRateDenominator = attempts_graded (attempt-based pass/fail counts).
  • scores includes a coarse histogram and scoreDenominator = attempts_graded (plus histogramSpec).
  • outcomes, scores, scoresProvisional, timing, activity group core metrics + time source + timing/score/outcome coverage.
  • attention includes metrics + thresholds for explainability (no frontend rules).
  • attention provides low/medium/high + reasons for quick triage.
  • actionsAllowed + links keep the frontend permission-free (links return route params, not URL strings).
  • completionMethods returns attempt counts by completion method plus per-method pass/fail rollups.
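
A trimmed, illustrative TypeScript rendering of the blocks-only contract above; nested shapes are sketches, not the exact wire format:

```typescript
interface EvaluationSummaryBlocks {
  scope: { evaluationVersionId: string; filtersApplied: Record<string, string | null> };
  hero: {
    completionRate: number | null;  // null unless population coverage is full
    completionRateDenominator: number;
    completionRateDisplay: string;  // always populated
    accessPoints: number;           // mirrors population.accessPoints
  };
  population: { accessPoints: number; coverage: "full" | "partial" | "unknown" };
  attempts: { attemptsN: number; uniqueUsersN: number };
  funnel: {
    accessPoints: number;
    started: number;
    completed: number;
    inProgress: number;
    expired: number;
    notStarted: number | null;      // null when population.coverage != full
    completionRate: number;         // denominator = attempts_started
  };
  outcomes: { passRateDenominator: number /* attempts_graded */ };
  // scores, scoresProvisional, timing, activity, attention, distribution,
  // completionMethods, actionsAllowed, links omitted for brevity
}
```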

Group list view (v1):

  • GET /evaluation-summary/groups
  • Cohort semantics are assignment-target groups only (no live membership expansion).
  • Returns a thin DTO by default (hero + attempts + outcomes + attention) with optional include= for scores, timing, distributions, tags, sections.

Cohort list view (v1):

  • GET /evaluation-summary/cohorts
  • Cohorts are run-label buckets (assignment/submission run_label), not dynamic membership groups.
  • Returns a thin DTO by default (hero + attempts + outcomes + attention) with optional include= for scores, timing, distributions, tags, sections.
  • Unlabeled attempts appear with runLabel = null and label = "Unlabeled" in the response.

Group list export (v1):

  • GET /evaluation-summary/groups/export
  • CSV export for the group list, using the same filters as the list endpoint.

Cohort list export (v1):

  • GET /evaluation-summary/cohorts/export
  • CSV export for the cohort list, using the same filters as the list endpoint.

Session attempt report (v1)

GET /session-attempts/{submissionId}

Per-submission reconstruction for admins:

  • summary is UI-ready (duration, total/answered/omitted, score pct, attempt status, completion method, latest score version).
  • timing includes started/ended timestamps, duration, duration source, total time-on-items, and an idle estimate (max(durationMs - totalTimeOnItemsMs, 0) when the duration source is known; sketched after this list).
  • integrity includes derived flags + plain-English summary text + attention level + telemetry counts + computed metrics.
  • actionsAllowed + links provide UI-ready affordances (remediation, evaluation summary, submission, user profile) without frontend capability checks.
  • tagAggregates, tagKeySummaries, and sectionAggregates provide per-attempt rollups (facility, omit, time, avg score).
  • submission includes immutable versionSnapshot (structure + tags) plus score/outcome (no sessionId field in this report).
  • session is attempt context (submission-based); session.id equals submission.id and includes timings (startedAt, endedAt, lastActivityAt), run label, and overrides.
  • items include answer payload, scoring fields, timing, and UI helpers (isOmitted, selectedChoiceId, displayAnswer, tagList, stemText, choice texts, scoring explanation).
  • integrity.telemetrySummary aggregates counts from reporting.submission_telemetry_summary (durable, no raw events).
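
A minimal sketch of the idle-estimate rule from the timing bullet above (the helper name is ours):

```typescript
// Idle estimate: only meaningful when the duration source is known.
function idleEstimateMs(
  durationMs: number | null,
  totalTimeOnItemsMs: number,
  durationSourceKnown: boolean
): number | null {
  if (durationMs === null || !durationSourceKnown) return null;
  return Math.max(durationMs - totalTimeOnItemsMs, 0);
}
```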

Rules:

  • Uses submissions.version_snapshot as canonical structure (no joins to live authoring).
  • No raw telemetry events in v1; rollups only.
  • Guarded by reporting.view_named.

3.2 Contextual Intelligence (in-workflow)

  • Question Health (Premium) in the Question Bank: Health badges + drilldown per question_version_id, powered by flattened attempt facts and rollups (facility, omit, time, option intel, drift, exposure, integrity signals). Uses immutable snapshots + structured tags for defensibility.

4. Reporting read-model strategy (v1)

4.1 Projection model

A background worker consumes events (submission finalised, remediation applied) and projects into reporting tables; a write-path sketch follows the list below.

  • Source of truth:

    • submissions.version_snapshot
    • submission_items
    • “operational truth” score selection from latest score version semantics
  • Admin/backfill trigger:

    • POST /api/v1/reporting/projections enqueues submission/range/remediation/telemetry jobs (guarded by reporting.view_named).
  • Durability rule:

    • Reporting read-model tables are submission-based and must not depend on delivery_sessions (no joins/FKs).
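
A sketch of the projection write path, assuming node-postgres; the column subset follows section 5.4.4 A, and idempotency comes from upserting on the natural key:

```typescript
import { PoolClient } from "pg";

// One row per item attempt, keyed by submission_item_id (see 5.4.4 A).
interface ItemAttemptFact {
  tenantId: string;
  orgUnitId: string;
  submissionId: string;
  submissionItemId: string;
  questionVersionId: string;
  isCorrect: boolean | null;
  scoreAwarded: number | null;
}

// Idempotent projection: re-projecting after remediation overwrites, never duplicates.
// Must run inside a TxManager-scoped transaction (SET LOCAL scope vars).
async function projectItemAttempt(client: PoolClient, f: ItemAttemptFact): Promise<void> {
  await client.query(
    `INSERT INTO reporting.report_item_attempts_flat
       (tenant_id, org_unit_id, submission_id, submission_item_id,
        question_version_id, is_correct, score_awarded)
     VALUES ($1, $2, $3, $4, $5, $6, $7)
     ON CONFLICT (submission_item_id) DO UPDATE SET
       is_correct    = EXCLUDED.is_correct,
       score_awarded = EXCLUDED.score_awarded`,
    [f.tenantId, f.orgUnitId, f.submissionId, f.submissionItemId,
     f.questionVersionId, f.isCorrect, f.scoreAwarded]
  );
}
```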

4.2 “No PII denormalisation” rule (compliance alignment)

Reporting tables may store:

  • user_id (OK)

They must not store:

  • name, email, or users.metadata copies

Any UI that needs display identity joins JIT to users (which is already scrubbed by compliance).
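
For example, a display-identity read might look like the following (illustrative SQL embedded in TypeScript; the users column names are assumptions):

```typescript
// Identity is joined JIT, never denormalised, so a Hybrid Scrub of users
// is reflected in reporting output automatically.
const attemptDisplayIdentitySql = `
  SELECT f.submission_id,
         f.score_awarded,
         u.name,   -- already scrubbed if the user was forgotten
         u.email
    FROM reporting.report_item_attempts_flat AS f
    JOIN users AS u ON u.id = f.user_id
   WHERE f.submission_id = $1
`;
```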


5.4 Contextual Intelligence v1 — Question Health (Premium)

5.4.1 Purpose

Give Authors/Admins an at-a-glance view of whether questions are behaving well, plus a drilldown that feels “enterprise” without claiming full psychometrics.

Why it’s safe in Evalium:

  • version_snapshot freezes the delivered structure + tags
  • scoring/correctness are computed at submit-time
  • integrity signals can be added from durable submission telemetry summaries when available

5.4.2 Unit of analysis

All metrics computed per question_version_id.

5.4.3 Data sources

Primary (defensible)

  • submissions.version_snapshot → tags + delivered structure
  • submission_items (responses + timestamps)
  • latest score semantics for post-remediation operational truth

Typed analysis (MCQ/MRQ)

  • reporting.report_item_attempt_choice (selected options, patterns, partial credit flags)

5.4.4 Read-model tables (v1)

A) reporting.report_item_attempts_flat (fact table)

1 row per item attempt (keyed by submission_item_id), written by projection worker.

Minimum columns:

  • tenant_id, org_unit_id (RLS scoped)
  • submission_id, submission_item_id, user_id
  • completed_at
  • evaluation_id, evaluation_version_id
  • run_label
  • question_version_id
  • qtype
  • is_correct, score_awarded, max_score
  • score_status, score_method, invalid_code
  • selected_option (MCQ only; nullable)
  • choice_ids (raw array; nullable to keep MCQ-multi flexible later)
  • is_omitted, omit_reason
  • time_on_item_ms (nullable; prefer time_spent_ms, then answered_at - started_at)
  • structured_tags jsonb (copied from snapshot for fast slicing)

Rule: never re-join to live authoring tables to rebuild tags/structure.
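
A possible migration sketch for the fact table; column types and the index are assumptions, while the column inventory mirrors the list above:

```typescript
// Assumed types throughout; adjust to the real OLTP schema.
export const createReportItemAttemptsFlatSql = `
CREATE TABLE IF NOT EXISTS reporting.report_item_attempts_flat (
  tenant_id             uuid NOT NULL,
  org_unit_id           uuid NOT NULL,
  submission_id         uuid NOT NULL,
  submission_item_id    uuid PRIMARY KEY,  -- 1 row per item attempt
  user_id               uuid NOT NULL,
  completed_at          timestamptz,
  evaluation_id         uuid NOT NULL,
  evaluation_version_id uuid NOT NULL,
  run_label             text,
  question_version_id   uuid NOT NULL,
  qtype                 text NOT NULL,
  is_correct            boolean,
  score_awarded         numeric,
  max_score             numeric,
  score_status          text,
  score_method          text,
  invalid_code          text,
  selected_option       text,    -- MCQ only
  choice_ids            text[],  -- raw array; keeps MCQ-multi flexible
  is_omitted            boolean NOT NULL DEFAULT false,
  omit_reason           text,
  time_on_item_ms       integer,
  structured_tags       jsonb    -- copied from version_snapshot, never re-joined
);

-- Illustrative index for the Question Health query pattern.
CREATE INDEX IF NOT EXISTS report_item_attempts_flat_qv_idx
  ON reporting.report_item_attempts_flat
     (tenant_id, org_unit_id, question_version_id, completed_at);
`;
```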

B) reporting.report_item_attempt_choice (typed facts)

1 row per item attempt (keyed by submission_item_id), used for MCQ/MRQ analysis panels.

Minimum columns:

  • tenant_id, org_unit_id
  • submission_item_id
  • question_version_id, completed_at
  • selected_option_ids (array)
  • response_pattern_hash (nullable; MRQ patterns)
  • is_full_credit, is_partial_credit
  • correct_selected, incorrect_selected, total_correct
  • constraint_violation_code (nullable)

C) reporting.report_question_health_rollup (Question Bank UI)

1 row per question_version_id, incrementally updated.

Minimum columns:

  • keys: (tenant_id, org_unit_id, question_version_id)
  • counts: attempts_n, correct_n, omitted_n
  • status counts: scored_n, pending_n, invalid_n, exempt_n
  • facility: facility_pct
  • time: avg_time_ms, median_time_ms, p90_time_ms
  • scoring: avg_score_awarded, avg_score_pct
  • option intel: option_counts jsonb, option_pct jsonb
  • drift: facility_7d_pct, facility_30d_pct, facility_last_run_pct
  • exposure: evaluations_count, recent_runs jsonb
  • badges: health_status, confidence_level, top_reasons jsonb, flags jsonb
  • last_computed_at

Facility definition: facility uses scored attempts only (correct / scored where score_status = SCORED).

Percentiles note (pragmatic “premium”):

  • Don’t try to maintain exact percentiles incrementally on every event.
  • Store avg_time_ms incrementally.
  • Recompute median_time_ms / p90_time_ms via a low-frequency job (e.g., hourly) per tenant/org “dirty set” of question versions.
  • For manual runs, use POST /api/v1/reporting/percentiles/recompute (guarded by reporting.view_named).

That keeps the UI premium without making projections expensive.
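
A sketch of that low-frequency job, assuming node-postgres; percentile_cont is stock Postgres, and the dirty-set plumbing is elided:

```typescript
import { PoolClient } from "pg";

// Recompute exact median/p90 time for the dirty question versions of one
// tenant/org, rather than maintaining percentiles on every event.
async function recomputeTimePercentiles(
  client: PoolClient,            // must be TxManager/RLS scoped
  dirtyQuestionVersionIds: string[]
): Promise<void> {
  await client.query(
    `UPDATE reporting.report_question_health_rollup r
        SET median_time_ms   = s.median_ms::bigint,
            p90_time_ms      = s.p90_ms::bigint,
            last_computed_at = now()
       FROM (
         SELECT question_version_id,
                percentile_cont(0.5) WITHIN GROUP (ORDER BY time_on_item_ms) AS median_ms,
                percentile_cont(0.9) WITHIN GROUP (ORDER BY time_on_item_ms) AS p90_ms
           FROM reporting.report_item_attempts_flat
          WHERE question_version_id = ANY($1)
            AND time_on_item_ms IS NOT NULL
          GROUP BY question_version_id
       ) s
      WHERE r.question_version_id = s.question_version_id`,
    [dirtyQuestionVersionIds]
  );
}
```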

D) Optional: reporting.report_question_health_by_run

Keyed by (tenant_id, org_unit_id, question_version_id, run_label):

  • attempts_n, facility_pct, omit_rate, median_time_ms, option_pct jsonb

This makes drift/cohort comparisons instant.

5.4.5 Projection + update semantics

Events

  • On submission finalisation: project attempt rows + update rollups.
  • On remediation apply: re-project affected submissions so rollups reflect latest operational truth.

Operational truth vs snapshot truth

  • Health v1 defaults to operational truth (latest score).
  • Later you can add an “audit mode” toggle that pins to the score version at time-of-attempt (not needed in v1).
  • Scoring aggregates (meanScorePct, histograms, trends) use score_status = SCORED only.
    • Omitted attempts are treated as EXEMPT and are excluded from scored aggregates by default.

5.4.6 Premium heuristics (explicitly not psychometrics)

Confidence levels

  • LOW: N < 30
  • MED: 30 ≤ N < 100
  • HIGH: N ≥ 100
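
Equivalently, in code (thresholds exactly as above):

```typescript
type Confidence = "LOW" | "MED" | "HIGH";

function confidenceLevel(n: number): Confidence {
  if (n >= 100) return "HIGH";
  if (n >= 30) return "MED";
  return "LOW";
}
```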

Core flags

(activate only when confidence is MED/HIGH unless stated otherwise)

  • TOO_EASY: facility ≥ 0.90 and N ≥ 30

  • TOO_HARD: facility ≤ 0.20 and N ≥ 30

  • HIGH_OMIT: omit ≥ 0.10 and N ≥ 30

  • TIME_SPIKE: median_time ≥ (baseline × 2) and N ≥ 30

    • baseline v1 fallback: compare 7d vs 30d for the same question_version_id

MCQ option intelligence flags

  • NON_FUNCTIONING_DISTRACTOR: distractor < 2% and N ≥ 50
  • DISTRACTOR_DOMINANCE: one wrong option ≥ 50% and facility ≤ 50% and N ≥ 50
  • SPLIT_DISTRACTORS: two wrong options ≥ 25% each and facility ≤ 60% and N ≥ 50

Drift flags

  • FACILITY_DRIFT_UP: last_run − 30d ≥ +0.20 and attempts_last_run ≥ 20
  • FACILITY_DRIFT_DOWN: last_run − 30d ≤ −0.20 and attempts_last_run ≥ 20

Integrity signal (optional but premium)

  • HIGH_FOCUS_LOST_RATE: sessions with focus-lost / attempts ≥ 0.25 and N ≥ 50
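
A sketch of confidence-gated evaluation for the core flags; field names are illustrative (facility and rates as 0..1 fractions), and the TIME_SPIKE baseline is simplified to the 30d median per the v1 fallback:

```typescript
interface RollupSlice {
  attemptsN: number;                   // N
  facility: number;                    // 0..1 fraction of scored attempts
  omitRate: number;                    // 0..1
  medianTimeMs: number | null;         // current (e.g. 7d) median
  baselineMedianTimeMs: number | null; // 30d median used as baseline
}

// Core flags only; option-intelligence, drift, and integrity flags follow the
// same shape with their own thresholds and N >= 50 / N >= 20 gates.
function coreFlags(s: RollupSlice): string[] {
  const flags: string[] = [];
  if (s.attemptsN < 30) return flags; // confidence gate
  if (s.facility >= 0.9) flags.push("TOO_EASY");
  if (s.facility <= 0.2) flags.push("TOO_HARD");
  if (s.omitRate >= 0.1) flags.push("HIGH_OMIT");
  if (
    s.medianTimeMs !== null &&
    s.baselineMedianTimeMs !== null &&
    s.medianTimeMs >= s.baselineMedianTimeMs * 2
  ) {
    flags.push("TIME_SPIKE");
  }
  return flags;
}
```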

5.4.7 UI-ready API contracts (BFF)

List (Question Bank overlay)

GET /question-health

Supports:

  • tag filters from snapshot tags (topic/skill/difficulty…)
  • time range, evaluation_version_id, run_label
  • sorting: needs_attention_first, highest_omit, most_drift, most_exposed

Returns per row:

  • questionVersionId
  • healthBadge { status, confidence, topReasons[] }
  • core { attempts, scoredAttempts, correct, omitted, facilityPct, omitRate, medianTimeMs, statusCounts, invalidRate }
  • analysis.choiceSingle.topOption (MCQ single only)

Export (CSV)

GET /question-health/export

Same filters as the list endpoint. Returns a ZIP bundle for offline analysis (no PII; uses rollup data only).

Bundle contents:

  • question_health_core.csv (always)
  • question_health_choice_single.csv + choice_single_option_breakdown.csv (MCQ single only)
  • question_health_choice_multi.csv + choice_multi_option_breakdown.csv + choice_multi_pattern_breakdown.csv + choice_multi_constraint_breakdown.csv (MRQ only)
  • manifest.json (export version + file row counts)

Detail (Drilldown)

GET /question-versions/{questionVersionId}/health

Returns:

  • core rollup metrics (counts, timing, scoring, statusCounts, exposure, drift, badges)
  • analysis.choiceSingle (MCQ single option distribution)
  • analysis.choiceMulti (MRQ inclusion rates, patterns, partial credit histogram, error modes, constraint violations)
  • per-run comparisons (if by_run table exists)
  • drift timeline buckets (weekly)
  • exposure summary

Denominator rules (explicit):

  • facilityPct = full-credit count / scoredAttempts
  • avgScorePct = mean over scoredAttempts
  • optionPct + topPatterns.pct = share of scoredAttempts
  • omitRate = exempt / attempts
  • invalidRate = invalid / attempts

MRQ errorModes note: counts may overlap (a single attempt can increment multiple error modes).

Permissions

Treat as reporting-grade:

  • reporting.view for list/detail (consistent with reporting capability tightening already in the system)

6. Report catalogue mapping

v1 Contextual Intelligence

  • Question Bank Health (Premium) (this section)

Deferred psychometrics

  • discrimination indices
  • reliability coefficients (alpha/KR-20)
  • IRT

7. Compliance integration (closing the ADR-0011 gap)

Reporting adds new tables containing user_id, therefore:

  • “Forget User (Hybrid Scrub)” does not delete these reporting rows (they contain no direct identifiers).
  • Any DSAR/export or UI identity display joins to users at query time, so scrubbed identity is respected.
  • If any reporting table ever stores free-text fields (avoid in v1), it must participate in the same redaction approach used for submission answer JSON.

8. Acceptance criteria (add Question Health)

  • Question Bank list shows Health badges per question version with:

    • facility %, omit rate, median time, N, top reason
  • Drilldown provides:

    • option distribution (MCQ), drift vs run_label / time buckets, exposure
  • Heuristic flags:

    • confidence-gated (MED/HIGH)
    • labelled as “heuristic” (not psychometrics)
  • Remediation re-projection updates rollups correctly without duplication.

  • All reads/writes are TxManager + RLS scoped.

  • Reporting uses ReportingStore; v1 targets primary DB, replica routing is a config change.

  • Reporting tables do not denormalise user PII (no names/emails/metadata copies).