Evalium Results Foundation
This document captures the backend-first foundations we agreed to before building the next wave of scoring, feedback, and analytics features. The goal is to design the “evaluation brain” now so the SvelteKit front end can gradually light it up without forcing backend rewrites.
1. Runtime Model & Data Capture
- Treat every delivery as a durable fact stream:
  - `delivery_sessions`: per candidate per evaluation version (seed, mode, start/end, device/IP).
  - `submissions`: final hand-in snapshot (status, score cache, routing decisions).
  - `submission_items`: append-only per-question events (answer payload, correctness, points, started_at, answered_at, time_spent_ms, num_checks, audit metadata).
- Store raw data even if we only expose an aggregate score today; future analytics (psychometrics, timelines, anomaly detection) reuse the same facts.
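The fact records above can be sketched as plain data classes. This is a minimal illustration, not the real schema: field names beyond those listed in the bullets (e.g. `session_id`) are hypothetical stand-ins for whatever the actual tables use.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional

@dataclass
class DeliverySession:
    # One row per candidate per evaluation version.
    candidate_id: str
    evaluation_version_id: str
    seed: int
    mode: str                          # "exam", "practice", ...
    started_at: datetime
    ended_at: Optional[datetime] = None
    device_info: Optional[str] = None
    ip_address: Optional[str] = None

@dataclass
class SubmissionItem:
    # Append-only: one record per question event, never updated in place.
    session_id: str
    question_version_id: str
    answer_payload: Any
    correct: Optional[bool]
    points: float
    started_at: datetime
    answered_at: datetime
    time_spent_ms: int
    num_checks: int = 0
```

Because items are append-only, every "check answer" or resubmission adds a new fact rather than overwriting history, which is what makes later timeline and anomaly analysis possible.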
2. Configurable Evaluation “Brain”
- Attach rich configs to `evaluation_versions` so the backend enforces rules and the UI simply toggles what's visible.
- ScoringConfig: weights (per question/tag/section), pass/fail bands, partial-credit rules, deductions, optional negative scoring.
- FeedbackPolicy: when to show feedback (per item check-answer vs. post-submission), what content (correct/incorrect, rationale, next steps), limits on number of checks.
- ExposurePolicy: when answers/rationales may ever be revealed (never, after publish, after expiry, after N attempts, admin-only).
- Mode: exam, practice, training quiz, survey—drives defaults for the above policies.
- Config lives in JSONB (or dedicated tables) so we can add options without migrations; frontend can expose subsets over time.
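A sketch of how mode-driven defaults and author overrides could compose, assuming the JSONB config is loaded into plain dicts. All keys and values here are illustrative, not a fixed schema:

```python
# Hypothetical per-mode policy defaults; real configs live in JSONB on
# evaluation_versions and override these.
MODE_DEFAULTS = {
    "exam": {
        "feedback": {"when": "post_submission", "max_checks": 0, "show": ["correctness"]},
        "exposure": {"reveal_answers": "never"},
    },
    "practice": {
        "feedback": {"when": "per_item", "max_checks": 3, "show": ["correctness", "rationale"]},
        "exposure": {"reveal_answers": "after_attempt"},
    },
}

def effective_policy(mode: str, overrides: dict) -> dict:
    # Shallow-merge author overrides on top of the mode's defaults,
    # without mutating MODE_DEFAULTS itself.
    base = {section: dict(values) for section, values in MODE_DEFAULTS[mode].items()}
    for section, values in overrides.items():
        base.setdefault(section, {}).update(values)
    return base
```

New config options become new keys with sensible defaults, which is why JSONB storage avoids a migration per option.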
3. Pluggable Scoring Engine
- Define a scorer interface per question type: `Score(questionVersion, answer, scoringConfig) -> ItemResult`.
- All scoring runs server-side; question versions in cache/DB contain correct answers, weights, and rubrics. No answer-key data leaves the backend.
- “Check answer” simply invokes the same scorer on a single item and applies Feedback/ExposurePolicy before returning a response.
- New question types (coding challenges, drag/drop, audio) implement the same interface without changing the rest of the system.
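The scorer interface can be sketched as a protocol plus one concrete implementation. This is a minimal illustration in Python (the actual backend language and field names may differ); the multiple-choice rules shown are assumptions, not the shipped scoring logic:

```python
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class ItemResult:
    correct: bool
    points: float
    max_points: float

class Scorer(Protocol):
    # One implementation per question type; the rest of the system
    # only ever calls this method.
    def score(self, question_version: dict, answer: Any, config: dict) -> ItemResult: ...

class MultipleChoiceScorer:
    # question_version carries the answer key; it never leaves the backend.
    def score(self, question_version: dict, answer: Any, config: dict) -> ItemResult:
        weight = config.get("weights", {}).get(question_version["id"], 1.0)
        correct = answer == question_version["correct_option"]
        points = weight if correct else config.get("incorrect_points", 0.0) * weight
        return ItemResult(correct=correct, points=points, max_points=weight)
```

A coding-challenge or drag-and-drop scorer would implement the same `score` signature, so submission aggregation and check-answer never need to know the question type.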
4. Immediate Feedback as First-Class Mode
- Evaluation settings determine whether per-item feedback is allowed, how many times “check answer” can be used, and what is revealed (just correctness vs. rationale vs. remediation link).
- Backend logs each check attempt (to power audit/analytics) and stores the outcome so total scoring remains consistent.
- Modes such as “exam” disallow check-answer entirely; “practice” enables it with author-defined limits.
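The check-answer flow described above can be sketched as a single handler that enforces the FeedbackPolicy before scoring. The dict shapes and the `score_fn` callable are hypothetical stand-ins matching the `Score(questionVersion, answer, scoringConfig) -> ItemResult` interface:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ItemResult:
    correct: bool
    points: float

def check_answer(item: dict, answer: Any,
                 score_fn: Callable[[dict, Any, dict], ItemResult],
                 policy: dict) -> dict:
    # Enforce the FeedbackPolicy limit before doing any scoring work.
    max_checks = policy["feedback"]["max_checks"]
    if item["num_checks"] >= max_checks:
        return {"allowed": False, "reason": "check limit reached"}
    result = score_fn(item["question_version"], answer, policy.get("scoring", {}))
    item["num_checks"] += 1            # persisted for audit/analytics
    out = {"allowed": True, "correct": result.correct,
           "checks_left": max_checks - item["num_checks"]}
    # Reveal rationale only if the policy says so.
    if "rationale" in policy["feedback"]["show"]:
        out["rationale"] = item["question_version"].get("rationale")
    return out
```

Exam mode simply ships `max_checks: 0`, so the same handler refuses every check without special-casing.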
5. Results & Stakeholder Views
- Separate layers:
  - Raw data (`submission_items`).
  - Canonical results (aggregated overall score, per tag/section, pass/fail, warnings), cached in `submissions`.
  - Stakeholder projections:
    - Candidate view (respects Feedback/ExposurePolicy).
    - Manager/HR view (limited insights per organization policy).
    - Admin/analyst view (full detail, psychometrics, audit).
- The `/results` API returns a rich shape from day one (overall, per-section, per-tag, per-item arrays, feedback slots). The frontend can start by showing the overall score and gradually reveal more without backend changes.
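Stakeholder projections can be as simple as field allow-lists over the canonical result shape. The role names and field names below are illustrative; the real projections would also consult ExposurePolicy and per-organization settings:

```python
# Hypothetical allow-lists: which canonical result fields each role may see.
ROLE_FIELDS = {
    "candidate": ["overall", "pass_fail"],
    "manager":   ["overall", "pass_fail", "per_section"],
    "admin":     ["overall", "pass_fail", "per_section", "per_tag", "per_item", "audit"],
}

def project_results(canonical: dict, role: str) -> dict:
    # The canonical result is computed once; each view is a cheap filter.
    allowed = ROLE_FIELDS[role]
    return {key: value for key, value in canonical.items() if key in allowed}
```

Because the canonical shape is rich from day one, exposing a new field to a role is a one-line allow-list change rather than a backend rework.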
6. Timing & Event Context
- Capture `started_at`, `answered_at`, `time_spent_ms`, focus events, and device/IP info automatically. Even if the initial UI only sends coarse timestamps, the schema supports future enhancements (e.g., cheating detection, slow/fast outliers, accessibility adjustments).
7. Testing Philosophy
- Continue writing `.sh` regression suites per milestone (e.g., `evaluations_validate.sh`, upcoming `submissions_results.sh`).
- Populate scripts with realistic flows: create questions with tags, publish versions, deliver evaluations, submit answers, request results.
- Scripts serve as both integration tests and living documentation for API behavior.
With these foundations in place, we can implement instant scoring, per-item feedback, analytics, certificates, and automation without refactoring the core model. The frontend will simply consume more of the existing capabilities as we expose them.