Evalium Results Foundation
This document captures the backend-first foundations we agreed to before building the next wave of scoring, feedback, and analytics features. The goal is to design the “evaluation brain” now so the SvelteKit front end can gradually light it up without forcing backend rewrites.
1. Runtime Model & Data Capture
- Treat every delivery as a durable fact stream:
  - `delivery_sessions`: per candidate per evaluation version (seed, mode, start/end, device/IP).
  - `submissions`: final hand-in snapshot (status, score cache, routing decisions).
  - `submission_items`: append-only per-question events (answer payload, correctness, points, started_at, answered_at, time_spent_ms, num_checks, audit metadata).
- Store raw data even if we only expose an aggregate score today; future analytics (psychometrics, timelines, anomaly detection) reuse the same facts.
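The fact records above can be sketched as plain data classes. This is a minimal illustration, not the real schema: field names beyond those listed in the bullets (e.g. `session_id`) are hypothetical stand-ins for whatever the actual tables use.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional

@dataclass
class DeliverySession:
    # One row per candidate per evaluation version.
    candidate_id: str
    evaluation_version_id: str
    seed: int
    mode: str                          # "exam", "practice", ...
    started_at: datetime
    ended_at: Optional[datetime] = None
    device_info: Optional[str] = None
    ip_address: Optional[str] = None

@dataclass
class SubmissionItem:
    # Append-only: one record per question event, never updated in place.
    session_id: str
    question_version_id: str
    answer_payload: Any
    correct: Optional[bool]
    points: float
    started_at: datetime
    answered_at: datetime
    time_spent_ms: int
    num_checks: int = 0
```

Because items are append-only, every "check answer" or resubmission adds a new fact rather than overwriting history, which is what makes later timeline and anomaly analysis possible.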
2. Configurable Evaluation “Brain”
- Attach rich configs to `evaluation_versions` so the backend enforces rules and the UI simply toggles what's visible.
- ScoringConfig: weights (per question/tag/section), pass/fail bands, partial-credit rules, deductions, optional negative scoring.
- FeedbackPolicy: when to show feedback (per item check-answer vs. post-submission), what content (correct/incorrect, rationale, next steps), limits on number of checks.
- ExposurePolicy: when answers/rationales may ever be revealed (never, after publish, after expiry, after N attempts, admin-only).
- Mode: exam, practice, training quiz, survey—drives defaults for the above policies.
- Config lives in JSONB (or dedicated tables) so we can add options without migrations; frontend can expose subsets over time.
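A sketch of how mode-driven defaults and author overrides could compose, assuming the JSONB config is loaded into plain dicts. All keys and values here are illustrative, not a fixed schema:

```python
# Hypothetical per-mode policy defaults; real configs live in JSONB on
# evaluation_versions and override these.
MODE_DEFAULTS = {
    "exam": {
        "feedback": {"when": "post_submission", "max_checks": 0, "show": ["correctness"]},
        "exposure": {"reveal_answers": "never"},
    },
    "practice": {
        "feedback": {"when": "per_item", "max_checks": 3, "show": ["correctness", "rationale"]},
        "exposure": {"reveal_answers": "after_attempt"},
    },
}

def effective_policy(mode: str, overrides: dict) -> dict:
    # Shallow-merge author overrides on top of the mode's defaults,
    # without mutating MODE_DEFAULTS itself.
    base = {section: dict(values) for section, values in MODE_DEFAULTS[mode].items()}
    for section, values in overrides.items():
        base.setdefault(section, {}).update(values)
    return base
```

New config options become new keys with sensible defaults, which is why JSONB storage avoids a migration per option.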
3. Pluggable Scoring Engine
- Define a scorer interface per question type: `Score(questionVersion, answer, scoringConfig) -> ItemResult`.
- All scoring runs server-side; question versions in cache/DB contain correct answers, weights, and rubrics. No answer-key data leaves the backend.
- “Check answer” simply invokes the same scorer on a single item and applies Feedback/ExposurePolicy before returning a response.
- New question types (coding challenges, drag/drop, audio) implement the same interface without changing the rest of the system.
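The scorer interface can be sketched as a protocol plus one concrete implementation. This is a minimal illustration in Python (the actual backend language and field names may differ); the multiple-choice rules shown are assumptions, not the shipped scoring logic:

```python
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class ItemResult:
    correct: bool
    points: float
    max_points: float

class Scorer(Protocol):
    # One implementation per question type; the rest of the system
    # only ever calls this method.
    def score(self, question_version: dict, answer: Any, config: dict) -> ItemResult: ...

class MultipleChoiceScorer:
    # question_version carries the answer key; it never leaves the backend.
    def score(self, question_version: dict, answer: Any, config: dict) -> ItemResult:
        weight = config.get("weights", {}).get(question_version["id"], 1.0)
        correct = answer == question_version["correct_option"]
        points = weight if correct else config.get("incorrect_points", 0.0) * weight
        return ItemResult(correct=correct, points=points, max_points=weight)
```

A coding-challenge or drag-and-drop scorer would implement the same `score` signature, so submission aggregation and check-answer never need to know the question type.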
4. Immediate Feedback as First-Class Mode
- Evaluation settings determine whether per-item feedback is allowed, how many times “check answer” can be used, and what is revealed (just correctness vs. rationale vs. remediation link).
- Backend logs each check attempt (to power audit/analytics) and stores the outcome so total scoring remains consistent.
- Modes such as “exam” disallow check-answer entirely; “practice” enables it with author-defined limits.
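The check-answer flow described above can be sketched as a single handler that enforces the FeedbackPolicy before scoring. The dict shapes and the `score_fn` callable are hypothetical stand-ins matching the `Score(questionVersion, answer, scoringConfig) -> ItemResult` interface:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ItemResult:
    correct: bool
    points: float

def check_answer(item: dict, answer: Any,
                 score_fn: Callable[[dict, Any, dict], ItemResult],
                 policy: dict) -> dict:
    # Enforce the FeedbackPolicy limit before doing any scoring work.
    max_checks = policy["feedback"]["max_checks"]
    if item["num_checks"] >= max_checks:
        return {"allowed": False, "reason": "check limit reached"}
    result = score_fn(item["question_version"], answer, policy.get("scoring", {}))
    item["num_checks"] += 1            # persisted for audit/analytics
    out = {"allowed": True, "correct": result.correct,
           "checks_left": max_checks - item["num_checks"]}
    # Reveal rationale only if the policy says so.
    if "rationale" in policy["feedback"]["show"]:
        out["rationale"] = item["question_version"].get("rationale")
    return out
```

Exam mode simply ships `max_checks: 0`, so the same handler refuses every check without special-casing.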
5. Results & Stakeholder Views
- Separate layers:
  - Raw data (`submission_items`).
  - Canonical results (aggregated overall score, per tag/section, pass/fail, warnings), cached in `submissions`.
  - Stakeholder projections:
    - Candidate view (respects Feedback/ExposurePolicy).
    - Manager/HR view (limited insights per organization policy).
    - Admin/analyst view (full detail, psychometrics, audit).
- The `/results` API returns a rich shape from day one (overall, per-section, per-tag, per-item arrays, feedback slots). The frontend can start by showing the overall score and gradually reveal more without backend changes.
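Stakeholder projections can be as simple as field allow-lists over the canonical result shape. The role names and field names below are illustrative; the real projections would also consult ExposurePolicy and per-organization settings:

```python
# Hypothetical allow-lists: which canonical result fields each role may see.
ROLE_FIELDS = {
    "candidate": ["overall", "pass_fail"],
    "manager":   ["overall", "pass_fail", "per_section"],
    "admin":     ["overall", "pass_fail", "per_section", "per_tag", "per_item", "audit"],
}

def project_results(canonical: dict, role: str) -> dict:
    # The canonical result is computed once; each view is a cheap filter.
    allowed = ROLE_FIELDS[role]
    return {key: value for key, value in canonical.items() if key in allowed}
```

Because the canonical shape is rich from day one, exposing a new field to a role is a one-line allow-list change rather than a backend rework.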
6. Timing & Event Context
- Capture `started_at`, `answered_at`, `time_spent_ms`, focus events, and device/IP info automatically. Even if the initial UI only sends coarse timestamps, the schema supports future enhancements (e.g., cheating detection, slow/fast outliers, accessibility adjustments).
7. Testing Philosophy
- Continue writing `.sh` regression suites per milestone (e.g., `evaluations_validate.sh`, upcoming `submissions_results.sh`).
- Populate scripts with realistic flows: create questions with tags, publish versions, deliver evaluations, submit answers, request results.
- Scripts serve as both integration tests and living documentation for API behavior.
With these foundations in place, we can implement instant scoring, per-item feedback, analytics, certificates, and automation without refactoring the core model. The frontend will simply consume more of the existing capabilities as we expose them.