Smart Paste — Feasibility & Updates Addendum v1
Purpose: Capture changes/updates after feasibility review (sandboxed DOM extraction + Ohm) and fold them into the Smart Paste & Import plan without rewriting the whole spec.
Companion doc: “Smart Paste & Import — Evalium Spec v1”
1) What changed (Executive Summary)
- Feasibility: GREEN. We will adopt a Sandboxed DOM Extraction approach for HTML pastes (Excel/Word) and a compact Ohm grammar for structured text (MCQ v1), with safe fallbacks. This reframes Smart Paste as an integration task rather than research.
- Terminology: authoring feature is now “Paste → Table Block (HTML‑rendered)” to reflect Block JSON storage and semantic HTML rendering.
- Scope clarifications: A single pipeline powers Tables, Users/Groups/Assignments, and Bulk Question Import (MCQ); each reuses the same detection → mapping → validation → dry‑run → commit pattern.
2) Architectural updates
2.1 HTML Paste (Excel/Word) — SandboxDOMExtractor
- Default path: Parse using DOMParser into a detached Document (preferred, simplest).
- Fallback path: Use an
<iframe sandbox srcdoc="...">to let the browser repair malformed Office HTML in isolation. - Never inject raw pasted HTML. Extract a grid (cells with value +
rowspan/colspan), then map to a Table Block (Block JSON). - A11y enrichments: Prompt for caption and header semantics; inject
<th scope>metadata in the Block (renderer emits semantic HTML). - Brand & responsiveness: No inline styles; tokens for color/spacing; horizontal scroll + sticky header; optional stacked mode for narrow screens.
2.2 Text/CSV/Markdown & MCQ — Ohm parsers
- CSV/TSV/Markdown: Keep delimiter detection; prefer stable CSV/TSV path.
- MCQ v1 Ohm grammar: Lightweight grammar for rows with
stem, option_1..n, correct(index or label). If confidence < threshold, fall back to manual Column Mapping. - Versioning discipline: Bulk MCQ creates new
question_versions(draft) and never mutates published versions; optional assembly into a new evaluation draft.
2.3 Normalizers & Mappers
- GridNormalizer: flattens spans, trims empty rows/cols, infers header hints.
- RowNormalizer: trims cells; soft type inference (email/date/number/enum/boolean).
- Mappers:
- Grid → Table Block (Block JSON).
- Rows → User/Group/Assignment DTOs or DraftQuestion DTOs (MCQ).
3) Guardrails (must‑haves)
- Security: Strict allow‑list sanitisation; no event handlers, no classes/styles; CSP enforced; iframe sandbox when used.
- Limits: Size caps (cells/rows/chars); above thresholds, route to file import (same mapping UI).
- Privacy: Telemetry logs shapes/counts only (no raw PII or stems/options text).
- RLS/Capabilities: FE never authorises; backend enforces with transactions; emit partial‑failure reports.
4) Updated Acceptance Criteria (add to base spec)
4.1 Paste → Table Block (authoring)
- Pasting an Excel/Word table with merged cells produces a Table Block with correct structure, caption, and header scopes; rendered output is semantic HTML with no inline styles and passes the a11y lint.
- Oversized tables are blocked with a clear message and a CTA to use file import.
- Preview shows Original vs Cleaned; Undo works.
4.2 Data Importer (Users/Groups/Assignments)
- Paste → auto‑detect → mapping (with remembered presets) → validation → dry‑run shows Create/Update/Conflict counts and row‑level reasons; commit applies in one transaction.
- Duplicates and policy errors (e.g., invalid attempt_limit, bad UTC date) are flagged pre‑commit with actionable fixes.
4.3 Bulk Question Import (MCQ v1)
- Paste of rows with
stem, option_1..n, correctvalidates; dry‑run displays per‑row issues (missing stem, duplicate correct, too few options). - Commit creates draft question_versions only; no mutation of published content. Optional “assemble into evaluation draft” succeeds and returns the draft ID.
5) Test Matrix (expand)
Paste sources: Excel (Win/Mac), Google Sheets, Word/Google Docs tables.
Shapes: single‑row header; two‑row headers; row/col spans; internal blanks; wide tables; Unicode and RTL samples.
MCQ variants: correct as index vs label; options A–D vs 1–4; extra ignored columns; quotes/commas in stems/options.
Browsers: Chrome, Edge, Firefox, Safari (latest 2 versions) on Win/Mac.
6) Telemetry & Observability (detail)
- Client events:
paste_detected(format, rows, cols),auto_map_confidence,validation_failed(counts),dry_run_applied. - Server metrics: batch size, duration, success/fail counts, top error reasons; CSP report ingestion.
- Dashboards: adoption (% tenants using Smart Paste), time‑to‑publish with tables, importer error funnels.
7) Go/No‑Go Checklist (engineering ready)
- SandboxDOMExtractor handles “warzone” samples; DOMParser preferred path passes fixtures.
- OhmMCQ parses reference sets and falls back cleanly to manual mapping when confidence < threshold.
- Renderer emits semantic HTML; a11y lint passes; sticky headers work; mobile overflow/stacked mode verified.
- Transactions + RLS enforced for all imports; partial failures returned with row refs; telemetry is PII‑safe.
8) Open Decisions (for ADRs)
- Exact caps (cells/rows/chars) per plan tier; defaults for free vs paid.
- Confidence thresholds for OhmMCQ before fallback to manual mapping.
- Whether to allow per‑tenant shared presets for column mappings (default yes).
9) Roll‑in Plan
- Merge this addendum into the base spec as Sections: “Feasibility & Legacy Reuse,” “Architectural Updates,” “Updated Acceptance Criteria,” and “Test Matrix.”
- Create fixtures repo:
/fixtures/paste/html/,/fixtures/paste/csv/,/fixtures/paste/mcq/with golden outputs for CI.
10) Definition of Done (delta)
- All new acceptance criteria in §4 pass in CI with fixtures.
- CSP report-only trial shows zero violations in target paths; then enforced.
- Paste telemetry appears in dashboards; no PII present.