[ CASE FILE / MIGHTY-MOUSE ]
Status: Active
An evaluation-driven harness for finding faster, more reliable ways to run small local coding models.
The messy workflow
Local coding models are attractive because they are private, inexpensive, and available without a network dependency. The hard part is not getting one impressive answer; it is producing useful changes repeatedly under real file, format, scope, and verification constraints.
Mighty Mouse turns prompt and harness decisions into controlled experiments. The same tasks run through competing configurations, and promotion depends on verified outcomes rather than subjective output quality.
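As a rough illustration of what a controlled comparison of this kind can look like, the sketch below runs each configuration against its own throwaway copy of the same task fixture and records pass/fail plus wall-clock latency. It is not the project's actual code: the names `Config`, `run_isolated`, and `run_experiment` are hypothetical, and a "configuration" is reduced to a plain callable for brevity.

```python
import shutil
import tempfile
import time
from pathlib import Path
from typing import Callable

# Assumption: a "configuration" is modeled as a callable that works inside a
# workspace directory and reports whether the task's checks passed.
Config = Callable[[Path], bool]


def run_isolated(fixture: Path, config: Config) -> dict:
    """Run one configuration against a throwaway copy of the task fixture."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        shutil.copytree(fixture, workspace)  # the original fixture is never touched
        start = time.perf_counter()
        passed = config(workspace)
        latency_s = time.perf_counter() - start
    return {"passed": passed, "latency_s": latency_s}


def run_experiment(fixture: Path, configs: dict[str, Config]) -> dict[str, dict]:
    """Run every configuration against the same fixture, one fresh copy each."""
    return {name: run_isolated(fixture, cfg) for name, cfg in configs.items()}
```

Because every run starts from an identical copy, a change made by one configuration cannot leak into another's run, which is what makes the results comparable.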
Ownership boundary
I designed and built the orchestration, task fixtures, response parsing, benchmark runners, telemetry, promotion gates, safety checks, and command-line interface.
The system runs third-party local model runtimes and foundation models. I am not claiming authorship of those models; the work is in making their behavior measurable, configurable, and operationally useful.
My original work: the evaluation harness, benchmark workflow, protocol experiments, verification gates, and CLI. Third-party systems: the model runtimes and foundation models.
System map
A scoped coding problem with expected files and assertions.
Model, protocol, prompt format, limits, and tool policy.
Each run receives controlled files and hidden evaluator state.
Executes the candidate and converts output into a deterministic change.
Tests, schema, scope, and mandatory-format checks decide success.
Latency, tokens, pass rate, and regressions become an auditable decision.
Parser errors, unsafe scope, or failed tests stop promotion.
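The verification step in the map above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the harness's real data model: `Task`, `RunOutput`, `verify`, and the failure-category strings are all hypothetical names, and schema and mandatory-format checks are folded into a single `parsed_ok` flag.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    allowed_files: set[str]   # scope: files the task is permitted to modify


@dataclass
class RunOutput:
    parsed_ok: bool           # the response parsed into a well-formed change
    changed_files: set[str]   # files the candidate actually touched
    tests_passed: bool        # outcome of the task's assertions


def verify(task: Task, out: RunOutput) -> list[str]:
    """Return failure categories; an empty list means every gate passed."""
    failures = []
    if not out.parsed_ok:
        failures.append("parser_error")
    if not out.changed_files <= task.allowed_files:  # subset check on scope
        failures.append("unsafe_scope")
    if not out.tests_passed:
        failures.append("failed_tests")
    return failures
```

Returning categories rather than a single boolean keeps the reason for a failure available for later inspection, while "any non-empty list blocks promotion" stays a one-line check.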
Hard calls
Build record
Baseline and candidate configurations execute the same versioned tasks in isolated workspaces.
Latency, token use, parser state, changed files, assertions, and failure categories are captured.
Candidates must retain reliability, pass replay gates, and clear safety checks before becoming defaults.
A compact command surface runs doctor checks, demos, benchmarks, and result inspection.
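A promotion rule of the kind described above might look like the sketch below. The field names, the reason strings, and the exact comparison (candidate pass rate must not fall below baseline) are assumptions made for illustration; the real gates may be stricter or shaped differently.

```python
from dataclasses import dataclass


@dataclass
class CohortSummary:
    runs: int
    passes: int
    replay_ok: bool          # hypothetical: did the replay gate pass?
    safety_violations: int   # hypothetical: count of failed safety checks

    @property
    def pass_rate(self) -> float:
        return self.passes / self.runs if self.runs else 0.0


def should_promote(baseline: CohortSummary,
                   candidate: CohortSummary) -> tuple[bool, list[str]]:
    """Decide promotion and list every reason a candidate was blocked."""
    reasons = []
    if candidate.pass_rate < baseline.pass_rate:
        reasons.append("reliability_regressed")
    if not candidate.replay_ok:
        reasons.append("replay_gate_failed")
    if candidate.safety_violations > 0:
        reasons.append("safety_check_failed")
    return (not reasons, reasons)
```

Note that speed does not appear in the rule at all: under this sketch a faster candidate is only eligible once it has matched the baseline on reliability and cleared the other gates.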
Verification ledger
15 baseline and 15 Lean Protocol runs across the promotion suite.
Source: eval/results/PROMOTION_NOTES.md (Phase 36 validation)
Both cohorts passed every verification gate in the 30-trial validation.
Source: eval/results/PROMOTION_NOTES.md (Phase 36 validation)
Limit: This result applies to the recorded benchmark suite, configuration, and environment, not to every model or coding task.
Average latency improvement for Lean versus baseline across the paired validation.
Source: eval/results/PROMOTION_NOTES.md (Phase 36 validation)
Average execution-time reduction across three fixed-environment tasks in an earlier, smaller spike.
Source: eval/results/spike_summary.md (clean Trial 3)
Limit: A three-task exploratory result, reported separately from the 30-trial promotion validation.
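For readers unfamiliar with paired comparisons, here is one way an average latency improvement can be computed from matched baseline and candidate runs. It is a sketch only: the function name is hypothetical, and it assumes the figure is a mean of per-pair relative reductions, which may differ from how the recorded numbers were derived (a ratio of cohort means is another common choice).

```python
from statistics import mean


def paired_latency_improvement(baseline_s: list[float],
                               candidate_s: list[float]) -> float:
    """Mean per-pair relative reduction; positive means the candidate was faster."""
    if len(baseline_s) != len(candidate_s) or not baseline_s:
        raise ValueError("need equal-length, non-empty paired samples")
    return mean((b - c) / b for b, c in zip(baseline_s, candidate_s))


# Made-up timings for illustration only; these are not project results.
example = paired_latency_improvement([10.0, 20.0], [8.0, 15.0])
print(f"{example:.1%}")  # prints "22.5%"
```

Pairing each candidate run with the baseline run of the same task keeps slow tasks from dominating the average the way they would in a comparison of raw totals.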
Failure + iteration
Current status
The evaluation harness, Lean promotion record, diagnostics, benchmark runners, and CLI are working. Research continues around model diversity, decomposition, autoresearch, and identifying configurations that generalize beyond a single suite.
Open to the right work
Full-time product, automation, and technical-operations roles are the priority. Select workflow and creative-technology projects are open.