[ CASE FILE / MIGHTY-MOUSE ]

Active

Mighty Mouse

An evaluation-driven harness for finding faster, more reliable ways to run small local coding models.

Role: Creator, researcher, and systems engineer
Discipline: Agent evaluation / local AI systems
Year: 2026
Strongest proof: 30 trials / Paired validation

The messy workflow

Small models can be capable and still fail the workflow.

Local coding models are attractive because they are private, inexpensive, and available without a network dependency. The hard part is not getting one impressive answer; it is producing useful changes repeatedly under real file, format, scope, and verification constraints.

Mighty Mouse turns prompt and harness decisions into controlled experiments. The same tasks run through competing configurations, and promotion depends on verified outcomes rather than subjective output quality.

Ownership boundary

The harness is the product; the models are replaceable inputs.

I designed and built the orchestration, task fixtures, response parsing, benchmark runners, telemetry, promotion gates, safety checks, and command-line interface.

The system runs third-party local model runtimes and foundation models. I am not claiming authorship of those models; the work is in making their behavior measurable, configurable, and operationally useful.

Declared ownership

Original evaluation harness, benchmark workflow, protocol experiments, verification gates, and CLI. Model runtimes and foundation models are third-party systems.

System map

How the work moves.

Input / Task fixture
A scoped coding problem with expected files and assertions.

Input / Candidate config
Model, protocol, prompt format, limits, and tool policy.

State / Isolated workspace
Each run receives controlled files and hidden evaluator state.

Decision / Runner + parser
Executes the candidate and converts output into a deterministic change.

Decision / Verification gate
Tests, schema, scope, and mandatory-format checks decide success.

Output / Promotion record
Latency, tokens, pass rate, and regressions become an auditable decision.

Failure path / Abort / quarantine
Parser errors, unsafe scope, or failed tests stop promotion.
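
The verification gate can be pictured as a list of independent checks whose failures are collected rather than short-circuited. The sketch below is illustrative only: `Change`, `scope_check`, `nonempty_check`, and `verify` are hypothetical names, not the harness's actual API, and the real gate also runs tests and schema checks that are omitted here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Change:
    """Hypothetical parsed change: relative file path -> new contents."""
    files: dict[str, str]


# A check returns None on success or a short failure reason.
Check = Callable[[Change], Optional[str]]


def scope_check(allowed: set[str]) -> Check:
    """Build a check that rejects edits to files outside the task's scope."""
    def check(change: Change) -> Optional[str]:
        outside = sorted(set(change.files) - allowed)
        return f"out-of-scope files: {outside}" if outside else None
    return check


def nonempty_check(change: Change) -> Optional[str]:
    """Reject a change that touches no files at all."""
    return None if change.files else "no files changed"


def verify(change: Change, checks: list[Check]) -> tuple[str, list[str]]:
    """Run every check; any failure routes the run to quarantine."""
    reasons = [r for r in (c(change) for c in checks) if r is not None]
    return ("quarantine" if reasons else "pass", reasons)
```

Collecting every reason, instead of stopping at the first, keeps the failure path informative when a run breaks more than one rule.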

Hard calls

Decisions that changed the system.

Build record

Selected implementation record.

01 / CONTROL

Reproducible task suite

Baseline and candidate configurations execute the same versioned tasks in isolated workspaces.
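
A minimal sketch of that isolation, assuming a fixture is just a task id, a version tag, and a set of seed files. `TaskFixture`, `materialize`, `snapshot`, and `demo_isolation` are hypothetical names used for illustration. Each configuration gets its own copy of the files, so an edit made in one workspace cannot affect the other.

```python
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TaskFixture:
    """Hypothetical fixture: task id, version tag, and seed files."""
    task_id: str
    version: str
    files: dict[str, str]  # relative path -> file contents


def materialize(fixture: TaskFixture, root: Path) -> Path:
    """Write the fixture's files into a fresh directory under root."""
    workspace = Path(tempfile.mkdtemp(prefix=f"{fixture.task_id}-", dir=root))
    for rel_path, contents in fixture.files.items():
        target = workspace / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(contents, encoding="utf-8")
    return workspace


def snapshot(workspace: Path) -> dict[str, str]:
    """Read every file in a workspace back into a path -> contents map."""
    return {
        p.relative_to(workspace).as_posix(): p.read_text(encoding="utf-8")
        for p in sorted(workspace.rglob("*"))
        if p.is_file()
    }


def demo_isolation(fixture: TaskFixture) -> tuple[dict[str, str], dict[str, str]]:
    """Give baseline and candidate separate copies, then edit only one."""
    root = Path(tempfile.mkdtemp(prefix="harness-"))
    try:
        baseline_ws = materialize(fixture, root)
        candidate_ws = materialize(fixture, root)
        # Simulate a candidate edit; the baseline copy must stay untouched.
        (candidate_ws / "app.py").write_text("print('changed')\n", encoding="utf-8")
        return snapshot(baseline_ws), snapshot(candidate_ws)
    finally:
        shutil.rmtree(root, ignore_errors=True)
```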

02 / OBSERVE

Per-run telemetry

Latency, token use, parser state, changed files, assertions, and failure categories are captured.
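
One way to hold those fields is a single record per run, serialized as one JSON line so results stay easy to diff and replay. This is a sketch under assumptions: the class name, the field names, and the `"ok"` parser state are illustrative and are not taken from the harness's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical per-run telemetry record."""
    task_id: str
    config: str
    latency_s: float
    prompt_tokens: int
    completion_tokens: int
    parser_state: str                       # assumed values: "ok", "parse_error"
    changed_files: list[str] = field(default_factory=list)
    assertions_passed: int = 0
    assertions_total: int = 0
    failure_category: Optional[str] = None  # None when the run passed

    @property
    def passed(self) -> bool:
        """A run passes only if parsing worked and every assertion held."""
        return (
            self.parser_state == "ok"
            and self.failure_category is None
            and self.assertions_passed == self.assertions_total
        )

    def to_json_line(self) -> str:
        """Serialize to a single JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)
```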

03 / DECIDE

Promotion protocol

Candidates must retain reliability, pass replay gates, and clear safety checks before becoming defaults.
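
As a sketch of how such a gate might be expressed, the function below blocks promotion unless all three conditions hold and returns the reasons when it does not. `CohortSummary` and `decide_promotion` are hypothetical names, and reducing "retain reliability" to "pass rate not lower than baseline" is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortSummary:
    """Hypothetical summary of one cohort's verified runs."""
    runs: int
    passes: int

    @property
    def pass_rate(self) -> float:
        return self.passes / self.runs if self.runs else 0.0


def decide_promotion(
    baseline: CohortSummary,
    candidate: CohortSummary,
    replay_passed: bool,
    safety_clear: bool,
) -> tuple[bool, list[str]]:
    """Return (promote, blockers); any blocker stops promotion."""
    blockers: list[str] = []
    if candidate.runs == 0:
        blockers.append("no candidate runs")
    if candidate.pass_rate < baseline.pass_rate:
        blockers.append("reliability regressed")
    if not replay_passed:
        blockers.append("replay gate failed")
    if not safety_clear:
        blockers.append("safety check failed")
    return (not blockers, blockers)
```

With two cohorts of 15 runs that both pass every gate, as in the recorded validation, this sketch returns `(True, [])`. Speed is deliberately absent from the decision: a faster candidate that loses reliability is still blocked.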

04 / OPERATE

CLI and diagnostics

A compact command surface runs doctor checks, demos, benchmarks, and result inspection.
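
A command surface of that shape could be laid out with `argparse` subcommands. Everything named here is an assumption for illustration: the program name `mighty-mouse`, the subcommand names, and the `--config` and `--trials` flags are hypothetical and may not match the real CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical four-command CLI: doctor, demo, bench, results."""
    parser = argparse.ArgumentParser(prog="mighty-mouse")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("doctor", help="check the local environment")
    sub.add_parser("demo", help="run a small demonstration task")

    bench = sub.add_parser("bench", help="run a benchmark suite")
    bench.add_argument("--config", required=True, help="candidate configuration name")
    bench.add_argument("--trials", type=int, default=1, help="runs per task")

    results = sub.add_parser("results", help="inspect recorded results")
    results.add_argument("path", help="path to a results file")
    return parser
```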

Verification ledger

Claims with their edges left on.

HS-7 / VERIFIED RECORD

30 trials

Paired validation

15 baseline and 15 Lean Protocol runs across the promotion suite.

Source: eval/results/PROMOTION_NOTES.md, Phase 36 validation

HS-7 / VERIFIED RECORD

100% / 100%

Verified reliability

Both cohorts passed every verification gate in the 30-trial validation.

Source: eval/results/PROMOTION_NOTES.md, Phase 36 validation

Limit: This result applies to the recorded benchmark suite, configuration, and environment, not to every model or coding task.

HS-7 / VERIFIED RECORD

29.5%

Average speedup

Average latency improvement for Lean versus baseline across the paired validation.

Source: eval/results/PROMOTION_NOTES.md, Phase 36 validation
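
An "average latency improvement" can be computed in more than one way, and the choice changes the number. The sketch below shows two common definitions. It does not claim which one the recorded figure uses, and the latencies in the usage note are made-up values, not results from the validation.

```python
from statistics import mean


def improvement_of_means(baseline: list[float], lean: list[float]) -> float:
    """Relative drop in mean latency: (mean(b) - mean(l)) / mean(b)."""
    return (mean(baseline) - mean(lean)) / mean(baseline)


def mean_of_improvements(baseline: list[float], lean: list[float]) -> float:
    """Mean of per-pair relative drops; requires paired, equal-length inputs."""
    if len(baseline) != len(lean):
        raise ValueError("paired comparison needs equal-length inputs")
    return mean([(b - l) / b for b, l in zip(baseline, lean)])
```

For the illustrative pairs `[10.0, 20.0]` and `[8.0, 15.0]`, the first definition gives roughly 23.3% and the second 22.5%. The first is dominated by slow tasks; the second weights every task equally.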

HS-7 / VERIFIED RECORD

~45%

Separate mini-spike

Average execution-time reduction across three fixed-environment tasks in an earlier, smaller spike.

Source: eval/results/spike_summary.md, clean Trial 3

Limit: A three-task exploratory result reported separately from the 30-trial promotion validation.

Failure + iteration

What failed stays in the record.

resolved

Decomposition produced parser-contract failures

Observed: An early decomposition strategy returned output the harness could not apply, producing zero-second parser failures.
Response: Standardized stdout, isolated hidden state, added metadata fallbacks, and captured per-pass telemetry before reevaluation.
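
To illustrate the idea of a parser contract with metadata fallbacks, here is a minimal sketch. It assumes a hypothetical stdout format (a JSON object with a `files` map and an optional `meta` map); the real contract is not documented here. The point is that malformed output becomes a categorized failure instead of an unhandled exception.

```python
import json
from dataclasses import dataclass, field

# Hypothetical defaults applied when the model omits metadata.
DEFAULT_META = {"protocol": "unknown", "pass_index": 0}


@dataclass
class ParseResult:
    ok: bool
    files: dict = field(default_factory=dict)
    meta: dict = field(default_factory=dict)
    error: str = ""


def parse_stdout(stdout: str) -> ParseResult:
    """Turn raw model stdout into a change or a categorized failure."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError as exc:
        return ParseResult(ok=False, error=f"invalid_json: {exc.msg}")
    if not isinstance(payload, dict) or not isinstance(payload.get("files"), dict):
        return ParseResult(ok=False, error="missing_files")
    raw_meta = payload.get("meta")
    if not isinstance(raw_meta, dict):
        raw_meta = {}  # fall back to defaults instead of failing the run
    return ParseResult(ok=True, files=payload["files"], meta={**DEFAULT_META, **raw_meta})
```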

monitoring

One network task regressed under Lean

Observed: The validation watchlist recorded a 37.6% latency regression for task_005 even though the task passed.
Response: Kept the result visible as a monitoring item rather than averaging it away.
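
A per-task watchlist is what keeps a regression like this from disappearing into a favorable average. The sketch below flags any task whose candidate latency is slower than baseline by more than a threshold. The function name, the 10% default threshold, and the task ids in the usage note are assumptions for illustration.

```python
def latency_watchlist(
    baseline: dict[str, float],
    lean: dict[str, float],
    threshold: float = 0.10,
) -> dict[str, float]:
    """Return {task_id: relative slowdown} for tasks slower than threshold."""
    flagged: dict[str, float] = {}
    # Only compare tasks that were measured under both configurations.
    for task_id in sorted(baseline.keys() & lean.keys()):
        change = (lean[task_id] - baseline[task_id]) / baseline[task_id]
        if change > threshold:
            flagged[task_id] = round(change, 3)
    return flagged
```

With made-up latencies `{"task_a": 10.0, "task_b": 4.0}` for baseline and `{"task_a": 6.0, "task_b": 5.0}` for the candidate, the total time drops from 14.0 to 11.0 seconds, yet `task_b` is still flagged with a 25% slowdown.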

open

Broader local-model claims remain unproven

Observed: Current evidence is tied to the benchmark suite and tested configurations.
Response: Expand model and task diversity before claiming general performance improvements.

Current status

Active research system with a usable operator surface.

The evaluation harness, Lean promotion record, diagnostics, benchmark runners, and CLI are working. Research continues around model diversity, decomposition, autoresearch, and identifying configurations that generalize beyond a single suite.

Open to the right work

A useful system starts with the actual problem.

Full-time product, automation, and technical-operations roles are the priority. Select workflow and creative-technology projects are open.

