[ CASE FILE / MIGHTY-MOUSE ]

Active

Mighty Mouse

An evaluation-driven harness for finding faster, more reliable ways to run small local coding models.

Role
Creator, researcher, and systems engineer
Discipline
Agent evaluation / local AI systems
Year
2026
Strongest proof
30 trials / Paired validation

Small models can be capable and still fail the workflow.

Local coding models are attractive because they are private, inexpensive, and available without a network dependency. The hard part is not getting one impressive answer; it is producing useful changes repeatedly under real file, format, scope, and verification constraints.

Mighty Mouse turns prompt and harness decisions into controlled experiments. The same tasks run through competing configurations, and promotion depends on verified outcomes rather than subjective output quality.

The harness is the product; the models are replaceable inputs.

I designed and built the orchestration, task fixtures, response parsing, benchmark runners, telemetry, promotion gates, safety checks, and command-line interface.

The system runs third-party local model runtimes and foundation models. I am not claiming authorship of those models; the work is in making their behavior measurable, configurable, and operationally useful.

Declared ownership

Original evaluation harness, benchmark workflow, protocol experiments, verification gates, and CLI. Model runtimes and foundation models are third-party systems.

How the work moves.

  1. Input

    Task fixture

    A scoped coding problem with expected files and assertions.

  2. Input

    Candidate config

    Model, protocol, prompt format, limits, and tool policy.

  3. State

    Isolated workspace

    Each run receives controlled files and hidden evaluator state.

  4. Decision

    Runner + parser

    Executes the candidate and converts output into a deterministic change.

  5. Decision

    Verification gate

    Tests, schema, scope, and mandatory-format checks decide success.

  6. Output

    Promotion record

    Latency, tokens, pass rate, and regressions become an auditable decision.

  7. Failure path

    Abort / quarantine

    Parser errors, unsafe scope, or failed tests stop promotion.
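
The seven stages above can be sketched as one function. This is a minimal illustration under stated assumptions, not the harness's actual code: `TaskFixture`, `CandidateConfig`, `RunRecord`, and `run_pipeline` are hypothetical names, `generate` stands in for the runner and parser together, and a `ValueError` is assumed to signal output that could not be parsed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

Changes = Dict[str, str]  # path -> new file content


@dataclass(frozen=True)
class TaskFixture:
    """A scoped coding problem: files in scope plus assertions on the result."""
    task_id: str
    allowed_files: FrozenSet[str]
    assertions: Tuple[Callable[[Changes], bool], ...]


@dataclass(frozen=True)
class CandidateConfig:
    """Hypothetical fields; a real config would carry more."""
    model: str
    protocol: str
    max_tokens: int


@dataclass(frozen=True)
class RunRecord:
    task_id: str
    protocol: str
    status: str  # "passed" or "quarantined"
    reason: str = ""


def run_pipeline(
    fixture: TaskFixture,
    config: CandidateConfig,
    generate: Callable[[TaskFixture, CandidateConfig], Changes],
) -> RunRecord:
    """Run one candidate on one fixture and gate the result."""

    def quarantine(reason: str) -> RunRecord:
        return RunRecord(fixture.task_id, config.protocol, "quarantined", reason)

    try:
        changes = generate(fixture, config)  # runner + parser
    except ValueError as exc:  # assumed signal for unparseable output
        return quarantine(f"parser error: {exc}")

    # Scope check: the candidate may only touch files the fixture allows.
    out_of_scope = sorted(set(changes) - fixture.allowed_files)
    if out_of_scope:
        return quarantine(f"unsafe scope: {out_of_scope}")

    # Verification gate: every assertion must hold on the proposed change.
    if not all(check(changes) for check in fixture.assertions):
        return quarantine("assertions failed")

    return RunRecord(fixture.task_id, config.protocol, "passed")
```

In this sketch every failure path returns a record with a reason instead of raising, so a failed run stays auditable.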

Decisions that changed the system.

Selected implementation record.

01 / CONTROL

Reproducible task suite

Baseline and candidate configurations execute the same versioned tasks in isolated workspaces.
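
One way to give each run its own copy of a fixture is a throwaway directory. The sketch below uses only the Python standard library; `isolated_workspace` and the `mm-run-` prefix are hypothetical, and the real harness may isolate runs differently.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def isolated_workspace(fixture_dir: Path) -> Iterator[Path]:
    """Copy a fixture into a temporary directory and yield the copy's path.

    The copy is deleted when the context exits, so no run can leak
    edits into the fixture or into another run.
    """
    with tempfile.TemporaryDirectory(prefix="mm-run-") as tmp:
        workspace = Path(tmp) / "workspace"
        shutil.copytree(fixture_dir, workspace)
        yield workspace
```

Baseline and candidate runs would each enter their own `with isolated_workspace(...)` block, so both start from identical files.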

02 / OBSERVE

Per-run telemetry

Latency, token use, parser state, changed files, assertions, and failure categories are captured.
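
A per-run record with these fields might look like the following. The field names and the JSON Lines output are assumptions made for illustration; the source lists what is captured, not how it is stored.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RunTelemetry:
    """One row per run; every field name here is hypothetical."""
    task_id: str
    config_name: str
    latency_s: float
    tokens_used: int
    parser_state: str  # e.g. "ok" or "contract_violation"
    changed_files: List[str] = field(default_factory=list)
    assertions_passed: int = 0
    assertions_total: int = 0
    failure_category: Optional[str] = None

    def to_jsonl(self) -> str:
        """Serialize to one JSON line with stable key order for diffing."""
        return json.dumps(asdict(self), sort_keys=True)
```

Writing one line per run keeps the log append-only and easy to replay or aggregate later.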

03 / DECIDE

Promotion protocol

Candidates must retain reliability, pass replay gates, and clear safety checks before becoming defaults.
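
The three conditions can be expressed as a small gate. This is a sketch under assumptions: `CohortSummary` and `promotion_decision` are hypothetical names, and "retain reliability" is read here as "pass rate no lower than baseline".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CohortSummary:
    pass_rate: float      # fraction of runs that cleared verification
    replay_passed: bool   # replay gate outcome
    safety_passed: bool   # safety check outcome


def promotion_decision(
    baseline: CohortSummary, candidate: CohortSummary
) -> Tuple[bool, List[str]]:
    """Return (promote, reasons); reasons is empty only when promoting."""
    reasons: List[str] = []
    if candidate.pass_rate < baseline.pass_rate:
        reasons.append("reliability dropped below baseline")
    if not candidate.replay_passed:
        reasons.append("replay gate failed")
    if not candidate.safety_passed:
        reasons.append("safety check failed")
    return (not reasons, reasons)
```

Speed does not appear in the gate: in this reading, a faster candidate that loses reliability is still rejected.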

04 / OPERATE

CLI and diagnostics

A compact command surface runs doctor checks, demos, benchmarks, and result inspection.

Claims with their edges left on.

HS-7 VERIFIED RECORD
30 trials

Paired validation

15 baseline and 15 Lean Protocol runs across the promotion suite.

Source: eval/results/PROMOTION_NOTES.md (Phase 36 validation)

HS-7 VERIFIED RECORD
100% / 100%

Verified reliability

Both cohorts passed every verification gate in the 30-trial validation.

Source: eval/results/PROMOTION_NOTES.md (Phase 36 validation)

Limit: This result applies to the recorded benchmark suite, configuration, and environment, not to every model or coding task.

HS-7 VERIFIED RECORD
29.5%

Average speedup

Average latency improvement for Lean versus baseline across the paired validation.
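
The source does not say how the average is computed, so the sketch below makes an assumption: speedup is the relative reduction in mean latency between the two cohorts. A mean of per-task reductions is another reasonable definition and can give a different number. `average_speedup` is a hypothetical helper.

```python
from statistics import mean
from typing import Sequence


def average_speedup(
    baseline_s: Sequence[float], candidate_s: Sequence[float]
) -> float:
    """Relative reduction in mean latency, as a fraction of the baseline mean.

    Assumes the two sequences are paired: index i in each is the same task.
    """
    if len(baseline_s) != len(candidate_s) or not baseline_s:
        raise ValueError("cohorts must be non-empty and the same length")
    baseline_mean = mean(baseline_s)
    return (baseline_mean - mean(candidate_s)) / baseline_mean
```

With illustrative latencies of `[10.0, 20.0]` seconds for baseline and `[5.0, 10.0]` for the candidate, the function returns `0.5`, a 50% reduction. These numbers are made up for the example and are not benchmark results.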

Source: eval/results/PROMOTION_NOTES.md (Phase 36 validation)

HS-7 VERIFIED RECORD
~45%

Separate mini-spike

Average execution-time reduction across three fixed-environment tasks in an earlier, smaller spike.

Source: eval/results/spike_summary.md (clean Trial 3)

Limit: A three-task exploratory result, reported separately from the 30-trial promotion validation.

What failed stays in the record.

resolved

Decomposition produced parser-contract failures

Observed
An early decomposition strategy returned output the harness could not apply, producing zero-second parser failures.
Response
Standardized stdout, isolated hidden state, added metadata fallbacks, and captured per-pass telemetry before reevaluation.

monitoring

One network task regressed under Lean

Observed
The validation watchlist recorded a 37.6% latency regression for task_005 even though the task passed.
Response
Kept the result visible as a monitoring item rather than averaging it away.

open

Broader local-model claims remain unproven

Observed
Current evidence is tied to the benchmark suite and tested configurations.
Response
Expand model and task diversity before claiming general performance improvements.
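
The monitoring item above depends on reporting latency per task, because an overall average can improve while one task gets slower. A minimal sketch of such a watchlist follows; `latency_watchlist` and the 10% threshold are hypothetical and do not describe the harness's actual rule.

```python
from typing import Dict


def latency_watchlist(
    baseline_s: Dict[str, float],
    candidate_s: Dict[str, float],
    threshold: float = 0.10,  # hypothetical cutoff: flag tasks >10% slower
) -> Dict[str, float]:
    """Return {task_id: relative slowdown} for tasks that regressed.

    Inputs map task id to mean latency in seconds. Only tasks present in
    both cohorts are compared.
    """
    flagged: Dict[str, float] = {}
    for task_id in sorted(baseline_s.keys() & candidate_s.keys()):
        base = baseline_s[task_id]
        change = (candidate_s[task_id] - base) / base
        if change > threshold:
            flagged[task_id] = change
    return flagged
```

With made-up latencies of 10.0 s for both `task_a` and `task_b` at baseline, and 6.0 s and 13.0 s for the candidate, the candidate is faster on average yet `task_b` is flagged as roughly 30% slower.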

Active research system with a usable operator surface.

The evaluation harness, Lean promotion record, diagnostics, benchmark runners, and CLI are working. Research continues on model diversity, decomposition, autoresearch, and configurations that generalize beyond a single suite.
