
Hall of Pain

Real, documented failures of unsupervised AI coding agents. The point of this page is not to dunk on the underlying tools — they are improving fast — but to make the category of failure that aidokit exists to mitigate concrete, named, and searchable.

If you've experienced one of these failure modes in your own work, you are the target user for aidokit. The Standard and Strict tiers exist specifically to prevent the ones below.


Coverage status (v1.x)

All 25 pain categories in the analysis behind this page are now addressed by v1.x. The table below maps each category to the mechanism that mitigates it.

| Category | Status | Mechanism |
| --- | --- | --- |
| Context window limits | 🟢 Strong-at-Standard | Role separation, sub-agents, per-tier memory file |
| Context poisoning / drift | ✅ Strong | scratchpad-hygiene + reset-context skills (always-on); doctor --hygiene |
| Hallucinated APIs / symbols | ✅ Strong | Context7 MCP, Read-before-edit skill, byte-compare gate, zero-LLM scaffolder |
| Token cost explosion | ✅ Strong | aidokit verify --budget |
| Stale documentation | ✅ Strong | aidokit doctor --drift with semantic reference-graph check |
| Tech debt from AI sprawl | 🟢 Strong-at-Standard | Memory-file anti-patterns; scope discipline |
| Non-determinism | ✅ Strong | Zero-LLM scaffolder; hermetic CI; AIDOKIT_DETERMINISTIC=1 |
| Verification gap | 🟢 Strong-at-Standard | Tester-Reviewer role; aidokit verify; change-summary |
| Security & supply-chain | ✅ Strong | .aidokit/capabilities.json; signed packages; five verify facets |
| Workflow fragmentation | ✅ Strong | Three first-party adapters; multi-adapter projects |
| Prompt injection (MCP) | ✅ Strong | untrustedOutput flag; quarantining-untrusted-output skill; doctor check; threat T18 |
| Scope creep | 🟢 Strong-at-Standard | Immutable task briefs; memory-file rules |
| Architectural intent loss | ✅ Strong | First-class ADRs (0001–0020) |
| Onboarding cost | ✅ Strong | Per-tier memory file; shared docs/ skeleton |
| Skill atrophy | ✅ Strong | SP9 hygiene gates (doctor --hygiene) |
| Multi-agent chaos | 🟢 Strong-at-Standard | Fixed role contracts; single Maintainer merge |
| Evaluation difficulty | ✅ Strong | aidokit eval runs ## Acceptance criteria |
| License / IP contamination | ✅ Strong | aidokit verify --license |
| Secrets leakage | ✅ Strong | aidokit verify --secrets; reinforced by .aidoignore |
| Reproducibility across OSes | ✅ Strong | CI matrix Win/macOS/Linux; deterministic emission |
| Model / CLI version drift | ✅ Strong | .aidokit/model.lock; doctor --model-drift |
| Agent runaway loops | ✅ Strong | aidokit verify --loop-cap |
| Code confidentiality | ✅ Strong | .aidoignore; @aidokit/core/aidoignore matcher |
| Dependency bloat | ✅ Strong | aidokit verify --deps |
| AI-code provenance | ✅ Strong | Beads task IDs; change-summary; ADRs |

Total: 20 ✅ Strong / 5 🟢 Strong-at-Standard / 0 partial / 0 unaddressed.

The 🟢 entries are tier-gated — strongest at Standard/Strict, present-as-convention at Minimum.


How to submit a story

We collect failures here so that future readers (and our own roadmap) see the real shape of the problem. You can submit a story in one of three ways:

  1. GitHub issue — open one tagged hall-of-pain with the template below. The maintainers triage and add to this page within a week.
  2. Mail to feedback@aidokit.dev, subject line "Hall of Pain: …".
  3. Pull request — edit wiki/hall-of-pain/index.md directly.

Submission template (paste into a GitHub issue, then redact):

Title:        <one-line summary>
Date:         <YYYY-MM-DD>
Tool:         <Claude Code | Cursor | Aider | Copilot | Codex | other>
Failure kind: <scope-leak | context-loss | tool-soup | runaway-loop | other>
Cost:         <hours lost / dollars / story-points / "trust">
What happened:
  <2–4 sentences. Public link if available.>
What `aidokit` would have done:
  <Honest answer. If "nothing useful", say so. We learn from those too.>
Permission:   <Public name OK / Anonymous / Redacted>

We attribute by name only with explicit permission. Anonymous stories are welcome and given equal prominence.


Failure taxonomy

The stories below are organised by the kind of failure, not by the tool. The same tool can produce different failures in different conditions; the same failure can appear across multiple tools.

Scope leak: agent edits files it wasn't asked to touch

Symptom: a feature branch contains unrelated formatting / refactor / "I noticed this was wrong" changes across files outside the task's stated scope. PR review takes 5× longer; revert is risky.

What aidokit does about it: the watchdog hook on file write checks the task's declared scope and refuses writes outside it. The agent receives a machine-readable refusal and can either re-plan or escalate the scope explicitly.
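A scope check of this kind can be sketched in a few lines. This is a minimal illustration, not aidokit's actual hook: the function names, the glob-based scope format, and the refusal fields are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


@dataclass(frozen=True)
class WriteDecision:
    """Hypothetical result of a scope check (field names are illustrative)."""
    allowed: bool
    path: str
    reason: str


def check_write(path: str, scope: list[str]) -> WriteDecision:
    """Allow a write only if the path matches one of the task's scope globs."""
    # Normalise separators and drop a leading "./" so patterns compare consistently.
    normalised = PurePosixPath(path.replace("\\", "/")).as_posix()
    # Refuse absolute paths and parent traversal before any glob matching.
    if normalised.startswith("/") or ".." in PurePosixPath(normalised).parts:
        return WriteDecision(False, normalised, "path-escapes-project")
    for pattern in scope:
        # Note: fnmatch's "*" also matches "/", so "src/auth/*" covers subfolders.
        if fnmatchcase(normalised, pattern):
            return WriteDecision(True, normalised, f"matched:{pattern}")
    return WriteDecision(False, normalised, "outside-declared-scope")


def refusal_json(decision: WriteDecision) -> str:
    """Serialise the decision so an agent can parse it and re-plan or escalate."""
    return json.dumps(asdict(decision), sort_keys=True)
```

With a scope of `["src/auth/*"]`, a write to `src/auth/login.py` is allowed, while a write to `README.md` comes back as a JSON refusal with the reason `outside-declared-scope`.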

Stories: not yet contributed. Send yours.

Context loss: agent forgets a decision made earlier in the session

Symptom: the agent re-discusses an architectural choice the team made two weeks ago, contradicts an ADR, or "improves" a pattern that was deliberately chosen for a reason it doesn't see.

What aidokit does about it: the Beads task graph and decision log (bd create --type message) survive session boundaries. The Architect role reads docs/decisions/ and docs/03-architecture-summary.md before proposing. Standard tier requires bd prime at session start.

Stories: not yet contributed. Send yours.

Tool soup: every agent has every tool

Symptom: a code-review subagent has shell access and modifies files mid-review; a research subagent runs npm install; token budget explodes because every agent has the full MCP catalog loaded.

What aidokit does about it: per-role MCP scoping. The Researcher gets docs lookup; the Builder does not. The Tester-Reviewer is read-only and literally cannot edit files. Strict tier verifies capability declarations match the emitted hook configuration.
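The idea of per-role scoping, and of checking that declarations match what was emitted, might look like the sketch below. The role keys, the tool names, and both function names are hypothetical and chosen only to mirror the roles named above. The real declarations live in aidokit's own configuration, whose format is not shown on this page.

```python
# Hypothetical role-to-tool declarations; the tool names are illustrative only.
DECLARED: dict[str, frozenset[str]] = {
    "researcher": frozenset({"docs-lookup", "read-file"}),
    "builder": frozenset({"read-file", "write-file", "shell"}),
    "tester-reviewer": frozenset({"read-file", "run-tests"}),  # read-only role
}


def is_allowed(role: str, tool: str) -> bool:
    """A role may use a tool only if it is declared; unknown roles get nothing."""
    return tool in DECLARED.get(role, frozenset())


def capability_mismatches(
    declared: dict[str, frozenset[str]],
    emitted: dict[str, frozenset[str]],
) -> list[str]:
    """List every difference between declared grants and the emitted configuration."""
    problems: list[str] = []
    for role in sorted(set(declared) | set(emitted)):
        want = declared.get(role, frozenset())
        have = emitted.get(role, frozenset())
        for tool in sorted(have - want):
            problems.append(f"{role}: emitted but not declared: {tool}")
        for tool in sorted(want - have):
            problems.append(f"{role}: declared but not emitted: {tool}")
    return problems
```

An empty list from `capability_mismatches` means the emitted configuration grants exactly what was declared. Any entry, such as a write tool appearing for the Tester-Reviewer, is the kind of drift a strict check would fail on.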

Stories: not yet contributed. Send yours.

Runaway loop: agent retries forever on a flaky test or remote API

Symptom: agent spends 40 minutes retrying a single failing test, burning tokens, with no escalation to a human. CI bill spikes.

What aidokit does about it: watchdog hooks bound the number of times an agent can retry a single command. Hitting the cap creates a Beads blocker and stops the agent, surfacing the failure to a human instead of letting it grind on.
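A per-command retry cap can be sketched as a small counter. This is an illustration under assumptions: the class name, the default cap of 3, and the in-memory `blockers` list (standing in for a Beads blocker) are invented for the example.

```python
from collections import Counter


class RetryCapExceeded(Exception):
    """Raised when one command has been attempted more often than the cap allows."""


class RetryWatchdog:
    """Hypothetical watchdog that counts attempts per command string."""

    def __init__(self, cap: int = 3) -> None:
        self.cap = cap
        self.attempts = Counter()
        self.blockers: list[str] = []  # stand-in for filing a blocker task

    def record_attempt(self, command: str) -> int:
        """Count an attempt; past the cap, record a blocker and stop the caller."""
        self.attempts[command] += 1
        count = self.attempts[command]
        if count > self.cap:
            self.blockers.append(f"blocked after {self.cap} attempts: {command}")
            raise RetryCapExceeded(command)
        return count
```

Counting per command rather than globally means a flaky test hits its cap without blocking unrelated commands in the same session.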

Stories: not yet contributed. Send yours.

Provenance gap: no audit trail on AI-generated code

Symptom: regulatory or internal-policy reviewer asks "which lines in production were AI-assisted, by which model, with what review?" — and the team has no answer.

What aidokit does about it: the Strict tier ships change-summary artifacts, capability declarations, and an audit-export command (aidokit audit export --format soc2 — landing in C5). Each merged change carries machine-readable provenance.
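To make "machine-readable provenance" concrete, here is one possible shape for such a record. Every field name below is an assumption drawn from the reviewer's question above (which files, which model, what review). It is not aidokit's actual change-summary or audit-export format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-change provenance record (all fields are illustrative)."""
    change_id: str           # identifier of the merged change
    task_id: str             # the task the change was made under
    model: str               # model name recorded when the code was generated
    reviewed_by: str         # the human or role that approved the change
    files: tuple[str, ...]   # files touched by the change


def to_json(record: ProvenanceRecord) -> str:
    """Serialise with sorted keys so the same record always yields the same text."""
    return json.dumps(asdict(record), sort_keys=True)


def ai_assisted_files(records: list[ProvenanceRecord]) -> set[str]:
    """Answer 'which files were AI-assisted?' from a list of records."""
    return {path for record in records for path in record.files}
```

Keeping one record per merged change lets an auditor answer the question by filtering records instead of reconstructing history from commit messages.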

Stories: not yet contributed. Send yours.


What we will not publish

To keep this page honest and useful:


See also