Hall of Pain
Real, documented failures of unsupervised AI coding agents. The point of this page is not to dunk on the underlying tools, which are improving fast, but to make the category of failure that aidokit exists to mitigate concrete, named, and searchable.
If you've experienced one of these failure modes in your own work, you are the
target user for aidokit. The Standard and Strict tiers exist specifically to
prevent the ones below.
Coverage status (v1.x)
The 25-point analysis behind this page is now fully addressed by v1.x. The table below maps each pain category to the mechanism that mitigates it.
| Category | Status | Mechanism |
|---|---|---|
| Context window limits | 🟢 Strong-at-Standard | Role separation, sub-agents, per-tier memory file |
| Context poisoning / drift | ✅ Strong | scratchpad-hygiene + reset-context skills (always-on); doctor --hygiene |
| Hallucinated APIs / symbols | ✅ Strong | Context7 MCP, Read-before-edit skill, byte-compare gate, zero-LLM scaffolder |
| Token cost explosion | ✅ Strong | aidokit verify --budget |
| Stale documentation | ✅ Strong | aidokit doctor --drift with semantic reference-graph check |
| Tech debt from AI sprawl | 🟢 Strong-at-Standard | Memory-file anti-patterns; scope discipline |
| Non-determinism | ✅ Strong | Zero-LLM scaffolder; hermetic CI; AIDOKIT_DETERMINISTIC=1 |
| Verification gap | 🟢 Strong-at-Standard | Tester-Reviewer role; aidokit verify; change-summary |
| Security & supply-chain | ✅ Strong | .aidokit/capabilities.json; signed packages; five verify facets |
| Workflow fragmentation | ✅ Strong | Three first-party adapters; multi-adapter projects |
| Prompt injection (MCP) | ✅ Strong | untrustedOutput flag; quarantining-untrusted-output skill; doctor check; threat T18 |
| Scope creep | 🟢 Strong-at-Standard | Immutable task briefs; memory-file rules |
| Architectural intent loss | ✅ Strong | First-class ADRs (0001–0020) |
| Onboarding cost | ✅ Strong | Per-tier memory file; shared docs/ skeleton |
| Skill atrophy | ✅ Strong | SP9 hygiene gates (doctor --hygiene) |
| Multi-agent chaos | 🟢 Strong-at-Standard | Fixed role contracts; single Maintainer merge |
| Evaluation difficulty | ✅ Strong | aidokit eval runs ## Acceptance criteria |
| License / IP contamination | ✅ Strong | aidokit verify --license |
| Secrets leakage | ✅ Strong | aidokit verify --secrets; reinforced by .aidoignore |
| Reproducibility across OSes | ✅ Strong | CI matrix Win/macOS/Linux; deterministic emission |
| Model / CLI version drift | ✅ Strong | .aidokit/model.lock; doctor --model-drift |
| Agent runaway loops | ✅ Strong | aidokit verify --loop-cap |
| Code confidentiality | ✅ Strong | .aidoignore; @aidokit/core/aidoignore matcher |
| Dependency bloat | ✅ Strong | aidokit verify --deps |
| AI-code provenance | ✅ Strong | Beads task IDs; change-summary; ADRs |
Total: 20 ✅ Strong / 5 🟢 Strong-at-Standard / 0 partial / 0 unaddressed.
The 🟢 entries are tier-gated — strongest at Standard/Strict, present-as-convention at Minimum.
How to submit a story
We collect failures here so future readers (and our own roadmap) see the real shape of the problem. You can submit a story in one of three ways:
- GitHub issue: open one tagged hall-of-pain with the template below. The maintainers triage and add to this page within a week.
- Mailto: feedback@aidokit.dev, subject line "Hall of Pain: …".
- Pull request: edit wiki/hall-of-pain/index.md directly.
Submission template (paste into a GitHub issue, then redact):
Title: <one-line summary>
Date: <YYYY-MM-DD>
Tool: <Claude Code | Cursor | Aider | Copilot | Codex | other>
Failure kind: <scope-leak | context-loss | tool-soup | runaway-loop | other>
Cost: <hours lost / dollars / story-points / "trust">
What happened:
<2–4 sentences. Public link if available.>
What `aidokit` would have done:
<Honest answer. If "nothing useful", say so. We learn from those too.>
Permission: <Public name OK / Anonymous / Redacted>
We attribute by name only with explicit permission. Anonymous stories are welcome and given equal prominence.
Failure taxonomy
The stories below are organised by the kind of failure, not by the tool. The same tool can produce different failures in different conditions; the same failure can appear across multiple tools.
Scope leak: agent edits files it wasn't asked to touch
Symptom: a feature branch contains unrelated formatting / refactor / "I noticed this was wrong" changes across files outside the task's stated scope. PR review takes 5× longer; revert is risky.
What aidokit does about it: the watchdog hook on file write checks the
task's declared scope and refuses writes outside it. The agent receives a
machine-readable refusal and can either re-plan or escalate the scope
explicitly.
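
To make the scope check concrete, here is a minimal Python sketch of the idea, not aidokit's actual hook. The names `TaskScope` and `check_write`, the glob-based scope format, and the fields of the refusal payload are all assumptions made for illustration.

```python
import posixpath
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TaskScope:
    """Hypothetical declared scope for one task (not aidokit's real schema)."""
    task_id: str
    allowed_globs: tuple  # e.g. ("src/auth/*",)


def check_write(scope, path):
    """Return a machine-readable decision for a proposed file write."""
    # Collapse "a/../b" segments so a path cannot sneak out of scope.
    normalized = posixpath.normpath(path)
    escapes = (
        normalized == ".."
        or normalized.startswith("../")
        or posixpath.isabs(normalized)
    )
    # Note: fnmatch's "*" also matches "/", so "src/auth/*" covers subfolders.
    allowed = not escapes and any(
        fnmatchcase(normalized, glob) for glob in scope.allowed_globs
    )
    if allowed:
        return {"decision": "allow", "task": scope.task_id, "path": normalized}
    return {
        "decision": "refuse",
        "task": scope.task_id,
        "path": normalized,
        "reason": "outside-declared-scope",
        # The agent's two ways forward, as described above.
        "options": ["re-plan", "escalate-scope"],
    }
```

The refusal is returned as data rather than raised as an error so that, in this sketch, the agent can inspect the `options` field and decide whether to re-plan or escalate.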
Stories: not yet contributed. Send yours.
Context loss: agent forgets a decision made earlier in the session
Symptom: the agent re-discusses an architectural choice the team made two weeks ago, contradicts an ADR, or "improves" a pattern that was deliberately chosen for a reason it doesn't see.
What aidokit does about it: the Beads task graph and decision log
(bd create --type message) survive session boundaries. The Architect role
reads docs/decisions/ and docs/03-architecture-summary.md before
proposing. Standard tier requires bd prime at session start.
Stories: not yet contributed. Send yours.
Tool soup: every agent has every tool
Symptom: a code-review subagent has shell access and modifies files mid-review; a research subagent runs npm install; the token budget explodes because every agent has the full MCP catalog loaded.
What aidokit does about it: per-role MCP scoping. The Researcher gets
docs lookup; the Builder does not. The Tester-Reviewer is read-only and
literally cannot edit files. Strict tier verifies capability declarations
match the emitted hook configuration.
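
The declaration-versus-configuration check can be pictured as a set comparison per role. The sketch below is an assumption-laden illustration, not aidokit's implementation: the function name `capability_mismatches`, the role keys, and the tool names (`docs_lookup`, `write_file`, and so on) are hypothetical.

```python
def capability_mismatches(declared, emitted):
    """Compare declared per-role tools with the emitted configuration.

    Both arguments map a role name to an iterable of tool names.
    Returns a dict of roles whose tool sets differ; empty means they match.
    """
    mismatches = {}
    for role in sorted(set(declared) | set(emitted)):
        want = set(declared.get(role, ()))
        got = set(emitted.get(role, ()))
        if want != got:
            mismatches[role] = {
                "missing": sorted(want - got),      # declared but not emitted
                "unexpected": sorted(got - want),   # emitted but never declared
            }
    return mismatches


# Hypothetical declarations: the Tester-Reviewer is read-only.
declared = {
    "researcher": ["docs_lookup", "read_file"],
    "builder": ["read_file", "write_file", "shell"],
    "tester-reviewer": ["read_file", "run_tests"],
}
# Hypothetical emitted config where the reviewer wrongly gained write access.
emitted = dict(declared, **{"tester-reviewer": ["read_file", "run_tests", "write_file"]})

print(capability_mismatches(declared, emitted))
```

A non-empty result here would be the signal to fail verification, since it means some role can do more (or less) than its declaration says.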
Stories: not yet contributed. Send yours.
Runaway loop: agent retries forever on a flaky test or remote API
Symptom: agent spends 40 minutes retrying a single failing test, burning tokens, with no escalation to a human. CI bill spikes.
What aidokit does about it: watchdog hooks bound the number of times an
agent can retry a single command. Hitting the cap creates a Beads blocker
and stops, surfacing the failure to a human instead of grinding on it.
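
A retry cap of this kind can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the shipped watchdog: the class `RetryWatchdog`, its `cap` parameter, and the in-memory `blockers` list (standing in for a Beads blocker) are hypothetical.

```python
from collections import Counter


class RetryWatchdog:
    """Bounds how many times a single command may fail before escalation."""

    def __init__(self, cap):
        self.cap = cap              # maximum attempts per command
        self.attempts = Counter()   # failed attempts, keyed by command string
        self.blockers = []          # stand-in for blockers filed for a human

    def record_failure(self, command):
        """Record one failed attempt; return True if a retry is still allowed."""
        self.attempts[command] += 1
        count = self.attempts[command]
        if count >= self.cap:
            if count == self.cap:
                # File the blocker once, at the moment the cap is reached.
                self.blockers.append({
                    "command": command,
                    "attempts": count,
                    "action": "escalate-to-human",
                })
            return False
        return True
```

Counting per command string, rather than per session, means one flaky test cannot consume the retry allowance of unrelated commands in this sketch.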
Stories: not yet contributed. Send yours.
Provenance gap: no audit trail on AI-generated code
Symptom: regulatory or internal-policy reviewer asks "which lines in production were AI-assisted, by which model, with what review?" — and the team has no answer.
What aidokit does about it: the Strict tier ships change-summary
artifacts, capability declarations, and an audit-export command (aidokit
audit export --format soc2 — landing in C5). Each merged change carries
machine-readable provenance.
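
As a rough picture of what a machine-readable provenance record could contain, the sketch below answers the reviewer's three questions (which lines, which model, what review). Every field name here is an assumption for illustration; it is not the format of aidokit's change-summary artifacts or of the audit export.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-change provenance entry (illustrative fields only)."""
    commit: str        # commit that introduced the change
    path: str          # file touched
    line_start: int    # first AI-assisted line
    line_end: int      # last AI-assisted line
    model: str         # model identifier used for the change
    task_id: str       # task the change was made under
    reviewed_by: str   # role or person who reviewed it


def to_json_line(record):
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


record = ProvenanceRecord(
    commit="abc123",
    path="src/auth/login.py",
    line_start=10,
    line_end=42,
    model="example-model-1",
    task_id="task-1",
    reviewed_by="tester-reviewer",
)
print(to_json_line(record))
```

Sorting the keys keeps the serialized output byte-identical across runs, which makes records easy to diff and to compare in an audit.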
Stories: not yet contributed. Send yours.
What we will not publish
To keep this page honest and useful:
- No anonymous attacks on named individuals or teams. Failures are attributed to patterns, not to people who once used the wrong prompt.
- No proprietary or confidential code. Redact aggressively or summarise.
- No "look how dumb the AI is" content. Failure modes are interesting enough to warrant prevention; gotchas of the day are not.
- No off-topic infrastructure failures. If the database was the problem, this is the wrong wiki.
See also
- Compare aidokit to alternatives: how aidokit differs from SpecKit, BMAD, Superpowers, and rolling your own
- Conformance levels: which protections each tier ships
- Security model: the watchdog hook contract and capability declarations