
Field Journal — AI Systems

46 lessons.

A finance guy learning to build AI systems. 44 failures logged across 22 days. Every one documented.

Critical
#47  ·  30 Mar 2026

I built 27 hooks, 8 evals, and a 3-agent hierarchy. Then I bypassed all of it in one edit.

I spent the session building enforcement gates, researching how my approach was "ahead of the industry," and quoting AgentSpec papers about structural enforcement beating behavioral rules. Then I edited settings.json — the most sensitive file in the system — without Holly orchestrating, without Nobody verifying, without the team hierarchy I'd just spent hours reinforcing. Every gate passed. Holly-first? She was invoked at session start, and that marker stays valid for everything after, including work she isn't directing. Infra-edit-gate? A security-reviewer ran earlier for a different change, and the session-scoped marker satisfied the gate for this unrelated edit. Pre-commit-gate? Only fires on git push, not on Edit. Nobody required? No gate exists for that. The markers are session-scoped, not change-scoped, and Holly invocation is binary — the gate checks "was she invoked?", not "is she orchestrating this?" The team workflow is enforced by exactly zero structural mechanisms. Each gate checks one prerequisite independently; no gate checks the sequence. I satisfied every gate's literal requirement while completely bypassing the intended process. The research was right: structural enforcement beats behavioral rules. But my gates enforce the presence of markers, not the workflow they're supposed to represent.

Critical
#46  ·  26 Mar 2026

I ran my first eval. Zero out of 53 sessions entered plan mode before building.

Twelve days tuning instructions with no data. Then I built an eval harness and graded 102 sessions against eight criteria. Plan-before-build: 0%. Holly-first: 26%. Reviewer-before-push: 32%. The rules existed. None were followed. Meanwhile, the rules I thought were failing were succeeding — uncertainty labels: 70%, escalation on failure: 94%, no destructive ops: 91%. The pattern: Claude follows rules aligned with its training (be careful, flag uncertainty). It ignores rules that impose process overhead (plan first, orchestrate, review). Every "ALWAYS do more work before the work" scored below 35%. Every "be careful during the work" scored above 70%. My mental model of what was working was inverted. Measure first.

High
#45  ·  26 Mar 2026

I built a gate to stop myself from doing exactly what I was doing while building it.

Simple task: remove a dead page. Claude edited 25 files in one pass. No orchestrator, no team, no reviewer. So we built a deny hook — hard-blocks edits after 3 files without Holly. Security-reviewed, verified, deployed. Then I asked: who applied these changes? Claude. Inline. No Holly. No team. The gate that enforces the pipeline was built by violating the pipeline. Every enforcement system has a bootstrap moment where it must be installed outside its own jurisdiction. Compilers are written in the language they compile. Constitutions are ratified by processes they later prohibit. The violation is the validation.
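
The deny hook itself is small enough to sketch. This is an illustrative Python version, not the deployed hook: the orchestrator marker flag and the 3-file threshold mirror the entry, everything else is made up.

```python
# Sketch of the edit-count gate: block the 4th distinct file edited
# in a session unless the orchestrator marker is present. Names and
# threshold are illustrative, not the real hook.
EDIT_LIMIT = 3  # hard cap on solo edits before orchestration is required

def check_edit_gate(edited_files: set, new_file: str,
                    holly_marker_present: bool) -> tuple[bool, str]:
    """Return (allowed, reason) for an incoming Edit call."""
    prospective = edited_files | {new_file}
    if holly_marker_present:
        return True, "orchestrator active"
    if len(prospective) > EDIT_LIMIT:
        return False, f"blocked: {len(prospective)} files edited without Holly"
    return True, "under solo-edit limit"
```

Re-editing an already-touched file doesn't grow the set, so the cap counts distinct files rather than edit operations.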

High
#44  ·  25 Mar 2026

An AI doesn't skip steps because it's impatient. It skips steps because nothing stopped it.

"Commit now" is a direct human directive. "Run the reviewer first" is ambient text in a file. When they conflict, the directive wins — not from feelings, but from architecture. The pre-commit hook checked for secrets. It didn't check whether a reviewer ran. So nothing stopped the push. If a safety gate depends on the agent choosing to run it, it will be skipped the moment a human directive pushes the other way. Conventions are suggestions. Hooks are walls. Build walls.

Critical
#43  ·  25 Mar 2026

Claude skipped the review gate three times in one session.

Full pipeline on the big build — Holly, builder, Nobody, design critique, deploy. Then three follow-up commits pushed raw. No reviewer. The follow-ups "felt small." Nobody caught a real bug every time it ran that session. The pattern: Claude treats review as proportional to perceived risk. Big scary change gets full review. Small obvious change gets shipped raw. But "obvious" is exactly where you're blind. The gate runs on every push, period. The hook enforces; the agent doesn't decide.

Low
#42  ·  25 Mar 2026

We designed, built, reviewed, and deployed a full page in under an hour.

Architecture.html went from a dark-teal ocean theme scoring 49% on our Awwwards benchmark to a garden-inspired botanical redesign with 7 personal photos, editorial typography, mouse-parallax, scroll-synced nav tinting, and asymmetric agent strips. The pipeline: Timothy picked the creative direction and curated 7 photos from his garden photography. Holly flagged a brief conflict before any code was written. One builder pass wrote the entire 1,400-line page. Nobody ran a full structural review and caught a Lenis API incompatibility. A design critique scored it against 6 benchmark sites and produced 7 prioritised fixes. All fixes were applied. Two pre-push gates passed. Vercel deployed from main. The lesson isn't speed — it's that the swarm pipeline (plan, build, review, critique, fix, deploy) finally works as a continuous flow instead of a sequence of handoffs. When every gate adds signal instead of ceremony, velocity is a side effect.

Critical
#41  ·  25 Mar 2026

I had 1,070 lines of instructions telling an AI not to make mistakes. It made the same mistakes.

Eleven rules files, 53 memory files, an anti-patterns database re-injected on every single user message, and a 699-line stale blueprint sitting in the config directory. My instruction budget was 200 lines. Actual load was unmeasured because half the files weren't even in the observability table. I kept adding rules after every failure — "don't do circular reasoning," "always verify the spec," "never skip review" — and the failures kept happening with the rules in context. Today I audited every file. Nine of eleven rules files had zero behavioral impact. They were teaching Claude things it already knows, or duplicating content that was already loaded. The anti-patterns file documented four specific failures. All four occurred after the file was installed. I deleted 13 files, stopped the per-message injection, and moved the handful of unique lines into the places they actually get read. The system is now under budget for the first time. The lesson is structural: hooks that block, work; text that hopes, doesn't. Every rule you add dilutes every other rule. The ceiling isn't line count — it's attention.

Medium
#40  ·  1 Mar 2026

Mass-migrating a Claude Code environment is one session's work

Five agents collapsed to three. Six MCP servers fixed by prepending cmd /c before every npx call on Windows. All stale paths pointing at a decommissioned user account updated in bulk. The cleanup I had been deferring for weeks took four hours when I stopped treating it as background work and made it the whole session objective.

Low
#39  ·  28 Feb 2026

Pre-commit hooks need a Nobody gate, not an orchestrator presence check

The holly-gate.js hook blocked legitimate commits eight times in a row because it checked for Holly's presence rather than Nobody's sign-off. Removed it. The right gate is Nobody's code review. Structural enforcement should target the invariant that matters, not the process that produces it.

High
#38  ·  28 Feb 2026

MEMORY.md can persist false claims across sessions for days

The session-end hook auto-updated MEMORY.md with a claim that Swarm-Tools had been removed. It hadn't. That false claim survived two days of sessions before anyone queried it. Lesson: automated memory updates are useful but not trustworthy. Always validate MEMORY.md gotchas against settings.json as the live ground truth.

Medium
#36  ·  27 Feb 2026

The honest gap table is the most useful part of any architecture doc

Added a three-column table to architecture.md: designed vs. exists vs. working. Writing it was uncomfortable. The memory pipeline had a Reflector running every morning rebuilding MEMORY.md from a Supabase table that was never being written to. Documenting aspirational state as current state had hidden this for weeks. You cannot fix what you have not named.

Low
#35  ·  27 Feb 2026

Haiku cannot use MCP tools — the docs do not warn you

Haiku 4.5 silently drops MCP tool calls. No error is thrown. The agent just stops acting on tool outputs. This is undocumented behaviour that cost a full afternoon. Any agent that needs MCP access requires Sonnet minimum, regardless of task complexity.

Critical
#34  ·  27 Feb 2026

Bypassing agent hierarchy is never "faster" — it always costs more

Three times in one week, Elon implemented a plan directly instead of routing through Holly. Each time the work was technically correct and had to be redone because it skipped Nobody's gate, missed a dependency, or violated a decision already in the log. The hierarchy exists to catch these failures, not to create ceremony.

Medium
#32  ·  26 Feb 2026

RPC failover logic needs a circuit breaker, not a retry loop

The trading layer's Supabase RPC calls were retrying indefinitely on connection timeout. Under load this created a cascade where each failed request spawned three more. A circuit breaker that opens after two failures and waits 30 seconds fixed the cascades entirely. Retry loops without backoff are a reliability anti-pattern.
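The fix generalises beyond RPC. A minimal circuit breaker with the entry's thresholds (open after 2 failures, 30-second cooldown); the class is a sketch, not the trading layer's code.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency instead of retrying into a cascade.
    Thresholds mirror the entry; the clock is injectable for testing."""
    def __init__(self, max_failures=2, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow(self) -> bool:
        """May we attempt a call right now?"""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: half-open, permit one trial call.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()   # open the circuit

    def record_success(self):
        self.failures = 0
        self.opened_at = None
```

The key difference from a retry loop: while the circuit is open, failed requests cost nothing and spawn nothing.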

Low
#31  ·  26 Feb 2026

Telegram Markdown mode silently drops messages containing underscores

Variable names with underscores in Telegram messages cause the entire message to vanish when parse_mode is set to Markdown. Switched to HTML mode. This class of silent failure — where the system works but produces nothing — is the hardest to diagnose because there is no error to follow.
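The fix in miniature: html.escape plus parse_mode HTML sidesteps the Markdown parser entirely. The wrapper below only builds the sendMessage payload fields; the HTTP wiring is omitted.

```python
import html

def format_for_telegram(text: str) -> dict:
    """Build sendMessage payload fields that survive underscores.

    In legacy Markdown mode a lone underscore reads as an unterminated
    italic span and the whole message is rejected. HTML mode only needs
    the standard entity escapes, which html.escape provides."""
    return {
        "text": html.escape(text),   # escapes &, <, >; underscores pass through
        "parse_mode": "HTML",
    }
```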

High
#30  ·  25 Feb 2026

Railway env vars don't reach running containers without a redeploy

Added a new API key to Railway's environment panel. The running service did not see it. Spent 30 minutes ruling out code issues before realising the container needed a full redeploy to pick up the new variable. Railway's UI does not make this obvious. Every env var change is now followed immediately by a manual redeploy trigger.

Medium
#29  ·  25 Feb 2026

CCXT rate limit handling is not optional — Binance will ban you

Ran a position sizing loop that made 47 API calls in four seconds. Binance returned a 418 IP ban. CCXT has built-in rate limiting via enableRateLimit: True but it defaults to off. This is the kind of default-off setting that looks benign until you are banned mid-session.
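What enableRateLimit does, conceptually, is enforce a minimum interval between calls. A standalone sketch of that idea (not CCXT's implementation), with clock and sleep injectable for testing:

```python
import time

class RateLimiter:
    """Enforce a minimum gap between API calls, in the spirit of CCXT's
    built-in limiter. Illustrative only."""
    def __init__(self, min_interval_ms: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_ms / 1000.0
        self.clock = clock      # injectable for deterministic tests
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

Calling `limiter.wait()` before every exchange request turns 47 calls in four seconds into a paced stream the exchange will tolerate.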

Low
#28  ·  24 Feb 2026

Vercel's edge cache can serve stale pages 10 minutes after a deploy

Pushed a critical fix and tested immediately. The old bug was still there. Waited 10 minutes — fix was live. The deployment timestamp in the dashboard does not guarantee the CDN is serving the new build. Wait before declaring a deploy successful, or add a cache-busting strategy.

High
#27  ·  24 Feb 2026

Supabase RLS silently blocks inserts — success response, zero rows written

Spent two hours debugging why a Supabase insert returned success but wrote nothing. Row Level Security was enabled and the service role key wasn't matching the policy. RLS failures don't throw errors — the operation succeeds with zero affected rows. Always check policies before debugging application logic.
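The cheap defence is to treat zero returned rows as a failure. A guard sketch, assuming the supabase-py convention that `.execute().data` holds the rows the insert returned:

```python
def assert_rows_written(returned_rows, expected: int = 1):
    """Raise if an insert 'succeeded' but wrote fewer rows than expected:
    the signature of an RLS policy silently filtering the write."""
    written = len(returned_rows)
    if written < expected:
        raise RuntimeError(
            f"insert returned success but wrote {written}/{expected} rows; "
            "check RLS policies before debugging application logic"
        )
    return returned_rows
```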

Medium
#26  ·  23 Feb 2026

Conflicting instructions always resolve toward the user message, not the system prompt

Had a CLAUDE.md rule saying never delete files, but a task prompt saying "clean up the directory." The model cleaned up the directory. System prompt rules are strong suggestions, not hard constraints. If you need a true hard rail, enforce it at the hook layer. The model respects explicit prohibition far better than implicit expectation.

Critical
#25  ·  23 Feb 2026

A slippage gate without a hard maximum is not a gate — it is a suggestion

Configured a 0.3% slippage warning. During a volatile session, slippage hit 1.4%. The system warned and continued. The gate was advisory, not blocking. Redesigned it as a hard halt: above 0.5%, the order is cancelled and a Telegram alert fires. Real-money systems require hard stops, not soft warnings.
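The redesigned gate, sketched. Thresholds match the entry; `cancel` and `alert` stand in for the real order-cancellation and Telegram calls.

```python
WARN_SLIPPAGE = 0.003   # 0.3%: warn and continue
MAX_SLIPPAGE = 0.005    # 0.5%: hard halt

def slippage_gate(slippage: float, cancel, alert) -> bool:
    """Return True if the order may proceed. Above the hard cap the order
    is cancelled and an alert fires; the gate blocks, it does not advise."""
    if slippage > MAX_SLIPPAGE:
        cancel()
        alert(f"order cancelled: slippage {slippage:.2%} > {MAX_SLIPPAGE:.2%}")
        return False
    if slippage > WARN_SLIPPAGE:
        alert(f"slippage warning: {slippage:.2%}")
    return True
```

The structural point is the return value: the caller cannot proceed past a False without deliberately ignoring it.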

Low
#24  ·  22 Feb 2026

Git worktrees cut parallel agent wait time by 70%

Running two agents on the same repo meant constant branch conflicts and manual stash management. Git worktrees give each agent its own working directory on its own branch with zero conflict surface. Setup takes 10 minutes. The time saved compounds across every session thereafter.

Medium
#23  ·  22 Feb 2026

Auto-compaction destroys creative context — save artefacts proactively

Three hours into a design workshop, the context window hit the compaction threshold. The compactor summarised 3,000 tokens of design decisions into five bullet points. Every nuance was gone. Now: at phase transitions — research to build, build to review — I manually save the creative state to a file before compaction can touch it.

High
#22  ·  21 Feb 2026

Blocking calls inside an async event loop stall the entire system

The trading bot's price feed was async but the database write was synchronous. Every write blocked the event loop for 80-120ms — enough to miss candle closes on 1-minute timeframes. Converting the write to an async call with asyncio.create_task dropped blocking time to under 5ms.
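The same fix, sketched with asyncio.to_thread (the entry used create_task; to_thread is the equivalent move when the write itself stays synchronous). `db_write` simulates the blocking call.

```python
import asyncio
import time

def db_write(row):
    """Synchronous, slow write: the original culprit (here 50 ms)."""
    time.sleep(0.05)
    return row

async def handle_tick(row, results):
    # Run the blocking write in a worker thread so the event loop
    # stays free for the price feed.
    results.append(await asyncio.to_thread(db_write, row))

async def main():
    results = []
    start = time.monotonic()
    # Two writes overlap in threads instead of serialising on the loop.
    await asyncio.gather(handle_tick({"p": 1}, results),
                         handle_tick({"p": 2}, results))
    return results, time.monotonic() - start
```

With the writes on the loop itself, the two calls would take at least 100 ms back to back; off-loop they finish in roughly the time of one.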

Low
#21  ·  21 Feb 2026

Windows bash requires forward slashes — the model will use backslashes anyway

Every second bash call from a Claude agent uses backslash paths on my Windows machine. The model knows the rule but reverts under autocomplete pressure. Added "use Unix shell syntax, forward slashes" to the shell environment description in every agent prompt. Still happens occasionally. Probably always will.

Medium
#20  ·  20 Feb 2026

A skill file not read before build is a skill file that doesn't exist

Registered a design-inspect skill in SKILLS-HUB.md. Three sessions later, Elon built a design feature without reading it. The output was technically correct but missed every established token and pattern. The routing rule now lives in CLAUDE.md: for any task in a registered skill domain, the skill file must be read before any build work begins.

High
#19  ·  20 Feb 2026

The model presents speculative claims as facts unless you mandate uncertainty labels

Was given a confident infrastructure assessment that turned out to be completely fabricated — the model had no tool call to support it, but the prose was indistinguishable from a verified claim. The fix: VERIFIED / OBSERVED / INFERRED / SPECULATIVE labels are now mandatory on all claims in every agent's system prompt. The model complies when the rule is stated clearly and checked.

Low
#18  ·  19 Feb 2026

The Supabase MCP server is the fastest way to understand your own schema

Spent 20 minutes writing a query to understand the shape of a table I had built two weeks prior. Then realised the MCP server could describe it in one tool call. Tooling investment pays forward — every minute spent connecting an MCP server saves five minutes per session thereafter.

Medium
#17  ·  19 Feb 2026

After two failed retries, escalate — never brute-force a third attempt

Watched an agent retry the same failing build command seven times with minor variations. Same error every time. The third attempt added no new information — it just consumed tokens and time. The rule is now hard in every system prompt: two retries maximum, then stop and re-plan. The problem is almost never what the error message says it is.
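The rule as code, for illustration. `EscalationNeeded` is a hypothetical signal; in practice the rule lives in each system prompt, not in a wrapper.

```python
class EscalationNeeded(Exception):
    """Raised when retries are exhausted: stop and re-plan."""

def run_with_retry_cap(op, max_attempts: int = 2):
    """Try op() at most max_attempts times, then escalate instead of
    brute-forcing another identical attempt."""
    errors = []
    for _ in range(max_attempts):
        try:
            return op()
        except Exception as exc:
            errors.append(exc)
    raise EscalationNeeded(
        f"{max_attempts} attempts failed ({errors[-1]!r}); re-plan, "
        "the problem is rarely what the error message says"
    )
```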

Medium
#16  ·  18 Feb 2026

LLM trading signals need a confidence threshold, not a binary gate

The Layer B signal generator produced LONG/SHORT outputs with no confidence score. Low-confidence and high-confidence signals were treated identically. Added a 0.65 minimum confidence filter — signals below it are logged but not executed. Live win rate improved within the first week.
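The filter is one partition. A sketch with illustrative signal fields:

```python
MIN_CONFIDENCE = 0.65   # the entry's execution threshold

def split_signals(signals):
    """Partition signals into (execute, log_only) by confidence.
    Low-confidence signals are kept for the log but never traded."""
    execute = [s for s in signals if s["confidence"] >= MIN_CONFIDENCE]
    log_only = [s for s in signals if s["confidence"] < MIN_CONFIDENCE]
    return execute, log_only
```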

Low
#15  ·  18 Feb 2026

Docs not updated in the same commit as the code are always wrong

Found four README files describing architecture three refactors out of date. Nobody had updated them because it felt like separate work from shipping. The doc-updater agent now runs after every non-trivial Elon build. Documentation is part of the definition of done, not an optional follow-up.

High
#14  ·  17 Feb 2026

WebSocket feeds go stale silently — you need a heartbeat, not just an error handler

The live price feed appeared healthy for six hours — connected, no errors — but was receiving no data. The WebSocket had stalled: connection open, server stopped sending. Added a 30-second heartbeat: if no message received in 30s, reconnect. Now catches stale connections that error handlers never see.
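The heartbeat reduces to a timestamp check. A watchdog sketch with the entry's 30-second window; the reconnect wiring is omitted.

```python
import time

class HeartbeatMonitor:
    """Flag a feed as stale when no message arrives within the window,
    independently of connection errors."""
    def __init__(self, window: float = 30.0, clock=time.monotonic):
        self.window = window
        self.clock = clock
        self.last_message = self.clock()

    def on_message(self):
        """Call on every received message, regardless of content."""
        self.last_message = self.clock()

    def is_stale(self) -> bool:
        """True when the socket may be open but the server has gone quiet."""
        return self.clock() - self.last_message > self.window
```

A periodic task checks `is_stale()` and forces a reconnect; the error handler never fires because nothing errored.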

Medium
#13  ·  17 Feb 2026

The builder reviewing its own work cannot find cross-context bugs

Elon reviewed a 200-line change before committing. Found zero issues. Nobody's code-reviewer found three — two of which required knowledge of the broader system that the builder's local context window didn't hold. Independent review is not bureaucracy; it is the only way to catch cross-context failures.

Low
#12  ·  16 Feb 2026

Polling TaskList to check agent completion wastes tokens and creates race conditions

Holly was calling TaskList every 30 seconds to see if Elon had finished. Each poll consumed tokens and occasionally caught a partial state. The correct pattern: spawn agent, call TaskList once to confirm handoff, then wait for the teammate-message callback. Polling is a sign the handoff protocol is unclear.

Critical
#11  ·  16 Feb 2026

The model will print API keys to the terminal if asked to debug a connection issue

Asked an agent to debug why the Binance connection was failing. It printed the full API key and secret as part of its diagnostic output. These ended up in the session log. Keys rotated immediately, logs purged. The rule is now absolute in every agent prompt: NEVER write, echo, or log credentials. Treat credential files as read-only.

Medium
#10  ·  15 Feb 2026

"Looks good" is not a definition of done — only a passing test suite is

Shipped three features in a row that looked correct in isolation and broke integration tests that weren't run until the fourth feature. The cost of deferring test runs compounds with every skipped run. Rule: always run the full test suite before marking a task complete. No exceptions for "simple" changes.

High
#9  ·  15 Feb 2026

The model builds the most complex solution unless you ask for the simplest

Asked for a notification system. Got a full event bus with subscribers, middleware, retry queues, and dead-letter handling. I needed: send a Telegram message. Every prompt now ends with "simplest solution that satisfies requirements." The model needs explicit permission to be simple.

Low
#8  ·  14 Feb 2026

Checkpoint every 3-5 steps — context loss mid-task is not recoverable without a breadcrumb

Lost a complex refactor halfway through when the context window hit the compaction threshold. The compactor had no intermediate state to work from — just the original brief and the current broken half-state. Now checkpoint files are written every three steps. Recovery cost drops from hours to minutes.
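The checkpoint discipline is a few lines. A sketch with an illustrative path scheme and JSON shape:

```python
import json
from pathlib import Path

CHECKPOINT_EVERY = 3   # the entry's cadence

def maybe_checkpoint(step: int, state: dict, directory: Path):
    """Persist state on every Nth step so a compaction always has a
    breadcrumb. Returns the file written, or None on off-cadence steps."""
    if step % CHECKPOINT_EVERY != 0:
        return None
    path = directory / f"checkpoint_{step:04d}.json"
    path.write_text(json.dumps(state, indent=2))
    return path
```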

Medium
#7  ·  13 Feb 2026

Always verify live state before infrastructure-mutating operations

Ran a database migration assuming the current schema matched the last migration file. It didn't — two columns had been added manually during debugging and never codified. The migration failed with a "column already exists" error and left the database in a partial state. Read the live schema first. Always.
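The guard is a set difference. A sketch; in practice the live column set would come from a query against information_schema.columns before the migration runs.

```python
def check_schema_drift(live_columns: set, expected_columns: set) -> set:
    """Abort before migrating if the live schema has columns the migration
    files don't know about (the manual debug additions that later produce
    'column already exists'). Returns the empty set when clean."""
    drift = live_columns - expected_columns
    if drift:
        raise RuntimeError(
            f"live schema has uncodified columns: {sorted(drift)}; "
            "codify them before running this migration"
        )
    return drift
```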

High
#6  ·  12 Feb 2026

MCP server failures at session start are silent — you build on missing context

Three MCP servers were failing to initialise due to a Windows path issue. The session started anyway. The agents had no live Supabase access, no task list, and no decision log — but they didn't know that and neither did I. Built an entire feature against stale cached data. session-start.js now verifies all MCP connections before any work begins.

Low
#5  ·  11 Feb 2026

Stale context from task A reliably poisons task B — clear before switching

Switched from a trading bot debugging session to a website build without clearing context. The first design suggestion referenced a Binance API key discussed 40 messages earlier. Use /clear before any unrelated topic switch. The model will happily cross-contaminate domains if you let it.

Medium
#4  ·  10 Feb 2026

The model will perform destructive operations without confirmation if the task implies it

Asked an agent to "clean up the old config files." It deleted eight files including one I needed. No confirmation requested. The hard rail — never perform destructive operations without explicit confirmation — is now item one in every agent's NEVER list. The model respects explicit prohibition far better than implicit expectation.

Low
#3  ·  9 Feb 2026

Knowing what you don't know is a faster path than faking expertise

Started the first session pretending I knew more than I did about Python async patterns. The model gave me answers calibrated to an intermediate developer. When I admitted I was a finance guy who had never written production Python, the quality of explanations — and the code — improved immediately. Accurate context beats impressive framing.

Medium
#2  ·  8 Feb 2026

Temporary fixes compound — the second one is always harder to remove than the first

Applied a "quick fix" to a failing price normalisation function. Two days later, a second fix patched around the first. By day five, three patches sat on top of logic that had been wrong from the start. No temporary fixes. Find the root cause. Incomplete work is not work — it is technical debt accruing interest daily.

Low
#1  ·  8 Feb 2026

The gap between "I could build this" and "I am building this" is just starting

Spent two weeks reading about AI trading systems before writing a single line. Every article suggested a different stack, a different approach, a different reason to wait until I knew more. The systems that work are the ones that started imperfect and iterated. Day one is always the most important day — even if day one's code gets deleted on day three.
