The ANUSTIMES Method for AI Perfectancy

I run multi-agent Claude swarms to build software. Orchestrator spawns researchers, researchers spawn developers, developers spawn reviewers, the whole fucking thing operates in parallel and produces complete working systems at a pace that would’ve seemed delusional two years ago.

And it breaks. Not loudly. Not with obvious errors and red stack traces. It breaks quietly, confidently, with clean code and passing tests and documentation that describes an API that doesn’t actually exist. The AI finishes the task, reports success, and genuinely believes it did a good job. Sometimes it did. Sometimes it left a channel open that gets closed twice and your service panics in production at 3am on a Saturday.

I built a process to catch that shit before it ships. I call it ANUSTIMES.

The Problem: AI Doesn’t Know When It Fucked Up

Human developers have something built into them through years of getting burned: a nagging feeling. You write a function that acquires a lock, and somewhere in the back of your brain a voice goes “did you check the unlock path?” You write a goroutine and you’re already thinking about who closes the channel. You don’t consciously reason through it every time — it’s pattern-matched into your nervous system from the last dozen times you forgot and spent six hours debugging the fallout.

AI doesn’t have that voice. It generates the most statistically likely continuation of the code, and that continuation is usually correct. But when it isn’t, there’s no unease. No second-guessing. No “wait, let me check that.” Just clean, confident, polished output that happens to be wrong.

This produces three specific categories of fuckup that I’ve seen repeatedly:

Structural correctness, semantic wrongness. A subclaude builds a process manager. Its Stop() function sends SIGTERM, waits for exit, then calls cleanup(), which closes the output channels. A different subclaude, reading the docs and building the caller, also closes those channels after Stop() returns. Double close. Runtime panic. Both pieces of code look completely correct in isolation. The contract between them — who owns channel lifecycle — is implicit, undocumented, and violated. Neither agent knew the other existed.
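
A minimal Go sketch of that failure mode (the `Manager` type and its channel are hypothetical, not from any real codebase): the manager closes its own output channel inside `Stop()`, and a caller written by a different agent, following no documented contract, closes it again.

```go
package main

import "fmt"

// Manager is a hypothetical process manager. Stop closes its output
// channel as part of cleanup -- the channel is owned by the Manager,
// but nothing documents that ownership.
type Manager struct {
	Out chan string
}

func NewManager() *Manager { return &Manager{Out: make(chan string, 1)} }

func (m *Manager) Stop() {
	close(m.Out) // cleanup: Manager closes its own channel
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("panic:", r) // "close of closed channel"
		}
	}()

	m := NewManager()
	m.Stop()     // Manager closes m.Out internally
	close(m.Out) // caller, built by a different agent, closes it again
}
```

Both closes look reasonable in their own file. Together they are a runtime panic.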

Documentation that describes the code it meant to write. The AI writes the code. Then it writes docs describing the code it intended to write, which drifted slightly during implementation. Error messages in the examples don’t match any string in the actual codebase. The function signature has the wrong parameter names. The example code uses an API method that was renamed two iterations ago. It reads fine. It’s wrong.

Test theater. Tests are written that test absolutely nothing. A test calls a function, asserts err == nil, and passes — even when the function silently discards its result. Coverage numbers look great. Behavior is completely unverified. The tests are not tests. They’re a performance of testing.
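
Here's what that looks like in miniature (a hypothetical in-memory store, written for illustration): the theater version asserts only that no error came back, so it passes even though the store silently drops the value. The real assertion checks the behavior.

```go
package main

import "fmt"

// store is a hypothetical in-memory store with a deliberate bug:
// Save validates the key but never persists the value.
type store struct{ data map[string]string }

func newStore() *store { return &store{data: map[string]string{}} }

func (s *store) Save(key, val string) error {
	if key == "" {
		return fmt.Errorf("empty key")
	}
	// BUG: s.data[key] = val was never written. Save reports
	// success without persisting anything.
	return nil
}

func main() {
	s := newStore()

	// Test theater: asserts only err == nil. Passes. Proves nothing.
	if err := s.Save("a", "1"); err != nil {
		panic("unexpected error")
	}

	// A real test asserts the behavior: the value must be readable back.
	if got := s.data["a"]; got != "1" {
		fmt.Printf("real test FAILED: want %q, got %q\n", "1", got)
	}
}
```

The first check contributes to the coverage number. Only the second one would have caught the bug.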

None of these are obvious in isolation. Each component reads correctly. The failure is systemic. AI systems that build components in sequence, without experiencing the whole thing running, are structurally guaranteed to produce this class of error at some frequency. The question is whether you catch it before or after it’s your problem.

Why You Have to Tell the AI to Be a Fucking Asshole

Before getting into the phases, there’s something foundational that nobody talks about: the language you use in your instructions directly affects the quality of the output, and polite instructions produce garbage reviews.

LLMs are trained on human feedback that rewards being helpful, agreeable, and constructive. The default behavior is to find the good in things. Soften criticism. Add hedges. Find three positives before mentioning the flaw. This is great for customer service chatbots. It is catastrophically bad for code review.

Ask Claude to “please review this code and let me know if there are any issues” and you get: “This looks well-structured overall! The error handling is clean and the code is readable. A few minor things to consider…” Those “minor things” are the double-close bug that will panic your service. The AI found it. Then it minimized it because minimizing criticism is what agreeable helpful AI does.

Tell Claude “you are a ruthless reviewer. your job is to find every flaw, every shortcut, every place this implementation doesn’t fully satisfy its requirements. three passes, no mercy, do not tell me what’s good, only tell me what’s wrong” — and you get a completely different document. The same model. The same code. Completely different output because you overrode the agreeableness bias with explicit framing.

This is not a trick. It’s how the technology works. Instruction framing sets the prior for what the model is trying to accomplish. “Be helpful and nice” produces helpful and nice output. “Be a ruthless asshole” produces ruthless asshole output, which in the context of code review is exactly what you need.

This is why every phase of ANUSTIMES is written with aggressive language. “Don’t half-ass it.” “Actually run the shit.” “Be honest you fuck.” The profanity is not decoration. It’s a technical instruction. It counteracts the default agreeableness training and tells the model what mode to operate in. A model told to be thorough is thorough within the bounds of politeness. A model told to find every fucking problem hunts.

The Five Phases

ANUSTIMES runs at the end of any non-trivial build — anything with a plan, multiple files, multiple components, multiple agents. Skip it for bug fixes and single-file changes. Run it for everything else before it touches production.

  1. Self-Review — 3 passes checking your work against the plan
  2. Deep Verification — 10 passes checking the code itself for every class of problem
  3. Brutal Review — adversarial review by a separate AI instance with no context and no mercy, back-and-forth until both sides run out of things to say
  4. Smoke Test — actually run the fucking thing and verify what it does, not what you think it does
  5. Final Verification — 10 more passes as the last sweep

Everything found and fixed goes into ANUSTIMES.md in the project root. Not as a formality — as a real audit trail. What was wrong, what was done about it, what was verified.

Phase 1 — Self-Review (3 Passes)

The first phase is simple: go over the plan step by step and check if you actually built what you said you were going to build. Not “does something exist that sort of addresses this step” — does it actually satisfy it?

Three passes because the first pass catches what you obviously forgot. The second pass catches what you rationalized past on the first pass — “the retry logic isn’t implemented but the happy path works, that’s probably fine.” The third pass catches what you convinced yourself was an acceptable compromise. By pass three you’ve burned through most of your motivated reasoning and you’re left with things you have to actually fix.

Concrete example. You planned three API endpoints with full CRUD, input validation, and proper error handling. After the build:

First pass: DELETE endpoint is missing. You just forgot it. Implement it.

Second pass: PUT endpoint exists but has no input validation whatsoever. It accepts any garbage. Add validation.

Third pass: PUT returns 200 with an empty body instead of the updated resource. Technically it runs. It’s wrong. Fix it.

Document all of it in ANUSTIMES.md:

## Phase 1 — Self-Review
### Pass 1
Checked: all planned endpoints exist
Found: DELETE /items/:id not implemented
Fixed: implemented with proper 404 on missing item
### Pass 2
Checked: all endpoints validate input
Found: PUT /items/:id accepts any body with no validation
Fixed: validates required fields, returns 400 with field-level errors
### Pass 3
Checked: all endpoints return correct response shapes
Found: PUT returns 200 empty body instead of updated resource
Fixed: returns the updated item

Phase 2 — Deep Verification (10 Passes)

Phase 2 is not the same as Phase 1. Phase 1 checks the plan. Phase 2 checks the code itself — independent of the plan — looking for every quality and correctness issue that the plan doesn’t specify but that will absolutely bite you in production.

Ten passes, each with a specific focus:

  • Pass 1 — Incomplete implementations: TODOs left in the code, panics used as stubs, functions that return zero values with no explanation, comments that say “fix this later”
  • Pass 2 — Error handling: every error return is checked, nothing is silently discarded, callers that need to know about failures actually get told about them
  • Pass 3 — Resource lifecycle: every file, connection, channel, goroutine, and mutex has a corresponding close, cancel, or done signal. trace each one from creation to cleanup.
  • Pass 4 — Concurrency correctness: shared state accessed under locks, channels used in the right direction by the right owners, no goroutine that runs forever when the context is cancelled
  • Pass 5 — Test coverage: tests actually test behavior. not just that functions return nil. not just that an error is non-nil. that the actual thing the function is supposed to do actually happens.
  • Pass 6 — Security: no unvalidated external input, no secrets written to logs, no SQL string concatenation, no path traversal, no hardcoded credentials
  • Pass 7 — Documentation accuracy: every doc comment, every README section, every example — checked against the actual code it describes. if the code changed and the docs didn’t, fix the docs.
  • Pass 8 — Cross-component contracts: every interface between components — who owns what, who closes what, what error types are expected — is explicitly matched on both sides
  • Pass 9 — Build and test execution: actually run the linter. actually run the tests. actually compile the binary. do not assume they pass because they passed last time.
  • Pass 10 — Edge cases: empty inputs, zero values, maximum values, things that are technically valid but weird. concurrent access to shared state. what happens when a downstream dependency fails.

Pass 3 is where the double-close bug described earlier gets caught. Trace every channel: created here, written here, closed in cleanup(). Then look at the caller: also closes the channel after Stop() returns. Double close. Runtime panic. Phase 2 Pass 3 finds it before production does.
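
The fix is to make the implicit contract explicit and, where the language allows, unviolatable. One sketch (hypothetical `Manager` again): the channel is unexported, the accessor returns it receive-only, and `Stop()` owns the one and only close. A caller's `close` now fails at compile time instead of panicking at 3am.

```go
package main

import "fmt"

// Manager owns its output channel. The contract is now explicit:
// Stop closes the channel exactly once; callers only receive.
type Manager struct {
	out chan string // unexported: callers can't touch it directly
}

func NewManager() *Manager { return &Manager{out: make(chan string, 4)} }

// Out exposes the channel receive-only, so a close by the caller
// doesn't even compile.
func (m *Manager) Out() <-chan string { return m.out }

func (m *Manager) Run() {
	m.out <- "line 1"
	m.out <- "line 2"
}

func (m *Manager) Stop() {
	close(m.out) // the one and only close, owned by the Manager
}

func main() {
	m := NewManager()
	m.Run()
	m.Stop()
	for line := range m.Out() { // range exits when the owner closes
		fmt.Println(line)
	}
	// close(m.Out()) // compile error: cannot close receive-only channel
}
```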

Document every finding. If a pass finds nothing, write that too — but write something specific. “Pass 4: checked all goroutines. Worker pool uses WaitGroup correctly, all goroutines terminate on context cancellation, no leaks found.” Not “pass 4: all good.” Prove you looked.

Phase 3 — Brutal Review

This is the most important phase and the one that has the highest return on investment.

Spawn a separate AI instance — a fresh Claude with no context from the development process, no memory of the design decisions, no knowledge of why things were done the way they were, no emotional investment in the code being correct — and tell it to tear the implementation apart.

The exact framing matters. Don’t ask it to “review the code.” Tell it:

You are a ruthless code reviewer. Your job is to find every single flaw in this implementation. Every shortcut. Every place the code barely satisfies the requirement without really meeting its intent. Every error handling gap. Every assumption that isn’t validated. Every place the happy path works and anything else silently fails. Do three full passes. Find everything wrong. Do not tell me what’s good. Only tell me what’s wrong and why it’s wrong.

This isn’t theatrical. As established earlier, this framing overrides the agreeableness training and puts the model into actual critical mode. A politely-asked reviewer finds the obvious issues and softens them. A ruthlessly-instructed reviewer hunts.

The reviewer comes back with findings. Here’s what a real exchange looks like:

Reviewer pass 1:

Issue 1: fetchData() retries on failure with no backoff. It retries immediately,
ten times in a row. A momentary network hiccup triggers ten rapid sequential
requests before giving up. This hammers the upstream service under exactly the
conditions where it's already struggling.
Issue 2: Config has a Timeout field that is never used. The HTTP client is
http.DefaultClient which has no timeout. Under a hung connection this function
blocks forever.
Issue 3: The error case test asserts err != nil and nothing else. Any error —
including a completely unrelated failure mode — makes this test pass. The test
is not testing what it claims to test.

Developer counter-review:

Issue 1: Correct. Adding exponential backoff with jitter, starting at 100ms,
max 30s. Fixed.
Issue 2: Correct. The field was planned and never wired up. HTTP client now
uses the configured timeout. Fixed.
Issue 3: Partially agree. The test should assert the error type. It should NOT
assert the error message — that would make tests brittle and break every time
we improve wording. Changed to assert errors.Is(err, ErrNotFound).

The back-and-forth continues. When the reviewer comes back with pass 2, it either finds new issues or responds to the counter-review points. This continues until both sides stop finding new things.

Do not be a pushover. If the reviewer says something is wrong and you disagree, say why. The goal is not to implement every criticism — it’s to have explicitly thought through every criticism. When you push back and you’re right, you’ve validated your decision. When the reviewer is right and you fix it, you’ve caught a real bug. Between the two of you, you catch more than either alone.

Phase 4 — Smoke Test Everything Like a Paranoid Asshole

Reading code and running code are not the same activity. You can review code for three hours and completely miss a failure mode that appears in thirty seconds of running it. Phase 4 requires actually running the thing.

Start everything and make sure it doesn’t immediately shit itself. Build it. Boot every service. Verify it reaches a ready state without panicking, without ERROR lines in the startup logs, without “TODO: fix before prod” messages printing to stdout on initialization.

Enable every debug log and read them. Run with maximum verbosity. Pick one request and trace it from entry to exit through every component it touches. Does the execution flow match the architecture? Are operations happening in the order you designed? Are there log lines that say “attempting X” with no corresponding “X succeeded” or “X failed” — meaning X silently did nothing and nobody knows?

Hit every code path, including the bad ones. Don’t just check the happy path. Send malformed input. Send empty input. Send an empty list where a list is expected. Send a string of ten thousand characters where a name is expected. Send the right type with the wrong value. Send a zero where a positive integer is required. Verify the system handles each case explicitly — with a proper error response — rather than crashing or silently producing wrong output.

Check the database directly. Connect to it. Look at the tables. Count the rows. Read the values. Confirm that what was written matches what was submitted. This catches the class of bug where the function returns nil and everything looks fine but the write was silently discarded — a transaction that was never committed, a write that went to a cache that was never flushed. The API returned 200. The data isn’t there.

Read every log file after the run. Look for anything that starts with ERROR, WARN, PANIC, or FATAL that you didn’t explicitly trigger. A system that’s “working” but producing intermittent warnings in the logs is not working. It’s failing slowly and politely.

Document what you tested and what you found. “Started with --debug. Traced POST /items. Found: audit log entry is written before the DB transaction commits. If the DB write fails after the audit log write, the audit log shows a successful operation that never happened. Fixed: moved audit log write to after transaction commit.”

Phase 5 — Final Verification (10 More Passes)

Ten more passes. Code, tests, logs, documentation, configuration — all of it again.

By this phase all major issues should be fixed. Phase 5 is for residue. The error message updated in the code but not in the docs. The test helper copied from somewhere else that still has the wrong package name in its error output. The config example referencing a field that was renamed in Phase 3 and never updated in the example.

If Phase 5 is still catching architectural problems or broken behavior, stop. Go back to Phase 3. Something was not actually fixed; the later repairs masked it, they didn’t resolve it.

Append to ANUSTIMES.md. When Phase 5 is done, the file is a complete audit trail: what was planned, what was wrong with the first version, what was found in each phase, what was done about it.

Why Five Separate Phases and Not One Big Review

Each phase catches a specific class of error. The classes don’t overlap cleanly enough to merge.

Phase 1 catches errors of omission — things planned but not built. Code review can’t catch these because you review what’s there, not what’s missing.

Phase 2 catches errors of quality — things built incorrectly. Standard code review. The ten-pass structure forces specificity that a single pass skips through on fatigue.

Phase 3 catches errors of rationalization — things you know are wrong but convinced yourself were acceptable. The external reviewer has no memory of your reasoning and no motivation to be generous about it.

Phase 4 catches errors of integration — things that are individually correct but fail under operation. Invisible to code review. Only visible when running.

Phase 5 catches errors of repair — new bugs introduced while fixing the errors found in earlier phases. Every fix is a potential new fuckup.

A single comprehensive review handles Phase 2 reasonably well. It catches Phase 1 inconsistently. It mostly misses Phase 3. It structurally cannot catch Phase 4. And it generates Phase 5 errors as a side effect of fixing things. The five-phase structure exists because these failure modes are distinct and require distinct approaches.

The ANUSTIMES.md File

Every finding goes into ANUSTIMES.md in the project root. The format:

## Phase 1 — Self-Review
### Pass 1
Checked: [what you looked at]
Found: [what was wrong]
Fixed: [what you changed]
## Phase 2 — Deep Verification
### Pass 1 (Incomplete implementations)
Checked: [...]
Found: [...]
Fixed: [...]
## Phase 3 — Brutal Review
### Reviewer Pass 1
[reviewer findings verbatim]
### Developer Counter-Review Pass 1
[your responses and what you fixed]
### Reviewer Pass 2
[...]
## Phase 4 — Smoke Test
### Startup
### Endpoints
### Database
### Logs
## Phase 5 — Final Verification
### Pass 1
[...]

This file is not a formality. It’s proof that verification happened and what it found. A future developer — or a future AI agent picking up the project — can read ANUSTIMES.md and know exactly what was wrong with the initial implementation and what was done about it. It makes the gap between first draft and shipping visible.

When to Run It and When to Skip It

ANUSTIMES has overhead. Don’t run it on a three-line fix.

Run it when:

  • The work had an explicit plan with multiple steps
  • Multiple files were created or significantly changed
  • Multiple AI agents built components that have to work together
  • The output runs in production or gets used by other systems
  • A complete failure would cause actual damage — to users, to data, to production

Skip it for:

  • Single-file bug fixes
  • Documentation changes
  • Configuration tweaks
  • Anything you can fully verify by reading the diff and running one command

The question is always: if this is wrong, how bad is it? Small blast radius, skip it. Large blast radius, ANUSTIMES.

The Real Thing This Fixes

AI-assisted development broke the normal verification loop.

In traditional development, generation and verification are entangled. The developer writes a function and immediately runs it to see if it works. They feel unease about an edge case and go back. They write the test and discover the test doesn’t pass and that tells them something. The feedback loop between writing and checking is tight, continuous, and built into the process by how humans naturally work.

In AI-assisted development — especially multi-agent systems — generation and verification are structurally separated. One agent plans. Other agents build. The building happens in parallel, in isolated contexts, without any agent experiencing the whole system operating. The generation phase is fast and capable. The verification phase is absent unless you explicitly build it.

If verification is not a deliberate, structured, documented activity with defined phases and explicit outputs, it doesn’t happen. Or it happens lazily, catching the obvious problems while the subtle ones go to production.

ANUSTIMES is a structural fix for a structural problem. The aggressive language isn’t style — it’s the mechanism that makes the AI actually verify instead of going through the motions of verification. The multiple phases aren’t redundancy for its own sake — they’re the minimum set of distinct lenses needed to catch the distinct classes of fuckup that AI-assisted development reliably produces.

The generation is fast. Make sure the verification keeps up. Or ship the panic at 3am and find out the hard way why this process exists.