Harness Engineering Fully Decoded: When AI Agents Finish Writing Code, Is Your Repo Ready to Catch It Automatically?
Disclaimer: this post is machine-translated from the original Chinese article: https://ai-coding.wiselychen.com/harness-engineering-control-plane-pattern-agent-review-loop/

First, the source: what exactly is OpenAI’s Harness Engineering?
In early February, OpenAI’s engineering team published a blog post: Harness engineering: leveraging Codex in an agent-first world, written by Ryan Lopopolo.
This post describes an extreme experiment: a 3-person engineering team, 5 months, 1 million lines of code shipped, zero lines written by humans. Application logic, tests, CI config, docs, observability, internal tooling—all of it generated by Codex. They estimate roughly a 10x time savings.
A few key numbers:
- 1,500 PRs opened and merged in 5 months
- Average of 3.5 PRs per engineer per day, and as the team grew from 3 to 7 people, throughput kept rising
- A single Codex run can keep working for 6+ hours (usually while humans sleep)
- The team used to spend 20% of every Friday cleaning up “AI slop”—later, they automated that too
The core philosophy fits in four words: Humans steer. Agents execute.
And Peter Steinberger (author of OpenClaw) might be the most extreme real-world demonstration of this philosophy. Someone tallied his top five single-day commit counts:
| Date | Peter’s solo commit count |
|---|---|
| Feb 22 (Sun) | 627 |
| Feb 16 (Mon) | 490 |
| Feb 15 (Sun) | 461 |
| Feb 14 (Sat) | 447 |
| Feb 21 (Sat) | 315 |
A day has only 1,440 minutes. 627 commits—no eating, drinking, sleeping, or bathroom breaks—averages a commit every 2.3 minutes.
This is obviously not a human hand-writing code. This is the real output of Codex + Harness running at full tilt. Peter previously ran 50 Codex instances in parallel reviewing 3,000 PRs, which was already absurd. Now his daily commit volume directly proves one thing: when your repo has a sufficiently complete harness, the ceiling on Agent output isn’t the model’s capability—it’s whether your control plane can catch it.
But the most valuable thing in this blog post isn’t the numbers. It’s the entire methodology they worked out for “how to make Agents reliably work inside a repo.” They call this Harness Engineering—not the engineering of writing code, but the engineering of building frameworks, constraints, and feedback loops.
Specifically, it includes several core insights:
1. The Repository is the System of Record
They tried stuffing all instructions into one giant AGENTS.md. It failed. The reason is blunt: context is a scarce resource, “everything important” means nothing is important, and a big file goes stale in an instant.
Their final approach: treat AGENTS.md as a table of contents (around 100 lines), pointing to a structured knowledge base in docs/. All Slack discussions, architectural decisions, design principles—everything must be deposited into the repo. What the Agent can’t see, doesn’t exist.
2. Use architectural constraints to box in the Agent, not micromanage the implementation
They built a strict layered architecture: Types → Config → Repo → Service → Runtime → UI, where each layer can only depend forward, never backward. Violations get automatically blocked.
The key: these constraints are enforced with custom linters and structural tests, and the linter error messages embed the fix instructions directly. The moment the Agent makes a mistake, how to fix it is already injected into the context.
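The post doesn't include the linter source, but the shape is easy to imagine. Here's a minimal sketch of such a rule using ESLint's custom-rule API; the layer names, file-path convention, and fix message are all illustrative, not OpenAI's actual code:
```ts
import type { Rule } from 'eslint';

// Layer order from the post. Sketch assumption: a file may import from its
// own layer or earlier ones, and importing a later layer is a violation.
const LAYERS = ['types', 'config', 'repo', 'service', 'runtime', 'ui'];

const layerOf = (filePath: string): number =>
  LAYERS.findIndex((layer) => filePath.includes(`/${layer}/`));

export const noLaterLayerImports: Rule.RuleModule = {
  meta: { type: 'problem', schema: [] },
  create(context) {
    return {
      ImportDeclaration(node) {
        const from = layerOf(context.getFilename());
        const to = layerOf(String(node.source.value));
        if (from >= 0 && to > from) {
          // The message embeds the fix instruction, so the moment an
          // agent trips the rule, "how to fix it" lands in its context.
          context.report({
            node,
            message:
              `"${LAYERS[from]}" must not import from "${LAYERS[to]}". ` +
              `Fix: move the shared code into "${LAYERS[from]}" or an earlier layer, ` +
              `or invert the dependency through an interface in "types".`,
          });
        }
      },
    };
  },
};
```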
3. Entropy Management is continuous engineering
Agents will faithfully copy existing patterns in the repo—including bad ones. An anti-pattern gets replicated to multiple modules within days. Their solution: run background Codex tasks periodically to scan drift, update quality scores, open refactor PRs. Continuous cleanup like garbage collection, not saved up for a painful one-shot later.
4. Let the Agent “see” the running state of the application
They made the application startable independently from each git worktree, hooked up to Chrome DevTools Protocol, so Codex can directly take screenshots, manipulate the DOM, query logs, and check metrics. The Agent doesn’t just write code—it can run the app itself, verify itself, and fix bugs itself.
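Again, the harness code itself isn't published. As a sketch of the idea, this is roughly what an agent-runnable UI check could look like with Puppeteer (which drives Chrome over CDP); the URL, selector, and output path are placeholders:
```ts
import puppeteer from 'puppeteer';

// Sketch only: not OpenAI's actual harness code.
async function captureEvidence() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Surface the app's console output so the agent can read runtime logs.
  page.on('console', (msg) => console.log(`[app] ${msg.text()}`));

  await page.goto('http://localhost:3000');   // app started from this worktree
  await page.waitForSelector('#chat-input');  // assert the UI actually rendered
  await page.screenshot({ path: 'evidence/home.png' });

  await browser.close();
}

captureEvidence();
```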
This post generated a huge response in the community. Ryan Carson (@ryancarson), inspired by it, translated the Harness Engineering philosophy into a concrete, reproducible control-plane pattern—a complete control loop from PR open to merge.
His goal is even more blunt:
“I’ve been grinding with Codex (on Extra High) through setting up our repo for Harness Engineering. The goal is to have Codex write and review 100% of the code.”
From Harness Engineering to the Control-Plane Pattern
The OpenAI post tells you “why build a framework” and “what framework to build.” Ryan Carson tells you how to build it—down to every GitHub workflow, every line of TypeScript, every edge case.
Let me unpack it layer by layer.
Step 1: A machine-readable Contract
The first thing Ryan did wasn’t writing tests or configuring CI. It was writing a JSON contract.
```json
{
  "version": "1",
  "riskTierRules": {
    "high": [
      "app/api/legal-chat/**",
      "lib/tools/**",
      "db/schema.ts"
    ],
    "low": ["**"]
  },
  "mergePolicy": {
    "high": {
      "requiredChecks": [
        "risk-policy-gate",
        "harness-smoke",
        "Browser Evidence",
        "CI Pipeline"
      ]
    },
    "low": {
      "requiredChecks": ["risk-policy-gate", "CI Pipeline"]
    }
  }
}
```
This contract does three things:
- Defines risk tiers: which paths are high-risk (API, DB schema, tool functions), which are low-risk
- Defines merge conditions: high-risk paths must pass four checks, low-risk only two
- Eliminates ambiguity: all rules live in one place, not scattered across workflow files, scripts, and docs
Why does this matter? Because when you have 50 Codex instances running at the same time (yes, Peter Steinberger scale), you can’t rely on humans to remember “this directory changed, which checks do we need.” The rules have to be machine-readable, and there has to be exactly one copy.
One contract eliminates silent drift between scripts, workflows, and docs.
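Ryan doesn't show the code that consumes the contract, but the evaluation is mechanical. A minimal sketch, assuming a glob matcher like minimatch and the contract JSON above (the filename is hypothetical):
```ts
import { minimatch } from 'minimatch';
import contract from './risk-contract.json'; // hypothetical filename

type Tier = 'high' | 'low';

// A PR's tier is "high" if any changed file matches a high-risk pattern.
function riskTierFor(changedFiles: string[]): Tier {
  const highPatterns = contract.riskTierRules.high;
  const isHigh = changedFiles.some((file) =>
    highPatterns.some((pattern) => minimatch(file, pattern)),
  );
  return isHigh ? 'high' : 'low';
}

function requiredChecksFor(changedFiles: string[]): string[] {
  return contract.mergePolicy[riskTierFor(changedFiles)].requiredChecks;
}

// requiredChecksFor(['db/schema.ts'])
// → ['risk-policy-gate', 'harness-smoke', 'Browser Evidence', 'CI Pipeline']
```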
Step 2: Preflight Gate — Block first, run later
This is one of Ryan’s smartest designs, in my opinion.
The traditional approach: a PR opens, and CI fires everything—tests, builds, security scans, code review all start in parallel. After everything finishes, you see what passed and what didn’t.
Ryan’s approach: run the preflight gate first. Only if it passes do you start running the expensive CI.
```ts
const requiredChecks = computeRequiredChecks(changedFiles, riskTier);
await assertDocsDriftRules(changedFiles);
await assertRequiredChecksSuccessful(requiredChecks);

if (needsCodeReviewAgent(changedFiles, riskTier)) {
  await waitForCodeReviewCompletion({ headSha, timeoutMinutes: 20 });
  await assertNoActionableFindingsForHead(headSha);
}
```
The logic is simple:
- First check what files changed, determine the risk tier
- Confirm docs aren’t drifting
- If a code review agent is needed, wait for it to finish
- Only after all that passes, let test/build/security start running
What you save isn’t just CI minutes. When you open 50 PRs a day and each PR’s full CI takes 15 minutes, that’s 750 minutes of CI time per day. If 30% of PRs get blocked at preflight, you save 225 minutes a day. That’s over 100 hours a month.
But more importantly: preflight gate enforces deterministic ordering. Policy checks first, then review, then CI. This order cannot be scrambled.
Step 3: SHA Discipline — Ryan says this is the biggest practical lesson
Ryan said it himself in this section:
“This was the biggest practical lesson from real PR loops.”
Here’s the problem: the code review agent finished running on commit A and said “clean.” Then the Agent pushed a fix commit, and now HEAD is commit B. You take commit A’s review result to decide whether to merge commit B—that’s wrong.
You’re using old “clean” evidence to green-light new code. That’s the same as not reviewing at all.
Ryan’s rules:
- Review status is only valid when it matches the current PR HEAD commit
- Ignore all summary comments bound to old SHAs
- After every push/synchronize, review must be re-run
- If the most recent review run isn’t success or it times out, fail immediately
This looks like a small thing, but in high-frequency Agent push scenarios, this is the lifeline of whether your merge results can be trusted.
I hit the same problem using Claude Code. After the Agent fixed the first round of review issues and pushed new code, the fix itself introduced new issues. If I hadn’t re-run the review, that new problem would have gone straight into main.
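The rules above compress into a single predicate. A minimal sketch; the `ReviewRun` shape is hypothetical, standing in for whatever your review agent's API returns:
```ts
interface ReviewRun {
  sha: string;                           // commit the review actually ran on
  status: 'success' | 'failure' | 'pending';
}

// Evidence only counts if it is green AND bound to the current HEAD.
// Anything stale, pending, or failed fails closed.
function reviewIsValidFor(headSha: string, latest?: ReviewRun): boolean {
  return latest !== undefined
    && latest.sha === headSha       // ignore results bound to old SHAs
    && latest.status === 'success';
}
```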
Step 4: SHA Dedupe on Rerun Comments
This is a very practical engineering problem.
When multiple workflows can all trigger a review rerun, a pile of duplicate bot comments appear under the PR. Worse is the race condition—two workflows send rerun requests at the same time, the review agent runs twice, and the results overwrite each other.
Ryan’s solution is pure engineering: only one canonical workflow issues rerun requests, with dedupe by marker + SHA.
```ts
const marker = '<!-- review-agent-auto-rerun -->';
const trigger = `sha:${headSha}`;

const alreadyRequested = comments.some(
  (c) => c.body.includes(marker) && c.body.includes(trigger),
);

if (!alreadyRequested) {
  postComment(`${marker}\n@review-agent please re-review\n${trigger}`);
}
```
HTML comment as the marker (users can’t see it), SHA as the dedupe key. The same HEAD never gets rerun twice.
This isn’t the kind of thing you’ll see in any architecture document. But the moment you actually run it, you’ll hit it.
Step 5: Automated Remediation Loop — Let the Agent fix it itself
So far we’ve solved “how to block” and “how to review.” But there’s still a question: after a problem is found, who fixes it?
Traditional answer: humans.
Ryan’s answer: let the coding agent read the review context, patch itself, run local validation itself, push the fix commit to the same PR branch itself.
Then the PR’s synchronize event triggers the normal rerun flow. A perfect closed loop.
But Ryan added three guardrails:
- Pin model + effort: fix the model version and effort level, guarantee reproducibility
- Skip stale comments: ignore old comments that don’t match the current HEAD
- Never bypass policy gates: the remediation agent also has to pass every gate—no special privileges
The third point is the key. If you let the remediation agent bypass the policy gate, you’ve opened a hole in your control loop. The Agent can push any code under the excuse of “I’m fixing a bug.”
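Ryan doesn't publish the remediation workflow itself, so here's a sketch of how the three guardrails could look in code. The `runCodingAgent` wrapper, its option names, and the model ID are hypothetical:
```ts
interface Finding { sha: string; body: string; }

// Hypothetical wrappers: swap in your coding agent and prompt builder.
declare function runCodingAgent(opts: {
  model: string; effort: string; prompt: string;
}): Promise<void>;
declare function renderPrompt(findings: Finding[]): string;

async function remediate(headSha: string, findings: Finding[]): Promise<void> {
  // Guardrail 1: pin model + effort so fix behavior is reproducible.
  const model = 'pinned-model-id'; // placeholder, not a real model ID
  const effort = 'high';

  // Guardrail 2: skip stale comments, acting only on findings bound to HEAD.
  const current = findings.filter((f) => f.sha === headSha);
  if (current.length === 0) return;

  await runCodingAgent({ model, effort, prompt: renderPrompt(current) });

  // Guardrail 3 is structural, not code: the fix commit is pushed to the
  // same branch, so the synchronize event sends it back through every gate.
}
```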
Step 6: Auto Resolve of Bot Threads
The review bot opens a pile of conversation threads. If the bot’s issue has already been fixed in a new commit, these threads should be auto-resolved; otherwise GitHub’s required conversation resolution will block the merge.
But Ryan added one important condition:
- Only auto-resolve threads where every comment is from a bot
- Never auto-resolve threads where a human has participated
Why? Because a human comment represents human judgment and intent. Auto-resolving on behalf of a human is making the decision for them.
After resolving, run the policy gate again to ensure the conversation resolution state is current.
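Once you have the thread data, the human-participation rule reduces to one predicate. A sketch, with a thread shape loosely modeled on GitHub's GraphQL review threads (field names simplified):
```ts
interface ThreadComment { authorLogin: string; authorIsBot: boolean; }
interface ReviewThread { isResolved: boolean; comments: ThreadComment[]; }

// Resolve only threads where *every* comment is from a bot.
// One human reply anywhere means a human decides, not the automation.
function canAutoResolve(thread: ReviewThread): boolean {
  return !thread.isResolved
    && thread.comments.length > 0
    && thread.comments.every((c) => c.authorIsBot);
}
```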
Step 7: Browser Evidence — Screenshots aren’t enough, it has to be first-class proof
Reviewing UI changes is the easiest place to cut corners. A lot of teams just “paste a screenshot into the PR.”
Ryan’s bar is higher: browser evidence must be a first-class artifact in CI, with manifest and assertions.
```bash
npm run harness:ui:capture-browser-evidence
npm run harness:ui:verify-browser-evidence
```
The verification covers:
- All required flows were exercised
- The correct entrypoint was used
- The login flow used the correct account identity
- The artifact is fresh and valid
A screenshot is a snapshot. Evidence is verifiable proof. The difference shows up especially clearly when you’re doing compliance audits.
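The post doesn't show the manifest format, but to make "verifiable proof" concrete, here's a sketch of what the verify step could check. The manifest shape, required flows, and freshness window are all assumptions of mine:
```ts
// Hypothetical manifest written by the capture step.
interface EvidenceManifest {
  entrypoint: string;                   // URL the flows started from
  account: string;                      // identity used to log in
  flows: { name: string; screenshots: string[] }[];
  capturedAt: number;                   // epoch ms
}

const REQUIRED_FLOWS = ['login', 'chat', 'export'];  // illustrative
const MAX_AGE_MS = 24 * 60 * 60 * 1000;              // evidence must be fresh

function verifyEvidence(m: EvidenceManifest, expectedEntry: string,
                        expectedAccount: string): string[] {
  const errors: string[] = [];
  const covered = new Set(m.flows.map((f) => f.name));
  for (const flow of REQUIRED_FLOWS) {
    if (!covered.has(flow)) errors.push(`missing flow: ${flow}`);
  }
  if (m.entrypoint !== expectedEntry) errors.push('wrong entrypoint');
  if (m.account !== expectedAccount) errors.push('wrong account identity');
  if (Date.now() - m.capturedAt > MAX_AGE_MS) errors.push('stale evidence');
  return errors; // empty array means the "Browser Evidence" check can go green
}
```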
Step 8: Harness Gap Loop — Memorize incidents
The last one, and in my opinion the easiest to overlook:
```text
production regression → harness gap issue → case added → SLA tracked
```
Every production incident isn’t just fixed and forgotten. Instead:
- Open a harness gap issue
- Turn the reproduction condition into a test case
- Add it to the harness
- Track the SLA (how long to fix, how long to add the case)
This ensures the fix doesn’t become a one-off patch. The same problem doesn’t happen a second time, and long-term test coverage is growing.
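If you want to track this mechanically rather than in someone's head, the record can be tiny. A sketch; the field names are mine, not from the post:
```ts
// Minimal shape for tracking a harness gap; fields are illustrative.
interface HarnessGap {
  incident: string;      // link or ID of the production regression
  issue: string;         // the harness-gap issue opened for it
  reproCase?: string;    // test file added to the harness, once it exists
  openedAt: number;      // epoch ms; the SLA clock starts here
  caseAddedAt?: number;  // SLA metric: how long until the case landed
}

const slaDays = (gap: HarnessGap): number | undefined =>
  gap.caseAddedAt === undefined
    ? undefined
    : (gap.caseAddedAt - gap.openedAt) / 86_400_000;
```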
The Complete Control-Plane Pattern
String the eight steps together and you have this complete control plane:
| Step | Function | Determinism |
|---|---|---|
| 1. Risk Contract | Define rules, eliminate ambiguity | Fully deterministic |
| 2. Preflight Gate | Block before running, save CI cost | Fully deterministic |
| 3. SHA Discipline | Only trust evidence from current HEAD | Fully deterministic |
| 4. Rerun Dedupe | One canonical writer, no duplicates | Fully deterministic |
| 5. Remediation Loop | Agent fixes itself, no gate bypass | Semi-deterministic (model is non-deterministic) |
| 6. Bot Thread Resolve | Auto-clean bot threads, leave human ones alone | Fully deterministic |
| 7. Browser Evidence | UI evidence as CI artifact | Fully deterministic |
| 8. Harness Gap Loop | Incidents turn into test cases | Fully deterministic |
Notice the pattern? Seven of the eight steps are fully deterministic. Only the remediation loop has an LLM in it—everything else is pass/fail logic a machine can evaluate.
This is completely aligned with the four-layer defense view I wrote before: use deterministic tools to box in non-deterministic AI.
How does this relate to the four-layer defense from last time?
Last time, the four-layer defense (Test → Lint → CI Gate → LLM Judge) was vertical—each layer digs deeper into checking a single PR.
Ryan’s control-plane pattern is horizontal—the complete lifecycle from PR open to merge.
The two aren’t in conflict—they’re complementary. The four-layer defense is one component inside the control plane. Specifically, it’s what runs inside Step 2 Preflight Gate and Step 5 Remediation Loop.
Putting them together:
```text
PR opens
  │
  ├── Step 1: Risk Contract → determine risk tier
  ├── Step 2: Preflight Gate
  │     ├── Layer 1: Test
  │     ├── Layer 2: Lint + Type Check
  │     ├── Layer 3: CI Gate (coverage, security)
  │     └── Layer 4: LLM Judge (code review agent)
  ├── Step 3: SHA Discipline → confirm evidence matches current HEAD
  ├── Step 4: Rerun Dedupe → avoid duplicate reruns
  ├── Step 5: Remediation Loop → Agent self-fix → back to Step 2
  ├── Step 6: Bot Thread Resolve → clean up resolved threads
  ├── Step 7: Browser Evidence → UI evidence verification
  ├── Step 8: Harness Gap Loop → incident turns into test case
  │
  └── Merge
```
This is a complete repo architecture that lets Agents write + review + fix code.
General pattern vs. specific implementation
Ryan specifically emphasizes this is a general pattern, not a specific toolchain.
| General concept | Ryan’s specific implementation |
|---|---|
| Code Review Agent | Greptile |
| Remediation Agent | Codex Action |
| Canonical Rerun Workflow | greptile-rerun.yml |
| Stale Thread Cleanup | greptile-auto-resolve-threads.yml |
| Preflight Policy | risk-policy-gate.yml |
You can swap Greptile for CodeRabbit, CodeQL, or a self-built LLM review. You can swap Codex for Claude Code, Cursor, Devin. The semantics of the control-plane don’t change—only the integration point does.
This is why I think this post deserves its own write-up—it’s not pitching a tool, it’s defining an architectural pattern.
Against my own hands-on experience
In “Make CI/CD Great Again” I shared my experience using three Agents to do multi-role code review. Looking back now through Ryan’s framework, what was I missing?
- No Risk Contract. My review treated every file the same. But realistically, `db/schema.ts` and `README.md` are 10x apart in risk tier.
- No SHA Discipline. After my Agent fixed an issue and pushed a new commit, I didn’t enforce a review re-run. I was relying on human eyes to decide “it’s fixed.”
- No Remediation Loop. When the review found issues, I either fixed them myself or re-prompted the Agent. There was no automated fix → re-review loop.
- No Harness Gap Loop. When something went wrong in prod, we fixed it. No systematic conversion into a test case.
Putting a rough number on it: I had maybe 40% of the picture, namely the four vertical layers (Test, Lint, CI Gate, LLM Judge). But four of Ryan’s control-plane steps (contract, SHA discipline, rerun dedupe, remediation loop) were completely missing.
This is the gap between “it runs” and “it scales.”
Tool recommendations: what do you need to build a Control-Plane?
The tech stack involved in the whole Harness Engineering setup breaks into four categories. The key point: this is a general pattern, and each category can be swapped for whatever alternative you know.
1. AI models and Agents
This is the “engine” producing and fixing code:
- OpenAI Codex: Ryan and OpenAI’s team’s main tool, supports long autonomous runs (6+ hours per run)
- Claude Code: what I use myself, good for scenarios that need deep codebase context understanding
- Other options: Cursor, Devin, Windsurf, etc.—anything that can plug into a PR workflow works
2. Automated code review
The “LLM Judge” role in the Control-Plane:
- Greptile: the code review agent used in Ryan’s concrete implementation, understands codebase semantics
- CodeRabbit: another mainstream option, which I used in “Make CI/CD Great Again”
- CodeQL: GitHub’s native static analysis, leaning toward security vulnerability detection
- Self-built LLM Review: wrap GPT-4 / Claude with a review prompt—maximum flexibility but also highest maintenance cost
3. CI/CD and infrastructure control plane
This is the actual “safety net” that catches AI output, relying mostly on the GitHub ecosystem and automation scripts:
- GitHub Actions: the execution environment for the whole control plane. Specific workflows mentioned in the post:
  - `risk-policy-gate.yml` (preflight gate)
  - `greptile-rerun.yml` (dedupe for repeated rerun triggers)
  - `greptile-auto-resolve-threads.yml` (auto-cleanup of bot comments)
- JSON Contract: a machine-readable Risk Contract defining which paths (like `db/schema.ts`) need stricter defenses
- TypeScript: custom preflight logic (Preflight Gate) and dedupe logic (marker + SHA dedupe)
- Git primitives: deep reliance on PR `synchronize` events, HEAD commit SHA tracking, and hidden HTML comments (`<!-- marker -->`) for state management
4. Application runtime and UI verification tools
To solve the “Agent can’t see the running state” problem and let the AI verify itself:
- Chrome DevTools Protocol (CDP): OpenAI’s team hooked the application up to CDP, so Codex can directly manipulate DOM, query logs, check metrics, and take screenshots
- npm scripts: used to produce Browser Evidence, e.g. `npm run harness:ui:capture-browser-evidence` and `npm run harness:ui:verify-browser-evidence`
- Git worktrees: let each Agent run in its own worktree, starting the app and running tests without interfering with the others
Minimum viable tool combo
If you’re a 2-3 person small team, you don’t need everything. My suggestion is start with these three:
| Priority | Tool | Cost | Corresponding Step |
|---|---|---|---|
| P0 | JSON Risk Contract + GitHub Actions | Free | Step 1, 2 |
| P0 | Git SHA tracking scripts | Free | Step 3, 4 |
| P1 | CodeRabbit or Greptile | $19-49/month | Step 2 Layer 4 |
| P2 | Codex or Claude Code | Usage-based | Step 5 |
The first two are pure discipline, zero cost, you can do them right now. The latter two depend on team size and budget.
Honestly
Ryan’s architecture is very complete, but I have a few practical questions I haven’t figured out:
What I’m more confident about:
- Risk Contract is mandatory. No matter how small your repo is, writing down the risk tiers and merge conditions has extremely high ROI. A single JSON file eliminates endless arguments about “does this PR need a review.”
- SHA Discipline is non-negotiable. I previously cut this corner out of laziness, and the result: a bug introduced by an Agent’s fix commit went straight into main. Painful lesson.
- Preflight Gate actually saves money. I did a rough calculation: if our repo had had a preflight gate, we could have saved about 30% of last month’s CI bill.
What I’m less sure about:
- Convergence of the Remediation Loop. Agent fixes → review finds more problems → Agent fixes again → review finds more problems… when does this loop stop? Ryan doesn’t mention a max retry or circuit breaker. In my experience, the LLM’s fix itself introduces new problems (I wrote before about “after the fix ran through, round two surfaced two more high-severity issues”). Infinite loops are a real risk; see the sketch after this list.
- Cost-benefit for small teams. Ryan’s at OpenAI—resources are abundant. But for a 2-3 person startup, building this whole thing is a non-trivial monthly spend just on Greptile + Codex + CI. Do you need all eight steps? Or can you pick priorities? My suggestion is to start with 1 (contract), 2 (preflight), and 3 (SHA discipline). Those three are almost zero cost, pure discipline.
- Generalizability of Browser Evidence. If your product isn’t a Web UI but an API or CLI, what does browser evidence become? API response assertions? CLI output snapshots? Ryan doesn’t expand on this.
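On the convergence question specifically, there is a boring classical mitigation: cap the loop and escalate to a human. A sketch of the circuit breaker I’d bolt onto Step 5; this is my addition, not part of Ryan’s pattern, and the helper functions are hypothetical:
```ts
// Hypothetical wrappers around your review agent, coding agent, and GitHub API.
declare function runReview(pr: number): Promise<unknown[]>;
declare function runRemediationAgent(pr: number, findings: unknown[]): Promise<void>;
declare function labelAndAssignHuman(pr: number, label: string): Promise<void>;

const MAX_REMEDIATION_ROUNDS = 3; // my number, tune per repo

async function remediationLoop(pr: number): Promise<'clean' | 'needs-human'> {
  for (let round = 1; round <= MAX_REMEDIATION_ROUNDS; round++) {
    const findings = await runReview(pr);       // Steps 2-4: review at current HEAD
    if (findings.length === 0) return 'clean';  // converged: normal merge path

    await runRemediationAgent(pr, findings);    // Step 5: agent self-fix, re-gated
  }
  // The loop isn't converging: escalate instead of burning tokens forever.
  await labelAndAssignHuman(pr, 'remediation-loop-stuck');
  return 'needs-human';
}
```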
Not a conclusion
Last time I said “Make CI/CD Great Again” isn’t a slogan. Ryan Carson used his repo to prove the next line:
CI/CD doesn’t just need to be Great Again—it needs to become a complete control plane.
From risk contract to harness gap loop, from preflight gate to remediation loop—this architecture lets the Agent not just “write code” but “be safely caught by the repo.”
How complete your repo’s control plane is decides how fast you can let the Agent run.
Letting the Agent run in a repo with no SHA discipline is like letting a car with no ABS drive 200 km/h—not that it can’t, but sooner or later it’s going to crash.
Further reading:
- When Code Volume Explodes 10x, Who Actually Does the Review? Make CI/CD Great Again
- After 630K Lines in Three Months: In the AI Coding Era, What Is an Engineer’s Real Value?
- Half-Year Review of AI Coding: Development Didn’t Get Faster, We Just Moved the Bottleneck from Writing Code to QA and Requirements Gathering
- Security Risks of AI Coding Tools: When Prompt Injection Meets RCE