Harness Engineering Fully Decoded: When AI Agents Finish Writing Code, Is Your Repo Ready to Catch It Automatically?
Disclaimer: this post is machine-translated from the original Chinese article: https://ai-coding.wiselychen.com/harness-engineering-control-plane-pattern-agent-review-loop/

First, the source: what exactly is OpenAI’s Harness Engineering?
In early February, OpenAI’s engineering team published a blog post: Harness engineering: leveraging Codex in an agent-first world, written by Ryan Lopopolo.
This post describes an extreme experiment: a 3-person engineering team, 5 months, 1 million lines of code shipped, zero lines written by humans. Application logic, tests, CI config, docs, observability, internal tooling—all of it generated by Codex. They estimate roughly a 10x time savings.
A few key numbers:
- 1,500 PRs opened and merged in 5 months
- Average of 3.5 PRs per engineer per day, and as the team grew from 3 to 7 people, throughput kept rising
- A single Codex run can keep working for 6+ hours (usually while humans sleep)
- The team used to spend 20% of every Friday cleaning up “AI slop”—later, they automated that too
The core philosophy fits in four words: Humans steer. Agents execute.
And Peter Steinberger (author of OpenClaw) might be the most extreme real-world demonstration of this philosophy. Someone tallied his top five single-day commit counts:
| Date | Peter’s solo commit count |
|---|---|
| Feb 22 (Sun) | 627 |
| Feb 16 (Mon) | 490 |
| Feb 15 (Sun) | 461 |
| Feb 14 (Sat) | 447 |
| Feb 21 (Sat) | 315 |
A day has only 1,440 minutes. 627 commits—no eating, drinking, sleeping, or bathroom breaks—averages a commit every 2.3 minutes.
This is obviously not a human hand-writing code. This is the real output of Codex + Harness running at full tilt. Peter previously ran 50 Codex instances in parallel reviewing 3,000 PRs, which was already absurd. Now his daily commit volume directly proves one thing: when your repo has a sufficiently complete harness, the ceiling on Agent output isn’t the model’s capability—it’s whether your control plane can catch it.
But the most valuable thing in this blog post isn’t the numbers. It’s the entire methodology they worked out for “how to make Agents reliably work inside a repo.” They call this Harness Engineering—not the engineering of writing code, but the engineering of building frameworks, constraints, and feedback loops.
Specifically, it includes several core insights:
1. The Repository is the System of Record
They tried stuffing all instructions into one giant AGENTS.md. It failed. The reason is blunt: context is a scarce resource, “everything important” means nothing is important, and a big file goes stale in an instant.
Their final approach: treat AGENTS.md as a table of contents (around 100 lines), pointing to a structured knowledge base in docs/. All Slack discussions, architectural decisions, design principles—everything must be deposited into the repo. What the Agent can’t see, doesn’t exist.
2. Use architectural constraints to box in the Agent, not micromanage the implementation
They built a strict layered architecture: Types → Config → Repo → Service → Runtime → UI, where each layer can only depend forward, never backward. Violations get automatically blocked.
The key: these constraints are enforced with custom linters and structural tests, and the linter error messages embed the fix instructions directly. The moment the Agent makes a mistake, how to fix it is already injected into the context.
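The post doesn't include the linter source, but the shape is easy to imagine. Here's a minimal sketch of such a rule using ESLint's custom-rule API; the layer names, file-path convention, and fix message are all illustrative, not OpenAI's actual code:
```ts
import type { Rule } from 'eslint';

// Layer order from the post. Sketch assumption: a file may import from its
// own layer or earlier ones, and importing a later layer is a violation.
const LAYERS = ['types', 'config', 'repo', 'service', 'runtime', 'ui'];

const layerOf = (filePath: string): number =>
  LAYERS.findIndex((layer) => filePath.includes(`/${layer}/`));

export const noLaterLayerImports: Rule.RuleModule = {
  meta: { type: 'problem', schema: [] },
  create(context) {
    return {
      ImportDeclaration(node) {
        const from = layerOf(context.getFilename());
        const to = layerOf(String(node.source.value));
        if (from >= 0 && to > from) {
          // The message embeds the fix instruction, so the moment an
          // agent trips the rule, "how to fix it" lands in its context.
          context.report({
            node,
            message:
              `"${LAYERS[from]}" must not import from "${LAYERS[to]}". ` +
              `Fix: move the shared code into "${LAYERS[from]}" or an earlier layer, ` +
              `or invert the dependency through an interface in "types".`,
          });
        }
      },
    };
  },
};
```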
3. Entropy Management is continuous engineering
Agents will faithfully copy existing patterns in the repo—including bad ones. An anti-pattern gets replicated to multiple modules within days. Their solution: run background Codex tasks periodically to scan drift, update quality scores, open refactor PRs. Continuous cleanup like garbage collection, not saved up for a painful one-shot later.
4. Let the Agent “see” the running state of the application
They made the application startable independently from each git worktree, hooked up to Chrome DevTools Protocol, so Codex can directly take screenshots, manipulate the DOM, query logs, and check metrics. The Agent doesn’t just write code—it can run the app itself, verify itself, and fix bugs itself.
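Again, the harness code itself isn't published. As a sketch of the idea, this is roughly what an agent-runnable UI check could look like with Puppeteer (which drives Chrome over CDP); the URL, selector, and output path are placeholders:
```ts
import puppeteer from 'puppeteer';

// Sketch only: not OpenAI's actual harness code.
async function captureEvidence() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Surface the app's console output so the agent can read runtime logs.
  page.on('console', (msg) => console.log(`[app] ${msg.text()}`));

  await page.goto('http://localhost:3000');   // app started from this worktree
  await page.waitForSelector('#chat-input');  // assert the UI actually rendered
  await page.screenshot({ path: 'evidence/home.png' });

  await browser.close();
}

captureEvidence();
```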
This post generated a huge response in the community. Ryan Carson (@ryancarson), inspired by it, translated the Harness Engineering philosophy into a concrete, reproducible control-plane pattern—a complete control loop from PR open to merge.
His goal is even more blunt:
“I’ve been grinding with Codex (on Extra High) through setting up our repo for Harness Engineering. The goal is to have Codex write and review 100% of the code.”
From Harness Engineering to the Control-Plane Pattern
The OpenAI post tells you “why build a framework” and “what framework to build.” Ryan Carson tells you how to build it—down to every GitHub workflow, every line of TypeScript, every edge case.
Let me unpack it layer by layer.
Step 1: A machine-readable Contract
The first thing Ryan did wasn’t writing tests or configuring CI. It was writing a JSON contract.
```json
{
  "version": "1",
  "riskTierRules": {
    "high": [
      "app/api/legal-chat/**",
      "lib/tools/**",
      "db/schema.ts"
    ],
    "low": ["**"]
  },
  "mergePolicy": {
    "high": {
      "requiredChecks": [
        "risk-policy-gate",
        "harness-smoke",
        "Browser Evidence",
        "CI Pipeline"
      ]
    },
    "low": {
      "requiredChecks": ["risk-policy-gate", "CI Pipeline"]
    }
  }
}
```
This contract does three things:
- Defines risk tiers: which paths are high-risk (API, DB schema, tool functions), which are low-risk
- Defines merge conditions: high-risk paths must pass four checks, low-risk only two
- Eliminates ambiguity: all rules live in one place, not scattered across workflow files, scripts, and docs
Why does this matter? Because when you have 50 Codex instances running at the same time (yes, Peter Steinberger scale), you can’t rely on humans to remember “this directory changed, which checks do we need.” The rules have to be machine-readable, and there has to be exactly one copy.
One contract eliminates silent drift between scripts, workflows, and docs.
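Ryan doesn't show the code that consumes the contract, but the evaluation is mechanical. A minimal sketch, assuming a glob matcher like minimatch and the contract JSON above (the filename is hypothetical):
```ts
import { minimatch } from 'minimatch';
import contract from './risk-contract.json'; // hypothetical filename

type Tier = 'high' | 'low';

// A PR's tier is "high" if any changed file matches a high-risk pattern.
function riskTierFor(changedFiles: string[]): Tier {
  const highPatterns = contract.riskTierRules.high;
  const isHigh = changedFiles.some((file) =>
    highPatterns.some((pattern) => minimatch(file, pattern)),
  );
  return isHigh ? 'high' : 'low';
}

function requiredChecksFor(changedFiles: string[]): string[] {
  return contract.mergePolicy[riskTierFor(changedFiles)].requiredChecks;
}

// requiredChecksFor(['db/schema.ts'])
// → ['risk-policy-gate', 'harness-smoke', 'Browser Evidence', 'CI Pipeline']
```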
Step 2: Preflight Gate — Block first, run later
This is one of Ryan’s smartest designs, in my opinion.
The traditional approach: a PR opens, and CI fires everything—tests, builds, security scans, code review all start in parallel. After everything finishes, you see what passed and what didn’t.
Ryan’s approach: run the preflight gate first. Only if it passes do you start running the expensive CI.
```ts
const requiredChecks = computeRequiredChecks(changedFiles, riskTier);
await assertDocsDriftRules(changedFiles);
await assertRequiredChecksSuccessful(requiredChecks);

if (needsCodeReviewAgent(changedFiles, riskTier)) {
  await waitForCodeReviewCompletion({ headSha, timeoutMinutes: 20 });
  await assertNoActionableFindingsForHead(headSha);
}
```
The logic is simple:
- First check what files changed, determine the risk tier
- Confirm docs aren’t drifting
- If a code review agent is needed, wait for it to finish
- Only after all that passes, let test/build/security start running
What you save isn’t just CI minutes. When you open 50 PRs a day and each PR’s full CI takes 15 minutes, that’s 750 minutes of CI time per day. If 30% of PRs get blocked at preflight, you save 225 minutes a day. That’s over 100 hours a month.
But more importantly: preflight gate enforces deterministic ordering. Policy checks first, then review, then CI. This order cannot be scrambled.
Step 3: SHA Discipline — Ryan says this is the biggest practical lesson
Ryan said it himself in this section:
“This was the biggest practical lesson from real PR loops.”
Here’s the problem: the code review agent finished running on commit A and said “clean.” Then the Agent pushed a fix commit, and now HEAD is commit B. You take commit A’s review result to decide whether to merge commit B—that’s wrong.
You’re using old “clean” evidence to green-light new code. That’s the same as not reviewing at all.
Ryan’s rules:
- Review status is only valid when it matches the current PR HEAD commit
- Ignore all summary comments bound to old SHAs
- After every push/synchronize, review must be re-run
- If the most recent review run isn’t success or it times out, fail immediately
This looks like a small thing, but in high-frequency Agent push scenarios, this is the lifeline of whether your merge results can be trusted.
I hit the same problem using Claude Code. After the Agent fixed the first round of review issues and pushed new code, the fix itself introduced new issues. If I hadn’t re-run the review, that new problem would have gone straight into main.
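The rules above compress into a single predicate. A minimal sketch; the `ReviewRun` shape is hypothetical, standing in for whatever your review agent's API returns:
```ts
interface ReviewRun {
  sha: string;                           // commit the review actually ran on
  status: 'success' | 'failure' | 'pending';
}

// Evidence only counts if it is green AND bound to the current HEAD.
// Anything stale, pending, or failed fails closed.
function reviewIsValidFor(headSha: string, latest?: ReviewRun): boolean {
  return latest !== undefined
    && latest.sha === headSha       // ignore results bound to old SHAs
    && latest.status === 'success';
}
```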
Step 4: SHA Dedupe on Rerun Comments
This is a very practical engineering problem.
When multiple workflows can all trigger a review rerun, a pile of duplicate bot comments appear under the PR. Worse is the race condition—two workflows send rerun requests at the same time, the review agent runs twice, and the results overwrite each other.
Ryan’s solution is pure engineering: only one canonical workflow issues rerun requests, with dedupe by marker + SHA.
```ts
const marker = '<!-- review-agent-auto-rerun -->';
const trigger = `sha:${headSha}`;

const alreadyRequested = comments.some(
  (c) => c.body.includes(marker) && c.body.includes(trigger),
);

if (!alreadyRequested) {
  postComment(`${marker}\n@review-agent please re-review\n${trigger}`);
}
```
HTML comment as the marker (users can’t see it), SHA as the dedupe key. The same HEAD never gets rerun twice.
This isn’t the kind of thing you’ll see in any architecture document. But the moment you actually run it, you’ll hit it.
Step 5: Automated Remediation Loop — Let the Agent fix it itself
So far we’ve solved “how to block” and “how to review.” But there’s still a question: after a problem is found, who fixes it?
Traditional answer: humans.
Ryan’s answer: let the coding agent read the review context, patch itself, run local validation itself, push the fix commit to the same PR branch itself.
Then the PR’s synchronize event triggers the normal rerun flow. A perfect closed loop.
But Ryan added three guardrails:
- Pin model + effort: fix the model version and effort level, guarantee reproducibility
- Skip stale comments: ignore old comments that don’t match the current HEAD
- Never bypass policy gates: the remediation agent also has to pass every gate—no special privileges
The third point is the key. If you let the remediation agent bypass the policy gate, you’ve opened a hole in your control loop. The Agent can push any code under the excuse of “I’m fixing a bug.”
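Ryan doesn't publish the remediation workflow itself, so here's a sketch of how the three guardrails could look in code. The `runCodingAgent` wrapper, its option names, and the model ID are hypothetical:
```ts
interface Finding { sha: string; body: string; }

// Hypothetical wrappers: swap in your coding agent and prompt builder.
declare function runCodingAgent(opts: {
  model: string; effort: string; prompt: string;
}): Promise<void>;
declare function renderPrompt(findings: Finding[]): string;

async function remediate(headSha: string, findings: Finding[]): Promise<void> {
  // Guardrail 1: pin model + effort so fix behavior is reproducible.
  const model = 'pinned-model-id'; // placeholder, not a real model ID
  const effort = 'high';

  // Guardrail 2: skip stale comments, acting only on findings bound to HEAD.
  const current = findings.filter((f) => f.sha === headSha);
  if (current.length === 0) return;

  await runCodingAgent({ model, effort, prompt: renderPrompt(current) });

  // Guardrail 3 is structural, not code: the fix commit is pushed to the
  // same branch, so the synchronize event sends it back through every gate.
}
```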
Step 6: Auto Resolve of Bot Threads
The review bot opens a pile of conversation threads. If the bot’s issue has already been fixed in a new commit, these threads should be auto-resolved; otherwise GitHub’s required conversation resolution will block the merge.
But Ryan added one important condition:
- Only auto-resolve threads where every comment is from a bot
- Never auto-resolve threads where a human has participated
Why? Because a human comment represents human judgment and intent. Auto-resolving on behalf of a human is making the decision for them.
After resolving, run the policy gate again to ensure the conversation resolution state is current.
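Once you have the thread data, the human-participation rule reduces to one predicate. A sketch, with a thread shape loosely modeled on GitHub's GraphQL review threads (field names simplified):
```ts
interface ThreadComment { authorLogin: string; authorIsBot: boolean; }
interface ReviewThread { isResolved: boolean; comments: ThreadComment[]; }

// Resolve only threads where *every* comment is from a bot.
// One human reply anywhere means a human decides, not the automation.
function canAutoResolve(thread: ReviewThread): boolean {
  return !thread.isResolved
    && thread.comments.length > 0
    && thread.comments.every((c) => c.authorIsBot);
}
```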
Step 7: Browser Evidence — Screenshots aren’t enough, it has to be first-class proof
Reviewing UI changes is the easiest place to cut corners. A lot of teams just “paste a screenshot into the PR.”
Ryan’s bar is higher: browser evidence must be a first-class artifact in CI, with manifest and assertions.
```bash
npm run harness:ui:capture-browser-evidence
npm run harness:ui:verify-browser-evidence
```
The verification covers:
- All required flows were exercised
- The correct entrypoint was used
- The login flow used the correct account identity
- The artifact is fresh and valid
A screenshot is a snapshot. Evidence is verifiable proof. The difference shows up especially clearly when you’re doing compliance audits.
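The post doesn't show the manifest format, but to make "verifiable proof" concrete, here's a sketch of what the verify step could check. The manifest shape, required flows, and freshness window are all assumptions of mine:
```ts
// Hypothetical manifest written by the capture step.
interface EvidenceManifest {
  entrypoint: string;                   // URL the flows started from
  account: string;                      // identity used to log in
  flows: { name: string; screenshots: string[] }[];
  capturedAt: number;                   // epoch ms
}

const REQUIRED_FLOWS = ['login', 'chat', 'export'];  // illustrative
const MAX_AGE_MS = 24 * 60 * 60 * 1000;              // evidence must be fresh

function verifyEvidence(m: EvidenceManifest, expectedEntry: string,
                        expectedAccount: string): string[] {
  const errors: string[] = [];
  const covered = new Set(m.flows.map((f) => f.name));
  for (const flow of REQUIRED_FLOWS) {
    if (!covered.has(flow)) errors.push(`missing flow: ${flow}`);
  }
  if (m.entrypoint !== expectedEntry) errors.push('wrong entrypoint');
  if (m.account !== expectedAccount) errors.push('wrong account identity');
  if (Date.now() - m.capturedAt > MAX_AGE_MS) errors.push('stale evidence');
  return errors; // empty array means the "Browser Evidence" check can go green
}
```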
Step 8: Harness Gap Loop — Memorize incidents
The last one, and in my opinion the easiest to overlook:
```text
production regression → harness gap issue → case added → SLA tracked
```
Every production incident isn’t just fixed and forgotten. Instead:
- Open a harness gap issue
- Turn the reproduction condition into a test case
- Add it to the harness
- Track the SLA (how long to fix, how long to add the case)
This ensures the fix doesn’t become a one-off patch. The same problem doesn’t happen a second time, and long-term test coverage is growing.
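If you want to track this mechanically rather than in someone's head, the record can be tiny. A sketch; the field names are mine, not from the post:
```ts
// Minimal shape for tracking a harness gap; fields are illustrative.
interface HarnessGap {
  incident: string;      // link or ID of the production regression
  issue: string;         // the harness-gap issue opened for it
  reproCase?: string;    // test file added to the harness, once it exists
  openedAt: number;      // epoch ms; the SLA clock starts here
  caseAddedAt?: number;  // SLA metric: how long until the case landed
}

const slaDays = (gap: HarnessGap): number | undefined =>
  gap.caseAddedAt === undefined
    ? undefined
    : (gap.caseAddedAt - gap.openedAt) / 86_400_000;
```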
The Complete Control-Plane Pattern
String the eight steps together and you have this complete control plane:
| Step | Function | Determinism |
|---|---|---|
| 1. Risk Contract | Define rules, eliminate ambiguity | Fully deterministic |
| 2. Preflight Gate | Block before running, save CI cost | Fully deterministic |
| 3. SHA Discipline | Only trust evidence from current HEAD | Fully deterministic |
| 4. Rerun Dedupe | One canonical writer, no duplicates | Fully deterministic |
| 5. Remediation Loop | Agent fixes itself, no gate bypass | Semi-deterministic (model is non-deterministic) |
| 6. Bot Thread Resolve | Auto-clean bot threads, leave human ones alone | Fully deterministic |
| 7. Browser Evidence | UI evidence as CI artifact | Fully deterministic |
| 8. Harness Gap Loop | Incidents turn into test cases | Fully deterministic |
Notice the pattern? Seven of the eight steps are fully deterministic. Only the remediation loop has an LLM in it—everything else is pass/fail logic a machine can evaluate.
This is completely aligned with the four-layer defense view I wrote before: use deterministic tools to box in non-deterministic AI.
How does this relate to the four-layer defense from last time?
Last time, the four-layer defense (Test → Lint → CI Gate → LLM Judge) was vertical—each layer digs deeper into checking a single PR.
Ryan’s control-plane pattern is horizontal—the complete lifecycle from PR open to merge.
The two aren’t in conflict—they’re complementary. The four-layer defense is one component inside the control plane. Specifically, it’s what runs inside Step 2 Preflight Gate and Step 5 Remediation Loop.
Putting them together:
```text
PR opens
  │
  ├── Step 1: Risk Contract → determine risk tier
  ├── Step 2: Preflight Gate
  │     ├── Layer 1: Test
  │     ├── Layer 2: Lint + Type Check
  │     ├── Layer 3: CI Gate (coverage, security)
  │     └── Layer 4: LLM Judge (code review agent)
  ├── Step 3: SHA Discipline → confirm evidence matches current HEAD
  ├── Step 4: Rerun Dedupe → avoid duplicate reruns
  ├── Step 5: Remediation Loop → Agent self-fix → back to Step 2
  ├── Step 6: Bot Thread Resolve → clean up resolved threads
  ├── Step 7: Browser Evidence → UI evidence verification
  ├── Step 8: Harness Gap Loop → incident turns into test case
  │
  └── Merge
```
This is a complete repo architecture that lets Agents write + review + fix code.
General pattern vs. specific implementation
Ryan specifically emphasizes this is a general pattern, not a specific toolchain.
| General concept | Ryan’s specific implementation |
|---|---|
| Code Review Agent | Greptile |
| Remediation Agent | Codex Action |
| Canonical Rerun Workflow | greptile-rerun.yml |
| Stale Thread Cleanup | greptile-auto-resolve-threads.yml |
| Preflight Policy | risk-policy-gate.yml |
You can swap Greptile for CodeRabbit, CodeQL, or a self-built LLM review. You can swap Codex for Claude Code, Cursor, Devin. The semantics of the control-plane don’t change—only the integration point does.
This is why I think this post deserves its own write-up—it’s not pitching a tool, it’s defining an architectural pattern.
Against my own hands-on experience
In “Make CI/CD Great Again” I shared my experience using three Agents to do multi-role code review. Looking back now through Ryan’s framework, what was I missing?
- No Risk Contract. My review treated every file the same. But realistically, `db/schema.ts` and `README.md` are 10x apart in risk tier.
- No SHA Discipline. After my Agent fixed an issue and pushed a new commit, I didn’t enforce a review re-run. I was relying on human eyes to decide “it’s fixed.”
- No Remediation Loop. When the review found issues, I either fixed them myself or re-prompted the Agent. There was no automated fix → re-review loop.
- No Harness Gap Loop. When something went wrong in prod, we fixed it. No systematic conversion into a test case.
Putting a rough number on it: I had maybe 40% of the picture, namely the four vertical layers (Test, Lint, CI Gate, LLM Judge). But four of Ryan’s control-plane steps (contract, SHA discipline, rerun dedupe, remediation loop) were completely missing.
This is the gap between “it runs” and “it scales.”
Tool recommendations: what do you need to build a Control-Plane?
The tech stack involved in the whole Harness Engineering setup breaks into four categories. The key point: this is a general pattern, and each category can be swapped for whatever alternative you know.
1. AI models and Agents
This is the “engine” producing and fixing code:
- OpenAI Codex: Ryan and OpenAI’s team’s main tool, supports long autonomous runs (6+ hours per run)
- Claude Code: what I use myself, good for scenarios that need deep codebase context understanding
- Other options: Cursor, Devin, Windsurf, etc.—anything that can plug into a PR workflow works
2. Automated code review
The “LLM Judge” role in the Control-Plane:
- Greptile: the code review agent used in Ryan’s concrete implementation, understands codebase semantics
- CodeRabbit: another mainstream option, which I used in “Make CI/CD Great Again”
- CodeQL: GitHub’s native static analysis, leaning toward security vulnerability detection
- Self-built LLM Review: wrap GPT-4 / Claude with a review prompt—maximum flexibility but also highest maintenance cost
3. CI/CD and infrastructure control plane
This is the actual “safety net” that catches AI output, relying mostly on the GitHub ecosystem and automation scripts:
- GitHub Actions: the execution environment for the whole control plane. Specific workflows mentioned in the post:
  - `risk-policy-gate.yml` (preflight gate)
  - `greptile-rerun.yml` (dedupe for repeated rerun triggers)
  - `greptile-auto-resolve-threads.yml` (auto-cleanup of bot comments)
- JSON Contract: a machine-readable Risk Contract defining which paths (like `db/schema.ts`) need stricter defenses
- TypeScript: custom preflight logic (Preflight Gate) and dedupe logic (marker + SHA dedupe)
- Git primitives: deep reliance on PR `synchronize` events, HEAD commit SHA tracking, and hidden HTML comments (`<!-- marker -->`) for state management
4. Application runtime and UI verification tools
To solve the “Agent can’t see the running state” problem and let the AI verify itself:
- Chrome DevTools Protocol (CDP): OpenAI’s team hooked the application up to CDP, so Codex can directly manipulate DOM, query logs, check metrics, and take screenshots
- npm scripts: used to produce Browser Evidence, e.g. `npm run harness:ui:capture-browser-evidence` and `npm run harness:ui:verify-browser-evidence`
- Git worktrees: let each Agent run in its own worktree, starting the app and running tests without interfering with the others
Minimum viable tool combo
If you’re a 2-3 person small team, you don’t need everything. My suggestion is start with these three:
| Priority | Tool | Cost | Corresponding Step |
|---|---|---|---|
| P0 | JSON Risk Contract + GitHub Actions | Free | Step 1, 2 |
| P0 | Git SHA tracking scripts | Free | Step 3, 4 |
| P1 | CodeRabbit or Greptile | $19-49/month | Step 2 Layer 4 |
| P2 | Codex or Claude Code | Usage-based | Step 5 |
The first two are pure discipline, zero cost, you can do them right now. The latter two depend on team size and budget.
Honestly
Ryan’s architecture is very complete, but I have a few practical questions I haven’t figured out:
What I’m more confident about:
- Risk Contract is mandatory. No matter how small your repo is, writing down the risk tiers and merge conditions has extremely high ROI. A single JSON file eliminates endless arguments about “does this PR need a review.”
- SHA Discipline is non-negotiable. I previously cut this corner out of laziness, and the result: a bug introduced by an Agent’s fix commit went straight into main. Painful lesson.
- Preflight Gate actually saves money. I did a rough calculation: if our repo had had a preflight gate, we could have saved about 30% of last month’s CI bill.
What I’m less sure about:
- Convergence of the Remediation Loop. Agent fixes → review finds more problems → Agent fixes again → review finds more problems… when does this loop stop? Ryan doesn’t mention a max retry or circuit breaker. In my experience, the LLM’s fix itself introduces new problems (I wrote before about “after the fix ran through, round two surfaced two more high-severity issues”). Infinite loops are a real risk; see the sketch after this list.
- Cost-benefit for small teams. Ryan’s at OpenAI—resources are abundant. But for a 2-3 person startup, building this whole thing is a non-trivial monthly spend just on Greptile + Codex + CI. Do you need all eight steps? Or can you pick priorities? My suggestion is to start with 1 (contract), 2 (preflight), and 3 (SHA discipline). Those three are almost zero cost, pure discipline.
- Generalizability of Browser Evidence. If your product isn’t a Web UI but an API or CLI, what does browser evidence become? API response assertions? CLI output snapshots? Ryan doesn’t expand on this.
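On the convergence question specifically, there is a boring classical mitigation: cap the loop and escalate to a human. A sketch of the circuit breaker I’d bolt onto Step 5; this is my addition, not part of Ryan’s pattern, and the helper functions are hypothetical:
```ts
// Hypothetical wrappers around your review agent, coding agent, and GitHub API.
declare function runReview(pr: number): Promise<unknown[]>;
declare function runRemediationAgent(pr: number, findings: unknown[]): Promise<void>;
declare function labelAndAssignHuman(pr: number, label: string): Promise<void>;

const MAX_REMEDIATION_ROUNDS = 3; // my number, tune per repo

async function remediationLoop(pr: number): Promise<'clean' | 'needs-human'> {
  for (let round = 1; round <= MAX_REMEDIATION_ROUNDS; round++) {
    const findings = await runReview(pr);       // Steps 2-4: review at current HEAD
    if (findings.length === 0) return 'clean';  // converged: normal merge path

    await runRemediationAgent(pr, findings);    // Step 5: agent self-fix, re-gated
  }
  // The loop isn't converging: escalate instead of burning tokens forever.
  await labelAndAssignHuman(pr, 'remediation-loop-stuck');
  return 'needs-human';
}
```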
Not a conclusion
Last time I said “Make CI/CD Great Again” isn’t a slogan. Ryan Carson used his repo to prove the next line:
CI/CD doesn’t just need to be Great Again—it needs to become a complete control plane.
From risk contract to harness gap loop, from preflight gate to remediation loop—this architecture lets the Agent not just “write code” but “be safely caught by the repo.”
How complete your repo’s control plane is decides how fast you can let the Agent run.
Letting the Agent run in a repo with no SHA discipline is like letting a car with no ABS drive 200 km/h—not that it can’t, but sooner or later it’s going to crash.
Further reading:
- When Code Volume Explodes 10x, Who Actually Does the Review? Make CI/CD Great Again
- After 630K Lines in Three Months: In the AI Coding Era, What Is an Engineer’s Real Value?
- Half-Year Review of AI Coding: Development Didn’t Get Faster, We Just Moved the Bottleneck from Writing Code to QA and Requirements Gathering
- Security Risks of AI Coding Tools: When Prompt Injection Meets RCE