Models can write as well as any human given the right circumstances, and often better, if you’re willing to define what “better” means and spend the time to hold them to it. They also over-engineer, introduce bugs, make wrong assumptions, and get lazy. Same as the rest of us.
When an engineer introduces a bug, we don’t stop to analyze everything they read in the past two months, re-run their education, or drill their brain so they never err again (though some of my colleagues have tried). Understanding the root cause matters (skill, communication, context), but you’d spend far more time and effort on catching and preventing those errors, as the industry has for the last 20 years. We should do the same with model-backed agents: improve the process, not the black box.
It seems everyone has settled on some form of planning as the standard way to make agents more reliable. We discovered that too last year. We think a stricter version (research, then plan, then implement) helps us build better and more reliable software. So we broke our “coding” step into three: research, plan, and implement.
Research → Plan → Implement
We run the three steps in order whenever the work is non-trivial. “Non-trivial” is admittedly hard to define precisely, but in practice it covers any new feature, refactoring, or change to service logic. Here, dear reader, it is up to you to define what trivial means.
Research.
How we run it. You give research-codebase a well-defined question: usually the issue itself is enough, though engineers often provide a bit of extra context. It traverses the codebase (or codebases, when you launch it from a folder that contains multiple repos) and synthesizes findings into an index.md under docs/research/{issue-name}/, with two sections that feed the next step: Constraints for Future Work and Key Integration Points. We extend or reuse existing research when the area has been covered before; we don’t redo it from scratch for every new plan. Each research document records the git commit hash it was based on, which makes staleness easy to detect — the model can compare the current state of the codebase against what the research describes and flag what has drifted.
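To make the convention concrete, here is a hypothetical sketch of what a research index could look like. The index.md path, the recorded commit hash, and the two section names come from the process described above; the frontmatter fields, the hash value, and everything else in the skeleton are made up for illustration.

```markdown
---
# docs/research/authentication-across-services/index.md (hypothetical example)
commit: 4f2a9c1   # the git commit the research was based on; compare against HEAD to detect drift
---

# Research: authentication across services

## Findings
...

## Constraints for Future Work
- Tokens are validated at the gateway, not inside each service.

## Key Integration Points
- auth-service: token issuing
- api-gateway: auth middleware
```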
Why we run it before each plan. The code doesn’t live in the engineers’ heads anymore: it lives in the repo, and it keeps growing, fast. Research grounds and anchors the issue to the specific parts of the application, sometimes spanning more than one repo. Without it, the plan floats: when the code changes you lose what the previous plans were based on. The research also helps other agents and humans understand the codebase, independently of the plan at hand.
The research command is explicitly made aware that it is part of a process: its output will be consumed directly by generate-plan.
Plan.
How we run it. We give generate-plan the Linear issue (or a short description). It reads the matching research, absorbs the Constraints and Integration Points, and reads the project’s architecture rules before it does anything else. Then it asks 3–7 targeted questions (meaningful inquiries that could fundamentally change the design, or edge cases that affect error handling) and stops once it understands the happy path, the failure scenarios, and the domain boundaries. Finally, it produces one plan file under docs/plans/{feature-name}.plan.md.
Why the format matters. The command has one explicit rule: requirements before architecture. Architecture serves the requirements, not the other way around. Every requirement gets an ID (REQ-01, REQ-02, …) and at least two acceptance criteria. The plan is validated against project architecture patterns before it’s done: domain boundaries, endpoint conventions, test strategy. The design philosophy is stated plainly: plans must be clear enough for humans to approve and precise enough for AI to implement against. That dual constraint is what keeps the plan from being either a vague brief or a pile of pseudocode.
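For illustration, a plan file following these rules might be shaped like this. The REQ-IDs, the two-acceptance-criteria rule, and the requirements-before-architecture ordering are from the process described above; the frontmatter layout and all concrete content are hypothetical.

```markdown
---
# docs/plans/user-auth-unification.plan.md (hypothetical example)
related:
  research: authentication-across-services
---

# Plan: user auth unification

## Requirements
- REQ-01: All services accept the same bearer token.
  - AC1: A token issued by auth-service is accepted by api-gateway.
  - AC2: An expired token is rejected with a 401 on every endpoint.
- REQ-02: ...

## Architecture
(serves the requirements above; validated against domain boundaries,
endpoint conventions, and the test strategy before the plan is done)
```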
Implement.
With a plan file in place, we run implement-plan. It loads the plan and auto-discovers related research by feature-name convention. It follows project rules (from .cursor/rules/), breaks work into TODOs from the plan, implements in dependency order, runs lint and tests, and does not commit or push unless asked. When we care about real end-to-end behavior (multiple services, message queues, external APIs) we follow up with test-implementation: bring up services locally, hit the real entry points, check against the plan.
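“Implements in dependency order” is essentially a topological sort over the plan’s TODOs. A minimal Python sketch of that idea, assuming a TODO graph shaped like the one below (the TODO names and structure are hypothetical, not implement-plan’s actual internals):

```python
from graphlib import TopologicalSorter

# Hypothetical TODOs extracted from a plan: id -> set of prerequisite ids.
todos = {
    "write-migration": set(),
    "add-token-model": {"write-migration"},
    "wire-middleware": {"add-token-model"},
    "add-tests": {"add-token-model", "wire-middleware"},
}

def implementation_order(todos):
    """Return TODO ids in an order that satisfies every dependency."""
    return list(TopologicalSorter(todos).static_order())

print(implementation_order(todos))
# -> ['write-migration', 'add-token-model', 'wire-middleware', 'add-tests']
```

The same shape also makes cycles explicit: graphlib raises CycleError, which is a useful early signal that the plan’s TODOs contradict each other.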
The chain is held together by conventions. generate-plan looks up research by topic/feature keywords. implement-plan discovers research from the plan filename. So docs/research/authentication-across-services/ and docs/plans/user-auth-unification.plan.md link up by naming overlap or by explicit related: research: in the plan frontmatter. No magic, just file locations and a bit of discipline.
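That naming-overlap lookup fits in a few lines. A sketch of the idea in Python, assuming hyphen-separated names and substring matching on keywords (the stopword list and matching rule are assumptions, not the actual implementation):

```python
from pathlib import PurePath

def related_research(plan_path: str, research_dirs: list[str]) -> list[str]:
    """Link a plan to research folders by keyword overlap in their names.

    Hypothetical sketch: split names on hyphens, drop filler words, and
    match when any plan keyword is a substring of a research keyword
    (so "auth" links to "authentication") or vice versa.
    """
    stop = {"across", "for", "and", "the", "of"}
    plan_words = set(
        PurePath(plan_path).name.removesuffix(".plan.md").split("-")
    ) - stop

    def overlaps(words):
        return any(p in w or w in p for p in plan_words for w in words)

    return [
        d for d in research_dirs
        if overlaps(set(PurePath(d).name.split("-")) - stop)
    ]

print(related_research(
    "docs/plans/user-auth-unification.plan.md",
    ["docs/research/authentication-across-services",
     "docs/research/billing-retries"],
))
# -> ['docs/research/authentication-across-services']
```

An explicit related: research: entry in the plan frontmatter would bypass this heuristic entirely, which is why the convention stays robust even when names drift.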
How it looks in practice
The plan is, most of the time, the only artifact a human reviews: that’s an intentional trust decision, not an oversight. We believe the right place to exercise human judgment is before implementation begins, not after. Reviewing generated code line-by-line adds friction without adding much signal; reviewing the plan is where a senior engineer’s time actually compounds. docs/plans/ is listed in CODEOWNERS on GitHub, so when a plan is pushed, it triggers a pull request assigned to one of the engineering leads. Once approved, the code generated by implement-plan is reviewed only by automated CodeRabbit rules (no human in the loop). The plan approval is the human checkpoint. Everything downstream runs on it.
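On GitHub, routing plan reviews this way takes a single CODEOWNERS line. A hypothetical example (the team handle is made up):

```
# .github/CODEOWNERS (hypothetical)
# Every change under docs/plans/ requires review from the engineering leads.
/docs/plans/ @our-org/engineering-leads
```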
Implement typically delivers 60–70% of the solution: the scaffolding, the happy path, the boilerplate that would otherwise burn an afternoon. The remaining 30–40% is the work that still requires an engineer — the judgment calls, the integration quirks, the edge cases the plan didn’t fully anticipate. If the agent delivers significantly less than that, it’s usually a signal to look upstream: the plan is underspecified, or the research missed something. We treat low implementation yield as a process smell, not a model failure.
The Result
It’s not perfect. Sometimes the plan is stale by the time we implement and we have to adjust. But the improvement that surprised us most was observability. Research made the model’s reasoning visible: we could see why it built the plan the way it did, and small adjustments to the research produced meaningfully different plans. That feedback loop made the whole process more predictable. Before, a plan might miss that the same functionality already existed in a utility class and try to implement it again from scratch. Now that kind of gap shows up at plan review, not after the code is generated.
The combination of plan-first and automated code review gave us back a sense of control. Planning is where we make decisions. CodeRabbit is where generated code gets checked. Human engineers spend their time on what actually requires judgment. We know, roughly, what the models will produce — and when we don’t, it’s usually because the research or the plan needs more work, not because the model surprised us.
One other thing worth noting: a single research document can feed multiple plans. We can generate several plan variants, compare them, and merge the best elements before handing off to implementation. The same pattern applies to implementation itself. That’s not the default workflow yet, but it’s where this is heading.
This post is part of the SDLC2 series. It builds on Agents as Code, where we introduced the agent-resources repo and the research / plan / implement commands.