Jun 19, 2025

Learnings from building AI agents

How we made our AI code reviewer stop being so noisy

Paul Sangle-Ferriere

I’m Paul, cofounder of cubic—an "AI-native GitHub." One of our core features is an AI code review agent that performs an initial review pass, catching bugs, anti-patterns, duplicated code, and similar issues in pull requests.

When we first released this agent back in April, the main feedback we got was straightforward: it was too noisy.

Even small PRs often ended up flooded with low-value comments, nitpicks, and outright false positives. Rather than helping reviewers, the agent cluttered discussions and obscured genuinely valuable feedback.

(Screenshot: an example nitpick comment)

We decided to take a step back and thoroughly investigate why this was happening.

After three major architecture revisions and extensive offline testing, we managed to reduce false positives by 51% without sacrificing recall.

Many of these lessons turned out to be broadly useful—not just for code review agents but for designing effective AI systems in general.

1. The Face‑Palm Phase: A Single, Do‑Everything Agent

Our initial architecture was straightforward but problematic:

diff → single large prompt (with contextual codebase info) → list of comments

It looked clean in theory but quickly fell apart in practice:

  • Excessive false positives: The agent often mistook style issues for critical bugs, flagged resolved issues, and repeated suggestions our linters had already addressed.

  • Users lost trust: Developers quickly learned to ignore the comments altogether. When half the comments feel irrelevant, the truly important ones get missed.

  • Opaque reasoning: Understanding why the agent made specific calls was practically impossible. Even explicit prompts like "ignore minor style issues" had minimal effect.

We tried standard solutions—longer prompts, adjusting the model's temperature, experimenting with sampling—but saw little meaningful improvement.

2. What Finally Worked

After extensive trial-and-error, we developed an architecture that significantly improved results and proved effective in real-world repositories. These solutions underpin the 51% reduction in false positives currently running in production.

2.1 Explicit Reasoning Logs

We required the AI to explicitly state its reasoning before providing any feedback:

{
  "reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
  "finding": "Possible nil‑pointer dereference",
  "confidence": 0.81
}

This approach provided critical benefits:

  • Enabled us to clearly trace the AI’s decision-making process. If reasoning was flawed, we could quickly identify and exclude the pattern in future iterations.

  • Encouraged structured thinking by forcing the AI to justify its findings first, significantly reducing arbitrary conclusions.

  • Created a foundation to diagnose and resolve root causes behind other issues we faced.
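To make this concrete, here is a minimal sketch of how findings in this shape can be filtered before they ever reach a pull request. The Finding type, the 0.7 cutoff, and the helper names are illustrative assumptions, not cubic's actual implementation.

// Hypothetical sketch: validate reasoning-first findings and drop low-confidence ones.
interface Finding {
  reasoning: string;   // must be filled in before the finding itself
  finding: string;
  confidence: number;  // 0..1, as in the JSON example above
}

// Assumed threshold; tune it against labeled review data.
const MIN_CONFIDENCE = 0.7;

function keepFinding(f: Finding): boolean {
  // Reject findings with empty or hand-wavy reasoning.
  if (f.reasoning.trim().length < 20) return false;
  // Reject findings the model itself is unsure about.
  return f.confidence >= MIN_CONFIDENCE;
}

// Filter the model's raw output before posting comments.
function selectComments(raw: Finding[]): Finding[] {
  return raw.filter(keepFinding);
}

Confidence alone is a crude filter; in practice the reasoning string is just as valuable, because it makes recurring bad patterns easy to spot and exclude in later iterations.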

2.2 Fewer, Smarter Tools

Initially, the agent had extensive tooling: Language Server Protocol (LSP) integration, static analyzers, test runners, and more. The explicit reasoning logs, however, revealed that most analyses relied on just a few core tools; the extra options mainly added confusion and led to mistakes.

We streamlined the toolkit to essential components only—a simplified LSP and a basic terminal.

With fewer distractions, the agent spent more energy confirming genuine issues, significantly improving precision.
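For illustration, a pared-down tool registry might look like the sketch below. The tool names and parameter shapes are assumptions made for the example; the post only says the agent kept a simplified LSP and a basic terminal.

// Hypothetical sketch of a minimal tool registry: one LSP lookup, one terminal command.
type ToolSpec = {
  name: string;
  description: string;
  parameters: Record<string, string>; // parameter name -> short description
};

const tools: ToolSpec[] = [
  {
    name: "lsp_lookup",
    description: "Resolve a symbol's definition, type, and references via the language server.",
    parameters: { file: "path in the repo", symbol: "identifier to resolve" },
  },
  {
    name: "terminal",
    description: "Run a read-only shell command in the checked-out repository (e.g. grep, ls).",
    parameters: { command: "shell command to execute" },
  },
];

// Everything else (test runners, extra static analyzers, ...) was dropped:
// the reasoning logs showed those tools were rarely decisive and often misled the agent.

The important property is not the exact schema but the small surface area: with only two tools, choosing a tool stops being a source of errors.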

2.3 Specialized Micro-Agents Over Generalized Rules

Initially, our instinct was to continuously add more rules into a single large prompt to handle edge cases:

  • “Ignore unused variables in .test.ts files.”

  • “Skip import checks in Python’s __init__.py.”

  • “Don't lint markdown files.”

This rapidly became unsustainable, and it was largely ineffective anyway: the AI frequently overlooked many of the rules.

Our breakthrough came from employing specialized micro-agents, each handling a narrowly-defined scope:

  • Planner: Quickly assesses changes and identifies necessary checks.

  • Security Agent: Detects vulnerabilities such as injection or insecure authentication.

  • Duplication Agent: Flags repeated or copied code.

  • Editorial Agent: Handles typos and documentation consistency.

  • etc…

Specialization let each agent work within a small, focused context, which kept precision high. The main trade-off was higher total token consumption, since much of the context is duplicated across agents; effective caching keeps this manageable.
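As a rough sketch of how this can be wired together, a planner inspects the diff and decides which specialized agents to run, and each agent returns findings in the same reasoning-first shape as before. The agent names mirror the list above; the function names and dispatch logic are assumptions for illustration.

// Hypothetical sketch of planner -> micro-agent dispatch.
type AgentName = "security" | "duplication" | "editorial";

interface Finding {
  reasoning: string;
  finding: string;
  confidence: number;
}

// Each micro-agent sees only the diff plus its own narrow instructions.
type MicroAgent = (diff: string) => Promise<Finding[]>;

declare const agents: Record<AgentName, MicroAgent>; // assumed to be implemented elsewhere

// In practice the planner is itself a cheap model call; here it is a stub
// that picks agents from simple signals in the diff.
function plan(diff: string): AgentName[] {
  const selected: AgentName[] = ["editorial"]; // always worth a typo/docs pass
  if (/password|token|sql|exec\(/i.test(diff)) selected.push("security");
  if (diff.split("\n").length > 200) selected.push("duplication");
  return selected;
}

async function review(diff: string): Promise<Finding[]> {
  const results = await Promise.all(plan(diff).map((name) => agents[name](diff)));
  return results.flat();
}

Because every agent receives largely the same diff and repository context, the shared prefix is a natural target for prompt caching, which is what keeps the token-overlap trade-off manageable.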

3. Real-world Outcomes

These architecture and prompt improvements led to meaningful results across hundreds of real pull requests from active open-source and private repositories. Specifically, over the past six weeks:

  • 51% fewer false positives, directly increasing developer trust and usability.

  • Median comments per pull request cut by half, helping teams concentrate on genuinely important issues.

  • Teams reported notably smoother review processes, spending less time managing irrelevant comments and more time effectively merging changes.

Additionally, the reduced noise significantly improved developer confidence and engagement, making reviews faster and more impactful.

4. Key Lessons

  1. Explicit reasoning improves clarity. Require your AI to clearly explain its rationale first—this boosts accuracy and simplifies debugging.

  2. Simplify the toolset. Regularly evaluate your agent's toolkit and remove tools that are rarely used (invoked in less than 10% of tasks); a small audit sketch follows this list.

  3. Specialize with micro-agents. Keep each AI agent tightly focused on a single task, reducing cognitive overload and enhancing precision.
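For lesson 2, here is a rough sketch of what such an audit could look like, assuming you log one record per tool invocation per task. The log shape and the underusedTools helper are illustrative; the 10% default comes from the recommendation above.

// Hypothetical sketch: flag tools used in fewer than 10% of tasks.
interface ToolCallLog {
  taskId: string;
  tool: string;
}

function underusedTools(logs: ToolCallLog[], totalTasks: number, threshold = 0.1): string[] {
  // Count the distinct tasks in which each tool was invoked at least once.
  const tasksPerTool = new Map<string, Set<string>>();
  for (const { taskId, tool } of logs) {
    if (!tasksPerTool.has(tool)) tasksPerTool.set(tool, new Set());
    tasksPerTool.get(tool)!.add(taskId);
  }
  return [...tasksPerTool.entries()]
    .filter(([, tasks]) => tasks.size / totalTasks < threshold)
    .map(([tool]) => tool);
}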
