AI Code Review: My Bot Found 47 Bugs (12 Were Real)

I got tired of manual code reviews taking forever, so I built an AI code reviewer. It would scan every PR, flag issues, and free up senior developers for real work.

The experiment ran for a month. The AI found 47 "bugs." After manual review, 12 were actual issues. The rest ranged from false positives to hilariously wrong interpretations.

Here's what I learned about AI code review: what it catches, what it misses, and whether it's actually worth using.

The Setup

I built a GitHub Action that:

Triggers on every PR
Extracts the diff
Sends code to GPT-4 with a review prompt
Posts comments on flagged lines
Creates a summary comment

yaml

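# .github/workflows/ai-review.yml
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]

# The job posts review comments, so the token needs write access to pull requests
permissions:
  contents: read
  pull-requests: write

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history so the base branch is available for the diff

      - name: Get diff
        env:
          BASE_REF: ${{ github.base_ref }}
        run: |
          # Diff against the PR's actual base branch rather than a hardcoded main
          git diff "origin/$BASE_REF...HEAD" > diff.txt

      - name: AI Review
        run: node scripts/ai-review.js
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}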

The review script:

typescript

// `diff` holds the contents of diff.txt written by the workflow's "Get diff" step
const reviewPrompt = `
You are a senior code reviewer. Review this diff for:

1. **Bugs**: Logic errors, null pointer issues, race conditions
2. **Security**: SQL injection, XSS, auth bypasses, exposed secrets
3. **Performance**: N+1 queries, unnecessary re-renders, memory leaks
4. **Best Practices**: Error handling, type safety, code clarity

For each issue, provide:
- Severity (critical/high/medium/low)
- File and line number
- Description of the issue
- Suggested fix

Only flag real issues. Do not flag style preferences.
Respond in JSON format.

Diff:
${diff}
`;
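Because the prompt asks for JSON, the reply has to be parsed and validated before anything is posted to the PR. A minimal sketch of that step, assuming the model returns an array of objects with `severity`, `file`, `line`, `description`, and `fix` fields. Those field names and the `parseFindings` helper are illustrative assumptions, not a fixed schema:

```typescript
type Severity = "critical" | "high" | "medium" | "low";

// Assumed shape of one finding, mirroring the fields the prompt asks for.
interface Finding {
  severity: Severity;
  file: string;
  line: number;
  description: string;
  fix: string;
}

const SEVERITIES: readonly string[] = ["critical", "high", "medium", "low"];

// Type guard: checks one untrusted value against the assumed Finding shape.
function isFinding(value: unknown): value is Finding {
  if (typeof value !== "object" || value === null) return false;
  const { severity, file, line, description, fix } = value as Record<string, unknown>;
  return (
    typeof severity === "string" &&
    SEVERITIES.indexOf(severity) !== -1 &&
    typeof file === "string" &&
    typeof line === "number" &&
    line > 0 &&
    Math.floor(line) === line &&
    typeof description === "string" &&
    typeof fix === "string"
  );
}

// Parses the raw model reply. Invalid JSON or malformed entries are dropped
// instead of crashing the workflow.
function parseFindings(raw: string): Finding[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter(isFinding);
}
```

Dropping malformed entries silently keeps the pipeline running; logging them instead would make prompt problems easier to spot.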

The Results: A Taxonomy of AI Opinions

Category 1: Actually Useful Findings (12 bugs)

These were real issues the AI caught:

Bug #1: Missing null check

typescript

// AI flagged this
function getUserName(user) {
  return user.profile.name; // user.profile could be undefined
}

Legitimate catch. We'd see this crash in production.
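A possible fix, assuming a missing profile should fall back to a placeholder rather than throw. The `User` shape and the `"Unknown"` default are illustrative:

```typescript
// Hypothetical user shape: both profile and name may be absent.
interface User {
  profile?: { name?: string };
}

// Returns the profile name, or the fallback when user, profile, or name is missing.
function getUserName(user: User | null | undefined, fallback = "Unknown"): string {
  return user?.profile?.name ?? fallback;
}
```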

Bug #2: SQL injection risk

typescript

// AI flagged this
const query = `SELECT * FROM users WHERE id = '${userId}'`;

100% correct. This is a security vulnerability.
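The usual fix is a parameterized query, where the value travels separately from the SQL text. A driver-agnostic sketch: the `$1` placeholder and the `{ text, values }` shape follow the style of PostgreSQL clients such as node-postgres, but placeholder syntax varies by driver, and `buildUserQuery` is a hypothetical helper:

```typescript
// Query text and values kept apart so the driver can bind values safely.
interface ParameterizedQuery {
  text: string;
  values: string[];
}

// The user-supplied id never becomes part of the SQL string itself.
function buildUserQuery(userId: string): ParameterizedQuery {
  return {
    text: "SELECT * FROM users WHERE id = $1",
    values: [userId],
  };
}
```

With this shape, an input like `1' OR '1'='1` is passed to the database as a literal value to compare against, not as SQL to execute.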

Bug #3: Unhandled promise rejection

typescript

// AI flagged this  
async function fetchData() {
  const response = await fetch(url);
  return response.json(); // No error handling
}

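Valid point. A network failure or a malformed JSON body rejects the promise, and nothing here catches it. On top of that, fetch does not reject on HTTP error statuses, so a 500 response would be parsed as if it had succeeded.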
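One way to handle both failure modes is to return a result object instead of letting the rejection escape. This is a sketch under assumptions: the `Fetcher` and `FetchResult` types are made up for illustration, and the fetch function is passed in as a parameter so the example does not depend on any particular runtime:

```typescript
// Minimal response shape, declared locally so the sketch needs no DOM typings.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

type Fetcher = (url: string) => Promise<HttpResponse>;

type FetchResult =
  | { ok: true; data: unknown }
  | { ok: false; error: string };

async function fetchData(url: string, fetcher: Fetcher): Promise<FetchResult> {
  try {
    const response = await fetcher(url);
    // fetch resolves even on 4xx/5xx, so the status has to be checked explicitly.
    if (!response.ok) {
      return { ok: false, error: `HTTP ${response.status}` };
    }
    return { ok: true, data: await response.json() };
  } catch (err) {
    // Reached on network failure or when the body is not valid JSON.
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}
```

In a runtime with a global `fetch`, passing it as the `fetcher` argument should satisfy this type.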

Bug #4: Race condition

typescript

// AI flagged this
let cache = null;

async function getData() {
  if (!cache) {
    cache = await fetchFromAPI(); // Multiple calls can start before first completes
  }
  return cache;
}

Good catch. Concurrent calls could all trigger fetches.
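A common fix is to cache the in-flight promise rather than the resolved value, so concurrent callers share one request. A minimal sketch, where `createCachedGetter` is a hypothetical wrapper and clearing the cache on failure is one design choice among several:

```typescript
// Wraps an async loader so that concurrent calls share a single request.
function createCachedGetter<T>(fetchFromAPI: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;

  return function getData(): Promise<T> {
    if (pending === null) {
      // Assigned synchronously, before any await, so a second caller sees it.
      pending = fetchFromAPI().catch((err) => {
        // Clear the cache on failure so a later call can retry.
        pending = null;
        throw err;
      });
    }
    return pending;
  };
}
```

The key difference from the flagged version is that the assignment happens before the first `await`, so there is no window in which a second caller still sees an empty cache.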

Category 2: True But Unhelpful (15 findings)

These were technically correct but not actionable:

Finding: "Consider adding input validation"

typescript

function processOrder(orderId: string) {
  // AI wants validation here
  return db.orders.findOne({ id: orderId });
}

The parameter is typed as a string, so TypeScript already rejects non-string callers at compile time. Adding runtime validation here adds noise without value.

Finding: "This could throw an error"

typescript

JSON.parse(data);

Yes, it could. But it's already inside a try/catch two levels up. The AI couldn't see the context.

Finding: "Magic number detected"

typescript

const pageSize = 20;

It's a constant. It's named. It's fine. Not every number needs to be in a config file.

Category 3: Style Preferences Disguised as Issues (11 findings)

These were the AI imposing opinions:

"Use optional chaining instead"

typescript

// AI wanted this changed
if (user && user.profile && user.profile.name) {}
// To this
if (user?.profile?.name) {}

Fine suggestion, but not a bug. And our codebase had a mix of styles.

"Prefer const over let"

typescript

let count = 0;
// ... later
count += 1; // Actually mutated, let is correct

The AI saw let and flagged it without noticing the mutation below.

"Consider using destructuring"

typescript

const name = props.name;
const email = props.email;

Personal preference. Not an issue.

Category 4: Completely Wrong (9 findings)

These were embarrassing AI mistakes:

False positive: "Unused variable"

typescript

const { data, error } = useQuery();
// AI said `data` was unused

return <div>{data?.name}</div>; // AI couldn't see JSX reference

The AI parsed the diff incorrectly and missed the JSX usage.

False positive: "Missing await"

typescript

function syncOperation() {
  return someComputedValue; // Not async, doesn't need await
}

The AI assumed every function returning something should be awaited.

Hallucination: "Import not found"

typescript

import { formatDate } from '@/utils/date';
// AI: "The module '@/utils/date' doesn't exist"

It does exist. The AI just couldn't see our file structure.

What AI Code Review Actually Catches

After a month, here's the pattern:

Reliably Catches

Obvious security issues: SQL injection, XSS, hardcoded secrets
Null/undefined risks: Missing checks before property access
Basic type errors: String where number expected (in JS)
Missing error handling: Unhandled promises, no try/catch
Simple logic errors: Off-by-one, wrong comparison operators

Sometimes Catches

Race conditions: If they're simple and explicit
Performance issues: Obvious N+1 patterns
Incomplete implementations: Missing cases in switch statements

Rarely Catches

Business logic errors: AI doesn't know your requirements
Architectural problems: Wrong patterns, poor abstractions
Complex bugs: Anything requiring understanding of program flow
Context-dependent issues: Depends on code elsewhere

Never Catches

Design problems: AI can't tell if your API makes sense
Requirement mismatches: Did you build the right thing?
Scalability issues: Will this work with 10x load?
Maintainability: Is this code easy to change?

Making AI Review Actually Useful

Based on my experiment, here's how to get value:

1. Narrow the scope

Don't ask AI to review everything. Focus on what it's good at:

typescript

const focusedPrompt = `
Review ONLY for:
- Security vulnerabilities
- Unhandled errors
- Null pointer risks

Do NOT comment on:
- Code style
- Naming conventions
- Performance (unless severe)
`;

Fewer false positives, more signal.

2. Provide context

The AI can't see your whole codebase. Give it what it needs:

typescript

const contextualPrompt = `
Project context:
- TypeScript with strict mode
- All database calls are already wrapped in error handlers
- We use path aliases (@/) for imports
- React Query handles API caching

Review this diff:
${diff}
`;

3. Require confidence levels

Make the AI self-assess:

typescript

const confidentPrompt = `
For each issue, rate your confidence:
- HIGH: I'm certain this is a bug
- MEDIUM: This looks problematic but might be intentional
- LOW: This is a suggestion, not a bug

Only report HIGH and MEDIUM issues.
`;

This cuts down on noise significantly.
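The threshold can also be enforced in code, in case the model reports LOW items anyway. A small sketch, assuming each finding carries a `confidence` field with the three labels from the prompt; `RatedFinding` and `filterByConfidence` are illustrative names:

```typescript
type Confidence = "HIGH" | "MEDIUM" | "LOW";

// Assumed shape: a finding tagged with the model's self-assessed confidence.
interface RatedFinding {
  description: string;
  confidence: Confidence;
}

// Numeric rank so confidence levels can be compared against a threshold.
const RANK: Record<Confidence, number> = { LOW: 0, MEDIUM: 1, HIGH: 2 };

// Keeps only findings at or above the minimum confidence (MEDIUM by default).
function filterByConfidence(
  findings: RatedFinding[],
  minimum: Confidence = "MEDIUM"
): RatedFinding[] {
  return findings.filter((f) => RANK[f.confidence] >= RANK[minimum]);
}
```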

4. Two-pass review

First pass: AI review for automated catches
Second pass: Human review for everything else

Don't let AI review replace human review. Let it pre-filter.

5. Tune over time

Track false positives and adjust:

typescript

// After a month, I added this to my prompt:
const refinedPrompt = `
Known exceptions (do NOT flag these):
- useQuery destructuring where data appears unused
- Try/catch is often in parent components
- Path aliases (@/) are valid in our config
`;
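Tracking can be as simple as tallying each finding's verdict and watching the share that turn out to be real. A sketch using the four category counts from this experiment; the `Verdict` labels and the `precision` helper are illustrative names:

```typescript
type Verdict = "real" | "unhelpful" | "style" | "wrong";

// Share of findings that were real bugs; 0 when nothing has been reviewed yet.
function precision(counts: Record<Verdict, number>): number {
  const total = counts.real + counts.unhelpful + counts.style + counts.wrong;
  return total === 0 ? 0 : counts.real / total;
}

// Counts from the four categories in this post (12 + 15 + 11 + 9 = 47).
const firstMonth: Record<Verdict, number> = {
  real: 12,
  unhelpful: 15,
  style: 11,
  wrong: 9,
};

console.log((precision(firstMonth) * 100).toFixed(1) + "%"); // 25.5%
```

Recomputing this after each prompt change shows whether an adjustment actually reduced noise.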

The Real ROI

After a month:

Time saved: ~30 minutes/week (quick catches AI finds)
Time lost: ~45 minutes/week (reviewing false positives)
Cost: ~$50/month in API calls

Net result: Slightly negative ROI

But here's the thing—four of those 12 real bugs were security issues. Without AI review, they might have shipped. One SQL injection in production is worth more than a year of AI review costs.
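To put the time and API figures above on one scale, the arithmetic can be sketched like this. The hourly rate and the four-weeks-per-month simplification are assumptions for illustration, not numbers from the experiment:

```typescript
interface ReviewCosts {
  minutesSavedPerWeek: number;
  minutesLostPerWeek: number;
  apiCostPerMonthUsd: number;
}

// Net monthly cost in dollars; positive means the tool costs more than it saves.
// hourlyRateUsd is a hypothetical developer rate supplied by the caller.
function netMonthlyCostUsd(
  costs: ReviewCosts,
  hourlyRateUsd: number,
  weeksPerMonth = 4
): number {
  const netMinutesLost =
    (costs.minutesLostPerWeek - costs.minutesSavedPerWeek) * weeksPerMonth;
  return (netMinutesLost / 60) * hourlyRateUsd + costs.apiCostPerMonthUsd;
}

// The three measured figures from this section.
const measured: ReviewCosts = {
  minutesSavedPerWeek: 30,
  minutesLostPerWeek: 45,
  apiCostPerMonthUsd: 50,
};

console.log(netMonthlyCostUsd(measured, 100)); // 150, at a hypothetical $100/hour
```

The function only covers the measurable side. It says nothing about the value of a security bug that never ships.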

My Revised Approach

I still use AI code review, but differently:

Automated pipeline:

yaml

# Only runs on:
# - Files touching auth/security
# - Database queries
# - External API integrations
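The same restriction can be applied inside the review script by filtering the changed files before building the prompt. A sketch with hypothetical path patterns, which would need adjusting to a real repository layout:

```typescript
// Hypothetical patterns for security-relevant paths; not from a real repo.
const REVIEW_PATTERNS: RegExp[] = [
  /(^|\/)auth\//, // auth code
  /(^|\/)security\//, // security code
  /(^|\/)db\//, // database queries
  /\.sql$/, // raw SQL files
  /(^|\/)integrations\//, // external API integrations
];

// Returns only the changed files that match at least one pattern.
function filesToReview(changedFiles: string[]): string[] {
  return changedFiles.filter((file) =>
    REVIEW_PATTERNS.some((pattern) => pattern.test(file))
  );
}
```

For example, given `src/auth/login.ts`, `README.md`, and `migrations/001.sql`, only the first and last would be sent for review.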

Prompt focuses on:

Security vulnerabilities only
High confidence findings only
No style feedback

Human reviewers:

Still do full reviews
AI is a safety net, not a replacement
Explicitly told to ignore AI style comments

When to Use AI Code Review

Good fit:

Large teams with varied experience levels
Security-sensitive codebases
High PR volume where humans can't review everything
Compliance requirements for security scanning

Bad fit:

Small teams who know the codebase
Rapid prototyping
When you'd spend more time configuring than reviewing
If false positives would frustrate your team

Tools to Try

If you don't want to build your own:

CodeRabbit: Full-featured AI review
Sourcery: Python-focused
GitHub Copilot PR review: Built into Copilot Enterprise
Amazon CodeGuru: AWS's offering

Or build your own for maximum control (what I recommend if you have time).

The Bottom Line

AI code review is useful but overhyped. It catches surface-level issues reliably—null checks, obvious security holes, missing error handling. It misses everything requiring context—business logic, architecture, requirements.

The math is simple: If AI catches one critical bug per month that humans would miss, it's worth it. In my experience, it catches about one every two months. Your mileage may vary based on your team's review rigor and codebase complexity.

Use it as a safety net, not a replacement. Narrow its focus to what it's good at. And never, ever trust it without human verification.

My bot found 47 bugs. 12 were real. That's a hit rate of about one in four. Would you trust a human reviewer with that track record?

Me neither. But as a first-pass filter? It's fine. Just calibrate your expectations accordingly.

---

AI tools amplify human capability; they don't replace human judgment. Use them to catch what you'd miss, not to avoid doing the work.


Written by Jose Viscasillas
