Lenny's Newsletter

Not all AI agents are created equal — a framework for prioritizing your agent backlog

UA

Unknown Author

Apr 27, 2026

10 min read

Not all AI agents are created equal — a framework for prioritizing your agent backlog

Source: Lenny's Newsletter · Authors: Hamza Farooq & Jaya Rajwani (guest post, intro by Lenny Rachitsky) · Date: Apr 14, 2026 · Original: https://www.lennysnewsletter.com/p/not-all-ai-agents-are-created-equal

Header image

Note: This post is partially paywalled. The summary below covers the freely accessible sections (the framework, Category 1, Category 2, and the opening of Category 3). The detailed Category 3 deep-dive, prioritization criteria, and tool-selection guide for multi-agent networks sit behind the paywall.


The problem: "agent" is an umbrella for very different systems

Every AI leader the authors have talked to (30+ conversations) shows up with roughly the same backlog: a PM assistant, a RAG copilot, a customer-support system, a code-review agent, a voice-enabled shopping assistant — and the same question: "Which one do we build first?"

Teams reach for the familiar impact-vs-effort matrix and immediately hit a wall. Why? Because the things on that spreadsheet aren't comparable:

  • One agent takes 6 weeks; another takes 6 months.
  • One can be assembled by a PM in n8n; another needs a dedicated ML engineering team.
  • One costs $500/month to run; another runs up a six-figure annual LLM bill.

The authors' analogy: you're "comparing apples, oranges, and jet engines on the same spreadsheet." A customer-support assistant and a voice-enabled shopping agent may share the word "agent," but they demand different architectures, teams, infrastructure, and timelines. Until you separate them by kind, any effort/impact estimate is guesswork.

The missing step: categorize before you prioritize

Hierarchy diagram

Before "what should we build first?" comes "what type of agent is each idea actually proposing?" The type determines:

  • Build complexity
  • Required skills and infrastructure
  • Timeline
  • Operating cost
  • How you should measure success

This framing aligns with the broader industry — e.g. the Levels of Autonomy for AI Agents paper and IBM's "Types of AI agents" — and it's how production tools like Claude Code are actually architected.

The framework gives you four things:

  1. A 5-minute triage to drop every agent idea into one of three architectural buckets.
  2. A guide for picking the right tool (n8n vs. LangGraph vs. Google ADK).
  3. Success metrics and ROI frameworks tailored to each bucket.
  4. Warning signs you've picked the wrong path, and how to switch.

The three categories at a glance

CategoryWho decides what happens next?Typical toolsShare of agent opportunities
1. Deterministic automationYou define every step; LLM only fills in content at specific nodesn8n, Zapier, Make.com, OpenAI AgentKit, Lindy, Gumloop60–70%
2. Reasoning & acting (ReAct)The LLM decides what to do next using tools you provideLangGraph, CrewAI, AutoGen, Google ADK25–30%
3. Multi-agent networksMultiple specialized agents coordinate with each otherGoogle ADK, AutoGenThe remainder; rarely the right starting point

The authors' biggest warning: most teams overengineer, reaching for Category 2 frameworks to solve Category 1 problems — adding cost and complexity for no benefit. Less common but worse: trying to solve Category 2 problems with Category 1 tools, which then breaks in production because the tool isn't robust enough.


Category 1 — Deterministic automation (the workhorse)

What it is

An "intelligent flowchart." You design every step, every branch, every decision point. The LLM only handles natural-language understanding/generation at specific nodes (classify this email, extract these fields, draft this reply). You control the flow; the LLM controls the content.

Tools: n8n, Zapier, Make.com, OpenAI AgentKit, Lindy, Gumloop. They're built around explicit triggers and predefined branching.

When to start here

Almost always, if you have a backlog. Category 1 projects are simplest to plan, lowest risk to execute, and produce ROI in weeks, not months. They're ideal when:

  • The process is already well-defined.
  • The work is repetitive and high-volume.
  • You have limited AI engineering capacity.
  • You're under pressure to ship in weeks.

Category 1 profile

How to recognize a Category 1 problem

If you can draw the whole process as a flowchart with clear decision points, it's Category 1. Other tells:

  • Execution paths are finite and predictable (fewer than ~15–20 branches).
  • Tasks complete in seconds to minutes.
  • The value is automating a known process, not discovering new approaches.

Worked example — customer-email triage. "An AI agent reads incoming customer emails, understands the ask, pulls relevant info from our docs, drafts a reply, and routes to a human for approval." This sounds like sophisticated reasoning. But map it out and every step is predictable: classify → retrieve → draft → route. The intelligence sits inside each step (understanding the email, generating a good reply), not in figuring out what step comes next. Category 1.

Email automation flow

Other Category 1 examples:

  • A travel-planning agent (the author built one for Airbnb search, used by 10,000+ people).
  • A voice-enabled book companion.
  • Content automation (e.g., YouTube → LinkedIn).
  • A "Perplexity-clone" knowledge base over your org's docs + the web.
  • A blog-research-and-draft agent.
  • A calorie counter that takes meal photos and recommends dietary changes.

How to evaluate it

The metrics answer one question: did this agent actually automate the right process?

  • Workflow completion rate — % of executions that finish successfully
  • Automation rate — % of requests handled with zero human touch
  • Accuracy — correctness of intent classification, extraction, routing
  • Latency — trigger-to-output time (P50/P95)
  • Cost per workflow — total LLM + API spend per run
  • Error rate — failures from tools/integrations
  • Human review rate — % needing manual approval

Real example — SaaS email-support agent:

  • Week 1: 52% completion (edge cases everywhere)
  • Week 4: 78% (refined classification logic)
  • Week 8: 87% (production-stable)
  • Result: 3,000 emails/month automated, 2.5 FTE-hours/day freed, ~$18K/month saved.

If completion stays low or human-review rate stays high, the problem may not actually be deterministic enough — that's your signal to look at Category 2.

When you've outgrown Category 1

Watch for:

  1. The flowchart has 30+ nodes and you're adding branches weekly.
  2. Customers phrase things in ways you can't anticipate; mapping every variation is impossible.
  3. The agent needs to choose which API or knowledge source to use based on context, not a predefined path.
  4. Ambiguous requests need exploration and adaptation, not predefined decomposition.
  5. The remaining high-value opportunities can't be expressed as predictable workflows.
  6. You've already automated the obvious quick wins.

Several of these together = move to Category 2.


Category 2 — Reasoning and acting agents (ReAct)

What it is

Instead of defining the flow, you define the available tools (functions the agent can call) and let an LLM decide what to do next. The agent runs a loop:

observe → reason → act → observe result → repeat

Key inversion vs. Category 1: you control the tools; the LLM controls the reasoning.

Tools: LangGraph, CrewAI, AutoGen and other orchestration libraries that support tool use, memory, and dynamic planning.

ReAct loop diagram

When to use it

When user requests are ambiguous, workflows can't be mapped in advance, and the value is in flexible contextual decision-making. Category 2 carries higher cost and execution risk than Category 1, but unlocks use cases deterministic automation can't touch.

Category 2 profile

How to recognize a Category 2 problem

The defining test: the same user request can trigger different action sequences each time. Other tells:

  • 5–15+ distinct capabilities, and the right one depends on context.
  • User intent is ambiguous and may need a clarifying back-and-forth.
  • Multiple input modalities (voice, image, text) need to be interpreted contextually.
  • Decomposing complex requests into sub-tasks is itself part of the value.

This fits 25–30% of agent opportunities.

Worked example — voice-enabled shopping assistant. A customer uploads a photo of shoes and says: "These are too small. I need a size up, and I want them delivered by Thursday."

Under the hood:

  1. Observe — mixed input: image + voice.
  2. Reason — plan: identify the product → find size variants → check delivery → confirm.
  3. Act — dynamically picks tools: visual_search()check_inventory()get_delivery_options()place_order().
  4. Observe & re-reason — each tool result feeds back into the next decision.

Crucially, this sequence cannot be pre-defined:

  • Out of stock? Suggest alternatives.
  • No Thursday delivery? Propose pickup.
  • Image unrecognizable? Ask a clarifying question.

Same request → different action sequences depending on world state. That's why Category 1 mapping fails here.

Shopping agent flow

Other Category 2 examples:

  • Conversational customer support.
  • A code assistant that modifies repos (e.g., Claude Code).
  • An intelligent personal shopping assistant.
  • IT troubleshooting agent.
  • A sales copilot that researches accounts and drafts outreach.
  • A multimodal assistant combining voice, image, and text.

How to evaluate it

Question to answer: was dynamic reasoning actually necessary, or could this have been a simpler workflow?

  • Task completion rate — % of sessions where the user's goal is achieved
  • Reasoning accuracy — correctness of decomposition, tool choice, ordering
  • Conversation length — average turns to resolution
  • Multimodal accuracy — image/voice/structured-input interpretation
  • Tool-call efficiency — average tool calls per successful session
  • Latency — per turn and end-to-end
  • Cost per session — LLM + API spend per interaction
  • User satisfaction — CSAT or equivalent
  • Business impact — lift in conversion, retention, or task success vs. baseline

Real example — voice + image shopping assistant for a home-goods retailer:

  • Month 1: 71% task completion, longer chats, higher tool usage, $0.12/session
  • Month 4: 86% task completion, shorter chats, fewer tool calls, $0.08/session
  • Image-recognition accuracy: 76% → 91%; conversion lift: +8% → +22%; CSAT: 4.0 → 4.5.

The healthy pattern: completion rate goes up while conversation length, tool calls, and cost per session go down. If completion stalls while costs stay high, the problem is over-scoped and probably belongs in Category 1.

When you've outgrown Category 2

  1. A single agent is juggling too many domains (support + inventory + logistics + finance) and quality is degrading.
  2. You need agents to delegate to other agents, not just call stateless APIs (e.g. shopping agent asks an inventory agent: "Check all warehouses and suggest alternatives.").
  3. Tasks take hours or days (e.g., an eval agent processing 10,000 conversations overnight).
  4. You need hundreds of agent instances running in parallel and coordinating.
  5. Different teams want to own their own specialized agents but those agents must work together.

Two or three of these at once → time for Category 3.


Category 3 — Multi-agent networks (intro only — rest paywalled)

What it is

Instead of one agent calling tools, you have multiple specialized agents that coordinate with each other. Each agent is owned by a different team, handles its own domain, and can request help from peer agents. Think enterprise systems built on Google ADK or AutoGen.

These are typically reserved for late-stage roadmaps where multiple teams must coordinate across domains. They should almost never be where a roadmap starts.

The remainder of the Category 3 deep-dive — when to prioritize, recognition criteria, evaluation metrics, real-world examples, and the full tool-selection guide — is behind Lenny's paywall.


The big takeaways

  • "Agent" is an umbrella term, not a category. Stop ranking deterministic workflows, ReAct agents, and multi-agent networks on the same effort/impact matrix.
  • Categorize first, then prioritize. The category dictates timeline, team, infrastructure, cost, and metrics.
  • Most opportunities (60–70%) are Category 1. Start with deterministic workflows in n8n/Zapier/AgentKit-style tools. They ship fast, prove ROI, and build org confidence.
  • Reach for Category 2 only when the same input can legitimately produce different action sequences. Otherwise you're paying ReAct prices for Zapier-level work.
  • Multi-agent networks are an end-state, not a starting point. Use them when delegation, parallelism, or cross-team ownership genuinely demands it.
  • Watch the metric trend, not the snapshot. A healthy agent shows rising completion with falling cost, latency, and human-review rate. If those move the wrong way together, you've probably picked the wrong category.
#AI#AI_AGENTS#AUTOMATION#CONTENT#PRODUCT#GROWTH

Author

Unknown Author

The weekly builder brief

Subscribe for free. Get the signal. Skip the noise.

Get one focused email each week with 5-minute reads on product, engineering, growth, and execution - built to help you make smarter roadmap and revenue decisions.

Free forever. Takes 5 seconds. Unsubscribe anytime.

Join 1,872+ product leaders, engineers & founders already getting better every Tuesday.