The AI Corner
Stanford AI engineering: 10 lessons most builders get wrong
Ruben Dominguez
Apr 14, 2026
A 2-hour lecture from Stanford's CS230 covers what most AI teams spend six months learning through painful production incidents. This is a distilled, beginner-friendly walkthrough of the ten rules that matter — what they mean, why they matter, and what to actually do about them.
The thesis under everything below: most AI products fail at the engineering layer, not the model layer. The model is fine. What you build around it is not.
1. Prompt training is the difference between AI helping you and AI making you worse
Harvard, UPenn, and Wharton ran a study (paid for by BCG) that split BCG consultants into three groups for a set of real consulting tasks:
- No AI at all
- AI access, no training
- AI access with prompt training
The trained group beat everyone on nearly every task. The interesting twist: the untrained-AI group performed worse than the people using nothing. They stopped thinking. The model filled the gap. Badly. The lecturer calls this "falling asleep at the wheel" — relying on AI on tasks that are beyond its current frontier of competence.
Two behavior patterns separated the people who used AI well:
- Centaurs — write one long, carefully constructed prompt, walk away, come back to a finished output.
- Cyborgs — rapid back-and-forth conversation, iterating in real time.
Both work. What does not work is using the model like a search engine and pasting the first answer into your deliverable.
Implication for teams: prompt training is the highest-leverage investment you can make — before new tools, before new models, before new hires. The "untrained" group is most of your workforce right now.
2. One prompt doing three jobs is a black box. Three prompts doing one job each is a system you can fix.
Imagine a single prompt that (a) extracts facts from a document, (b) outlines a blog post, and (c) writes the final draft. When the output is bad, you have no idea which of those three steps broke. You're staring at a black box.
Now break it into a chain of three sequential prompts: extraction → outlining → drafting. Run them on ten real inputs. The outline looks great. The final draft is off-brand. You know exactly where to fix things — only prompt three.
"Chaining improves performance, but most importantly, helps you control your workflow and debug it more seamlessly."
The real win of chaining isn't raw quality — it's visibility. You can now measure each step, swap models per step, and isolate failures. This is the basic move behind what's now called "context engineering," which has largely replaced single-shot prompt engineering as the leverage point in 2026.
Try this: take your most important single-prompt workflow, decompose it into three sequential prompts, and run both versions on the same ten inputs. The step you weren't measuring is almost always where quality is leaking.
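A minimal sketch of that decomposition in Python, assuming a generic call_llm(prompt) helper in place of any specific SDK:

```python
# call_llm is a placeholder: swap in whichever LLM client you already use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

def run_chained_pipeline(document: str) -> dict:
    """Three single-purpose prompts instead of one prompt doing three jobs."""
    # Step 1: extraction. Its output can be inspected and scored on its own.
    facts = call_llm(f"Extract the key facts from this document as bullet points:\n\n{document}")
    # Step 2: outlining. Takes only the extracted facts, not the raw document.
    outline = call_llm(f"Turn these facts into a blog post outline:\n\n{facts}")
    # Step 3: drafting. The step you can now debug in isolation.
    draft = call_llm(f"Write a blog post in our house style from this outline:\n\n{outline}")
    # Returning every intermediate output is what makes each step measurable.
    return {"facts": facts, "outline": outline, "draft": draft}
```

Score the facts, outline, and draft separately on your ten test inputs; the leaking step tends to identify itself.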
3. Fine-tuning is usually the wrong move — and Workera has a hilarious story to prove it
Workera fine-tuned a model on their company's Slack data so it would "speak like the team." When they asked it to write a blog post, it replied:
"I shall work on that in the morning."
Pushed harder, the model replied: "I'm writing right now. It's 6:30 a.m. here."
It had overfit to how humans procrastinate on Slack and lost the ability to follow instructions at all. Funny — and a perfect cautionary tale.
Beyond the anecdote, there's a structural problem: by the time you finish fine-tuning a model, the next base model ships and beats your fine-tuned version of the previous one. You're racing a moving frontier with a slower car.
Fine-tuning is still the right call in three narrow cases:
- Repeated, high-precision domain outputs (legal, scientific) where exact format matters.
- Persistent domain-language failures that general models cannot handle.
- Tasks where latency and cost savings genuinely justify the overhead.
Most teams fine-tune because it signals effort. It is usually just slower — and now often obsolete before it ships.
4. Every knowledge product has a hallucination problem. RAG is the foundational fix.
Bigger context windows are not the answer. They're a latency-and-cost problem with no sourcing — the model can still hallucinate, and you can't show the user where an answer came from.
RAG (Retrieval-Augmented Generation) is the architecture that solves this. It's four steps:
- Embed your documents (turn each chunk into a vector — a list of numbers that captures meaning) and store them in a vector database.
- Embed the user's query the same way and pull the nearest-matching documents.
- Inject those documents into the prompt as context.
- Add one explicit rule to the prompt: if the answer is not in these documents, say so.
That last rule is what turns RAG from a tech demo into a product. Without it, the model still confabulates when retrieval misses.
For any knowledge-intensive product (support bots, internal Q&A, anything that cites docs), RAG isn't an upgrade — it's the foundation.
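Here is a minimal sketch of those four steps, with embed() as a placeholder for whatever embedding model you use and a brute-force similarity search standing in for a real vector database:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call the embedding model you already use."""
    raise NotImplementedError

# Step 1 (offline): chunk your documents and embed each chunk.
#   vectors = [embed(chunk) for chunk in chunks]

def retrieve(query: str, chunks: list[str], vectors: list[np.ndarray], k: int = 3) -> list[str]:
    # Step 2: embed the query the same way, then rank chunks by cosine similarity.
    q = embed(query)
    scores = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))) for v in vectors]
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top]

def build_rag_prompt(query: str, retrieved: list[str]) -> str:
    # Steps 3 and 4: inject the retrieved text and state the refusal rule explicitly.
    context = "\n\n".join(retrieved)
    return (
        "Answer the question using ONLY the documents below.\n"
        "If the answer is not in these documents, say that you don't know.\n\n"
        f"Documents:\n{context}\n\nQuestion: {query}"
    )
```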
5. Vanilla RAG fails deep in long documents. Two techniques fix it. Almost nobody uses either.
Retrieving the right file is not the same as finding the right paragraph inside it. Run vanilla RAG against a 300-page clinical handbook and you'll often land on the correct document but the wrong page.
Two fixes:
Hierarchical chunking
Don't embed only at the document level. Embed at multiple granularities — document, chapter, section, paragraph — and store all of them. Now your retrieval can cite page and section, not just file name. Quality goes up dramatically on deep documents.
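One way to sketch that multi-granularity index; the chunk structure and splitting here are illustrative assumptions, not the lecture's code:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    level: str     # "document", "chapter", or "paragraph"
    location: str  # human-readable citation, e.g. "Handbook > Ch. 4 Dosing > para 12"

def hierarchical_chunks(doc_title: str, chapters: dict[str, list[str]]) -> list[Chunk]:
    """Emit chunks at several granularities so retrieval can cite chapter and paragraph,
    not just the file name. `chapters` maps chapter title -> list of paragraphs."""
    all_text = " ".join(p for paras in chapters.values() for p in paras)
    chunks = [Chunk(all_text, "document", doc_title)]
    for title, paragraphs in chapters.items():
        chunks.append(Chunk(" ".join(paragraphs), "chapter", f"{doc_title} > {title}"))
        for i, para in enumerate(paragraphs, start=1):
            chunks.append(Chunk(para, "paragraph", f"{doc_title} > {title} > para {i}"))
    # Embed every chunk and store text, vector, level, and location together;
    # at query time you can retrieve fine-grained chunks but still cite their parents.
    return chunks
```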
HyDE (Hypothetical Document Embeddings)
A user's question rarely looks like the answer linguistically. "What's the dose for X in elderly patients?" looks nothing like a paragraph in a clinical doc, so vector distance is large and retrieval misses.
The trick: have the model generate a fake, hallucinated answer to the query first, then embed that and use it for retrieval. A made-up answer linguistically resembles real documents far more than the question ever did. Counterintuitively, hallucinating on purpose makes retrieval more accurate.
Try this: run your RAG system on five deep-document queries. If accuracy is below 70%, add hierarchical chunking. If it's still below that, test HyDE.
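A HyDE sketch, reusing the hypothetical call_llm, embed, and retrieve helpers from the sketches above:

```python
def hyde_retrieve(query: str, chunks: list[str], vectors: list, k: int = 3) -> list[str]:
    # Ask the model for a deliberately made-up answer. It doesn't need to be correct;
    # it only needs to *look like* the kind of passage that would contain the answer.
    fake_answer = call_llm(
        "Write a short passage that plausibly answers this question, as if it were "
        f"quoted from a clinical handbook:\n\n{query}"
    )
    # Embed and search with the fake answer instead of the raw question.
    # Real documents sit much closer to answer-shaped text than to a question.
    return retrieve(fake_answer, chunks, vectors, k=k)
```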
6. Everyone's building "agents." Most are building the wrong thing.
The word "agent" is overloaded. The lecturer is blunt:
"Calling everything an agent doesn't do it justice. In practice, it's a bunch of prompts with tools, with additional resources, API calls that ultimately are put in a workflow."
Concrete contrast:
- RAG answer: "Refunds are available within 30 days." (Question answered.)
- Agentic workflow: asks for the order number, queries the database, confirms eligibility, issues the refund, and tells the user when the money lands. (Problem solved.)
There are three levels of autonomy you can give such a system, and they trade off control for flexibility:
- Hard-coded steps — you define the sequence; the model only fills in content at each step.
- Hard-coded tools, model-chosen order — the model picks which tool to call when, from a fixed toolbox.
- Fully autonomous — the model decides everything, including which tools to use and when to stop.
Higher autonomy = more impressive demo, less trustworthy output. Choose your autonomy level before you write a single line of code, based on how much you can afford to trust the result. Most teams skip this decision and end up with a system they can't debug or constrain.
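To make the first two levels concrete, here is a sketch with placeholder tools (lookup_order, issue_refund) and the same hypothetical call_llm helper; none of it is from the lecture:

```python
import json

def lookup_order(order_id: str) -> dict:
    # Placeholder: query your real order database here.
    return {"order_id": order_id, "refund_eligible": True}

def issue_refund(order_id: str) -> dict:
    # Placeholder: call your real payments API here.
    return {"order_id": order_id, "status": "refund_issued", "lands_in_days": 3}

TOOLS = {"lookup_order": lookup_order, "issue_refund": issue_refund}

def refund_level_1(order_id: str) -> str:
    """Level 1: the sequence is hard-coded; the model only writes the customer-facing text."""
    order = lookup_order(order_id)
    if not order["refund_eligible"]:
        return call_llm(f"Politely explain that this order is not refundable: {order}")
    result = issue_refund(order_id)
    return call_llm(f"Confirm this refund to the customer, including when the money lands: {result}")

def refund_level_2(user_message: str, max_steps: int = 5) -> str:
    """Level 2: the toolbox is fixed, but the model decides which tool to call and when."""
    history = user_message
    for _ in range(max_steps):
        decision = json.loads(call_llm(
            'Decide the next step. Reply as JSON: {"tool": "lookup_order" | "issue_refund" | "done", '
            '"args": {}, "reply": ""}\n\nConversation so far:\n' + history
        ))
        if decision["tool"] == "done":
            return decision["reply"]
        result = TOOLS[decision["tool"]](**decision["args"])
        history += f"\n[{decision['tool']} returned {result}]"
    return "Escalating to a human."  # a deterministic ceiling on how far autonomy can run
```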
7. The agents work. The org chart is the constraint.
McKinsey's case study: credit risk memos. A relationship manager pulls from 15+ sources; a credit analyst writes for 20+ hours; total turnaround is one to four weeks.
Drop in a multi-agent system — specialist agents working in parallel, drafting in coordination, humans reviewing and closing — and you get 20–60% time savings. The technology exists today.
So why isn't every bank running this? Because rewiring job descriptions, incentives, approval chains, and habits across 100,000 people takes a decade.
"The hardest part is changing people. It will take 10, 20 years to get to this being actually done at scale within an organization because change is so hard."
Strategic takeaway for founders: the companies that help large organizations operationalize AI changes — not just sell them agents — capture the disproportionate value over the next five years. Distribution and change management are the moat, not the model.
8. AI-powered software has a new failure mode that's silent until production breaks
Traditional software is deterministic: same input → same output, every time. User submits form, form writes to database. Predictable.
AI-powered software is fuzzy: user types anything, model interprets it, model acts. The gap between those two sentences is where engineering debt accumulates fast.
"Fuzzy engineering is truly hard. You might get hate as a company because one user did something that you authorized them to do that ended up breaking the database."
Fuzzy systems have four failure modes that don't exist in traditional software:
- Expanding security surfaces — every natural-language input is a potential injection.
- Probabilistic debugging with no stack trace — the bug is "the model misinterpreted this 1 in 50 times."
- Evals instead of unit tests — you can't assert exact equality on free-form output.
- Errors invisible until production — the failure shows up in front of a real user.
Try this: map your product's flow as a graph. Mark every step D (deterministic) or F (fuzzy). If more than 40% of steps are F, you're building something fragile. For every F you can, find a deterministic equivalent — wrap fuzzy components in deterministic infrastructure (validators, guardrails, retries, schema enforcement).
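A sketch of that wrapping using only the standard library and the hypothetical call_llm helper; a production system would more likely lean on a schema library such as pydantic, but the shape is the same:

```python
import json

ALLOWED_ACTIONS = {"lookup_order", "issue_refund", "escalate"}

def parse_and_validate(raw: str) -> dict:
    """Deterministic shell around a fuzzy output: schema check plus an action allow-list."""
    data = json.loads(raw)  # raises on malformed JSON
    if set(data) != {"action", "order_id"}:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"disallowed action: {data['action']}")
    if not isinstance(data["order_id"], str):
        raise ValueError("order_id must be a string")
    return data

def fuzzy_step(user_message: str, max_retries: int = 2) -> dict:
    prompt = (
        'Reply ONLY with JSON like {"action": "lookup_order", "order_id": "A123"}.\n'
        f"User message: {user_message}"
    )
    for _ in range(max_retries + 1):
        try:
            return parse_and_validate(call_llm(prompt))
        except (json.JSONDecodeError, ValueError) as err:
            # The retry is deterministic infrastructure: tell the model exactly what it broke.
            prompt += f"\n\nYour previous reply was invalid ({err}). Try again."
    return {"action": "escalate", "order_id": ""}  # deterministic fallback, never a crash
```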
9. One interview question tells you everything about an AI startup
"If you're interviewing with an AI startup, I would recommend you ask them: do you have LLM traces? Because if they don't, it is pretty hard to debug an LLM system."
LLM traces are the AI equivalent of structured logs: a record of every prompt, every tool call, every intermediate output, every token. Without them, "the agent did something weird yesterday" is unsolvable.
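A trace does not have to mean adopting a vendor. A minimal sketch, with field names that are assumptions rather than any standard:

```python
import json
import time
import uuid

def log_trace(trace_id: str, step: str, prompt: str, output: str,
              path: str = "traces.jsonl") -> None:
    """Append one structured record per model call. Share one trace_id across all the
    steps of a single request so you can replay the whole chain later."""
    record = {
        "trace_id": trace_id,
        "timestamp": time.time(),
        "step": step,      # which prompt in the chain produced this call
        "prompt": prompt,  # the exact text sent to the model
        "output": output,  # the exact text the model returned
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Usage: create trace_id = str(uuid.uuid4()) at the start of a request, then log every call.
```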
Evals — automated tests for LLM systems — should cover four dimensions:
- End-to-end — did the whole user experience succeed or fail?
- Component-based — which step in the chain broke it?
- Objective — automated checks (did the agent return the correct order ID?).
- Subjective — LLM-as-judge or human review (was the tone right? was the explanation clear?).
Start small: 20 hand-picked examples will surface failure modes faster than any dashboard. Then scale.
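A sketch of what a 20-example harness can look like. Here answer_support_question stands in for your system under test, and both the dataset format and the judge prompt are assumptions:

```python
def correct_order_id(output: str, expected_order_id: str) -> bool:
    """Objective check: did the agent surface the right order ID anywhere in its reply?"""
    return expected_order_id in output

def llm_judge(question: str, output: str) -> bool:
    """Subjective check: a second model grades tone and clarity."""
    verdict = call_llm(
        "You are grading a support reply. Answer PASS or FAIL only.\n"
        f"Question: {question}\nReply: {output}\n"
        "PASS only if the reply is clear, polite, and actually answers the question."
    )
    return verdict.strip().upper().startswith("PASS")

def run_evals(examples: list[dict]) -> None:
    # Each example: {"question": ..., "expected_order_id": ...}
    passed = 0
    for ex in examples:
        output = answer_support_question(ex["question"])  # your system under test
        ok = correct_order_id(output, ex["expected_order_id"]) and llm_judge(ex["question"], output)
        passed += ok
        if not ok:
            print("FAIL:", ex["question"])
    print(f"{passed}/{len(examples)} passed")
```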
The non-negotiable rule: every fuzzy step needs at least one automated eval and one LLM judge in production before you ship — not after.
This works in reverse for due diligence. If you're an investor or a candidate evaluating an AI startup, "do you have LLM traces and evals on every fuzzy step?" tells you more about their engineering culture than any pitch deck.
10. The next architecture matters more than every GPU farm on earth
Scaling laws have ceilings. Throwing more compute at transformers will keep paying for a while, but not forever. The group that finds what comes after transformers defines the next decade of AI.
"Whoever discovered transformers had a tremendous impact on the direction of AI. I think we're going to see more of that in the coming years where some group of researchers that is iterating fast might discover certain things that would suddenly unlock that plateau and take us to the next step."
One discovery — say, an architecture that cuts compute by 10× — would change every product built on every model overnight.
Three vectors worth tracking:
- Architecture search — the replacement for the transformer hasn't been found. This is arguably the field's most important open problem.
- Multimodality — adding modalities (vision, audio, action) compounds gains across all of them. The endpoint of this trajectory is robotics.
- Method integration — pre-training, supervised fine-tuning, reinforcement learning, and unsupervised observation will be combined, not chosen between. Hybrid training pipelines are the near future.
The half-life of any specific technique is short. The half-life of understanding the principles underneath is long. Build on what works today, but stay close to the research, because the shift that makes everything obsolete is coming and nobody knows when.
What to do tomorrow
The gap between a weak AI product and a great one is almost never the model. It's the engineering layer. Five principles to steal from the lecture:
- Chain over single prompts for any multi-step task. Visibility beats cleverness.
- RAG over fine-tuning for knowledge-heavy applications. Fine-tune only in the three narrow cases above.
- Evals are not optional. Build them before you ship, not after. Every fuzzy step gets one automated eval and one LLM judge.
- The most autonomous architecture is rarely the best one. Choose autonomy by how much you can trust the output, not by how impressive the demo looks.
- Learn the principles, not the latest trick. Specific techniques have a short shelf life; engineering instincts don't.
For founders: decompose before you code. Map every deterministic and fuzzy step. Build deterministic infrastructure first. Wrap every fuzzy component in evals from day one.
For enterprise teams: pick one workflow, measure it without mercy, and start the organizational change now — that's the hard part, not the technology.
For engineers: build evals before you ship. Use LLM traces. Ask every AI startup you interview whether they have them.
Prompts are levers. Chains are systems. Agents are organizations. The model is not the product — the engineering layer is. Start there.