The Pragmatic Engineer
Designing Data-Intensive Applications with Martin Kleppmann — Lessons from the New Edition
Gergely Orosz
Apr 22, 2026
This is a podcast episode (≈1h25m) where Gergely Orosz interviews Martin Kleppmann — researcher, Bluesky advisor, and author of Designing Data-Intensive Applications (DDIA), one of the most-cited books on modern backend systems. The second edition just shipped, and the conversation covers what changed, what stayed, and where infrastructure is heading. The newsletter post itself is a structured set of takeaways and references rather than a long-form article, so this summary expands each takeaway into the underlying intuition the episode conveys.
Who Martin Kleppmann is, and why the book exists
Martin's path is unusual: he co-founded Rapportive (a Y Combinator startup that put LinkedIn-style profile cards inside Gmail), sold it to LinkedIn, worked there as Kafka was being open-sourced, then left industry for academia at Cambridge, where he now researches local-first software and cryptographic supply-chain transparency.
The book was born from pain, not theory. At Rapportive his team kept hitting database performance walls and were, in his words, "drowning" — making consequential design decisions without a mental model of how storage engines, replication, indexes, and consistency actually behave. They learned the fundamentals the hard way. DDIA is the book he wishes he'd had then.
A key clarification he repeats: DDIA is not for people building databases. It's for application developers — the people who pick a database, design a schema, debug a slow query, or argue about whether to add a cache. The goal is intuition, so when something goes wrong in production you have hypotheses instead of guesses.
How seeing Kafka up close shaped the book
While Martin was at LinkedIn, Kafka — an event streaming platform now used almost everywhere — was being open-sourced. Watching a serious distributed system being built in the open, by people who had to make every tradeoff explicit, gave him a unifying mental model: most data systems are variations on a small set of primitives (logs, indexes, replication, partitioning, transactions). Once you see that, you stop treating Postgres, Kafka, Elasticsearch, and Redis as alien species and start seeing the family resemblance. That's the lens the book teaches.
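To make the "small set of primitives" idea concrete, here is a toy sketch (my illustration, not from the episode) of the most unifying one: the append-only log. A Kafka topic, a database write-ahead log, and a replication stream are all elaborations of this structure, and a key-value index is just a projection of it.

```python
# A toy append-only log: the primitive underlying Kafka topics,
# database WALs, and replication streams alike.
class Log:
    def __init__(self):
        self.entries = []

    def append(self, record):
        self.entries.append(record)
        return len(self.entries) - 1  # offset of the new record

    def read(self, offset):
        return self.entries[offset:]

# A key-value "index" is just a projection of the log: replay it
# and keep the latest value written for each key.
def build_index(log):
    index = {}
    for key, value in log.read(0):
        index[key] = value
    return index

log = Log()
log.append(("user:1", "Ada"))
log.append(("user:2", "Grace"))
log.append(("user:1", "Ada Lovelace"))  # later write wins

print(build_index(log))
```

Replaying the same log always yields the same index, which is why logs also serve as the backbone of replication: ship the log, and every replica converges.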
Reliability, scalability, maintainability — and the cloud's effect
DDIA frames system quality around three pillars: reliability (works correctly even when things go wrong), scalability (copes with growth), and maintainability (people can keep working on it). The second edition's biggest shifts come from how the cloud has rewritten the scaling chapter:
- Replication for fault tolerance is still relevant at every scale. If a machine dies, you want another one with your data. This never went away.
- Sharding has become a specialist concern. Single machines are now huge — terabytes of RAM, hundreds of cores — and managed services hide a lot of partitioning. For most teams, manual sharding across machines is no longer a daily worry. The book still has a chapter on it because when you do need it, the consequences are severe, but the framing is "you probably don't need this yet."
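A quick sketch of why sharding is easy to start and painful to change (my illustration, with a naive modulo scheme; real systems use consistent hashing or pre-split ranges for exactly this reason):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same key always maps to the
    same partition, spreading keys evenly across machines."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# The catch: changing num_partitions remaps almost every key,
# which in production means mass data movement between machines.
keys = [f"user:{i}" for i in range(1000)]
before = {k: partition_for(k, 4) for k in keys}
after = {k: partition_for(k, 5) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved}/1000 keys moved when going from 4 to 5 partitions")
```

With modulo hashing, roughly 80% of keys relocate after adding one partition, which is a taste of why "just shard it" deserves the "you probably don't need this yet" framing.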
- Multi-region and multi-cloud are not best practices. Martin is firm on this: they are a risk-vs-cost trade-off, and ultimately a business decision. The job of an engineer is to articulate the tradeoff (extra latency, extra ops complexity, extra cost vs. resilience to a regional outage or a vendor going bad) — not to dictate the answer. DDIA gives you the vocabulary, not the verdict.
- Scaling down is as interesting as scaling up. Most engineers think "scale" means handling more traffic. But efficiently shrinking when traffic drops — paying nothing at 3am — is its own hard problem. Serverless is one of the building blocks that makes scaling-down practical, and the new edition takes it more seriously.
What got cut: MapReduce is mostly gone
The first edition had heavy MapReduce coverage. In 2026, almost nobody runs MapReduce directly — Spark, Flink, and friends took over. The new edition keeps just enough MapReduce as a teaching tool: it's still the cleanest way to build intuition about partitioned batch computation (split data, process in parallel, shuffle, combine). But it's no longer presented as a tool you'd reach for.
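The teaching skeleton (split, process in parallel, shuffle, combine) fits in a few lines. This is my minimal word-count sketch of the pattern, not code from the book:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # map: emit (key, value) pairs independently per input record
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # shuffle: group all values by key (the network-heavy step)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: combine each key's values into a final result
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_phase(d) for d in docs)  # parallel in real systems
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Spark and Flink generalize this into arbitrary dataflow graphs, but the three-phase intuition — independent mapping, a grouping shuffle, and a per-key reduce — survives intact underneath them.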
Why distributed systems theory looks paranoid
A section of the conversation is dedicated to why distributed systems papers feel absurdly pessimistic. The standard model assumes there is no upper bound on how long a network message might take — it could arrive in 100 microseconds, or 10 years, or never. Clocks can drift arbitrarily. Processes can pause for minutes (a long GC, a VM migration). Crashes can happen at the worst possible instant.
This is on purpose. If your correctness argument depends on "the network is usually fast" or "clocks are usually right," then the one time those assumptions break — and at scale, they will break — you get silent data corruption or split-brain. Reasoning under worst-case assumptions is the price of building systems that don't lie to users when something rare happens. Martin's point: those weird edge cases in textbooks are not academic curiosities; occasionally reality really does hit those extremes, and that's when companies make the news.
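One classic way those extremes bite, sketched below (my illustration of a failure mode Kleppmann has written about elsewhere): a client holds a time-limited lease, pauses longer than the lease lasts (a long GC, a VM migration), and resumes writing as if nothing happened.

```python
# Toy lease-based lock: correct only if clients notice time passing,
# which a paused process by definition cannot do.
class LeaseLock:
    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, client, now, ttl=10.0):
        if self.holder is None or now >= self.expires_at:
            self.holder, self.expires_at = client, now + ttl
            return True
        return False

lock = LeaseLock()
assert lock.acquire("A", now=0.0)   # A holds the lease until t=10
# A then pauses for 15 "seconds" (GC pause, VM migration)...
assert lock.acquire("B", now=15.0)  # lease expired, so B acquires it
# ...and wakes up unaware that any time has passed.
a_still_believes = True             # A never observed its own expiry
print(lock.holder, a_still_believes)  # two clients now act as the holder
```

The standard mitigation is a fencing token: a monotonically increasing number issued with each lease, which the storage layer uses to reject writes from stale holders. The point stands either way — "processes can pause for minutes" is not paranoia, it is a Tuesday.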
Engineering ethics: surfacing risk, not just shipping features
Martin argues an engineer's role is increasingly about making tradeoffs legible to decision-makers — and the relevant tradeoffs now include reputational, legal, and societal risk, not only latency and cost. If you're the only person in the room who understands what a feature actually does to users' data, you have an obligation to name the risk clearly enough that a non-engineer can decide. This is a recurring theme of the new edition.
Formal verification: from too-expensive to maybe-mainstream
Formal verification means proving a piece of code mathematically correct against a specification, using tools like TLA+, Isabelle, Rocq (formerly Coq), Lean, or FizzBee. Martin says he never used it in industry — it was simply too slow and too expensive for production timelines.
He thinks two AI-era trends might flip those economics:
- LLMs produce far more code than humans can carefully review. Human review becomes the bottleneck, and skimming large volumes of plausible-looking code is exactly where bugs slip through.
- LLMs are getting good at writing formal proofs, which were the other expensive part of formal methods.
Combine them and a workflow emerges: the LLM writes the code and a machine-checkable proof that it satisfies the spec. The proof checker — not the human reviewer — becomes the trust boundary. He wrote about this in detail in his "AI will make formal verification go mainstream" essay, referenced in the episode.
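To make "machine-checkable proof" concrete, here is a minimal Lean 4 sketch (my illustration, not from the episode). If the file compiles, the proof checker has verified the claims; no human reviewer needs to re-derive them:

```lean
-- Two tiny claims whose proofs are checked by the compiler, not a reviewer.
-- `rfl` proves the first by definitional computation; the second cites a
-- library lemma, and the checker confirms the citation actually applies.
theorem my_add_zero (n : Nat) : n + 0 = n := rfl

theorem my_add_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

The envisioned workflow scales this up: the LLM emits both the implementation and a proof that it meets its spec, and trust rests on the checker rather than on a tired reviewer.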
Local-first software: harder than it sounds
Local-first software is software where your data lives primarily on your own devices, syncs peer-to-peer, and works offline — with the cloud as an optional convenience rather than the source of truth. Martin co-authored the foundational paper on this and is researching it now.
The hard part isn't sync — CRDTs (conflict-free replicated data types) handle merging concurrent edits. The hard part is access control without a central server. A simple example he gives: imagine you revoke a user's access to a document. While the revocation is propagating, that user, on a plane, makes an edit. Their device thinks they still have access; yours thinks they don't. When the devices reconnect, they disagree about whether the edit "really happened." With a central server you just ask the server. Without one, you need a cryptographic protocol that produces the same answer on every device, even when they saw events in different orders. That's an open research problem, and it's where his current work lives.
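To show why sync is the "solved" half, here is a minimal state-based CRDT (my sketch of the textbook grow-only counter, not code from Martin's work). Merges commute, so replicas converge no matter the order in which they sync:

```python
# G-Counter: each replica increments only its own slot; merge takes
# the element-wise max. Merges commute, associate, and are idempotent,
# so any sync order converges to the same value.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def merge(self, other):
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())

# Two replicas edit offline, then sync in either order — same result.
a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)
print(a.value(), b.value())
```

Access control has no such symmetric merge: "was this edit authorized?" depends on whether the revocation or the edit came first, and without a central server there is no single arbiter of that order — which is precisely the open problem.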
Industry vs academia: each side underestimates the other
Martin has lived in both worlds and dislikes the mutual dismissal:
- Industry tends to call academic work "theoretical" and miss research that would directly help (he points at things like CRDTs, which started in academia and now run inside collaborative editors used by millions).
- Academia tends to call industry work "just engineering" and miss the genuinely hard problems being solved at scale.
His practical observation: the best PhD students he supervises are the ones who spent a few years as engineers first. They know which problems matter, they have intuition for what "production" means, and they don't waste years on elegant solutions to non-problems. He'd like to see more flow in both directions.
What he's working on now
Beyond local-first, Martin's research is using cryptography to bring transparency to supply chains without exposing sensitive data — essentially, letting parties prove things about goods (origin, certifications, chain of custody) without revealing competitive information like prices or suppliers. It's a concrete application of zero-knowledge-style techniques to a non-crypto-currency problem.
The 12 takeaways at a glance
For quick reference, the post itself organizes Martin's points as a numbered list. In compressed form:
- Watching Kafka being built at LinkedIn shaped DDIA's unifying mental model of data systems.
- He wrote DDIA because at Rapportive his team was "drowning" in database problems with no foundations.
- Knowing system internals is a superpower for application developers — the book is for them, not for database authors.
- Multi-region / multi-cloud are risk-vs-cost trade-offs, not best practices. Engineers articulate; business decides.
- Scaling down (e.g., serverless) is as interesting and challenging as scaling up.
- Replication for fault tolerance still matters everywhere; manual sharding has become a specialist concern.
- MapReduce is effectively dead in production, but kept as a teaching tool for partitioned batch computation.
- Distributed-systems theory's worst-case assumptions (unbounded delays, bad clocks, arbitrary crashes) are deliberate — reality hits those extremes occasionally.
- Engineers must surface risks — including societal ones — to decision-makers.
- Formal verification was too costly for industry; LLMs (more code + proof generation) may finally make it mainstream.
- Local-first software is hard mostly because of decentralized access control, not sync.
- The mutual dismissal between industry and academia hurts both; the best PhD students have prior engineering experience.
Episode chapters (timestamps)
A rough map for listeners: early career (00:00) · Rapportive (05:46) · LinkedIn (10:47) · writing DDIA (14:09) · reliability/scalability/maintainability (23:00) · DDIA second edition (26:24) · cloud tradeoffs (30:50) · how the cloud changed scaling (39:02) · the trouble with distributed systems (42:53) · ethics (49:02) · formal verification (52:45) · academia vs industry (1:00:12) · local-first (1:03:50) · CS education (1:09:50) · current research and advice (1:18:32).
Where to go deeper
- Designing Data-Intensive Applications, 2nd ed. — the book itself.
- Martin's Distributed Systems lecture series on YouTube — free, and an excellent companion to the book.
- Local-First Software: You Own Your Data, in spite of the Cloud — the foundational paper.
- AI will make formal verification go mainstream — Martin's essay on the LLM + proofs argument.
- Tools mentioned: TLA+, Isabelle, Rocq, Lean, FizzBee.
- Pragmatic Engineer related deep dives: Building Bluesky, Inside Uber's move to the cloud, The history of servers and the cloud, How Kubernetes is built, How AWS S3 is built.
Note: this newsletter post is the show-notes companion to a podcast episode. The deepest material is in the audio/video and the linked book; the post (and this summary) capture the episode's framework and key arguments.