The Pragmatic Engineer

Designing Data-Intensive Applications with Martin Kleppmann — Lessons from the New Edition

Gergely Orosz

Apr 22, 2026

Source: The Pragmatic Engineer · Author: Gergely Orosz · Date: 2026-04-22 · Original article

This is a podcast episode (≈1h25m) where Gergely Orosz interviews Martin Kleppmann — researcher, Bluesky advisor, and author of Designing Data-Intensive Applications (DDIA), one of the most-cited books on modern backend systems. The second edition just shipped, and the conversation covers what changed, what stayed, and where infrastructure is heading. The newsletter post itself is a structured set of takeaways and references rather than a long-form article, so this summary expands each takeaway into the underlying intuition the episode conveys.


Who Martin Kleppmann is, and why the book exists

Martin's path is unusual: he co-founded Rapportive (a Y Combinator startup that put LinkedIn-style profile cards inside Gmail), sold it to LinkedIn, worked there as Kafka was being open-sourced, then left industry for academia at Cambridge, where he now researches local-first software and cryptographic supply-chain transparency.

The book was born from pain, not theory. At Rapportive his team kept hitting database performance walls and were, in his words, "drowning" — making consequential design decisions without a mental model of how storage engines, replication, indexes, and consistency actually behave. They learned the fundamentals the hard way. DDIA is the book he wishes he'd had then.

A key clarification he repeats: DDIA is not for people building databases. It's for application developers — the people who pick a database, design a schema, debug a slow query, or argue about whether to add a cache. The goal is intuition, so when something goes wrong in production you have hypotheses instead of guesses.

How seeing Kafka up close shaped the book

While Martin was at LinkedIn, Kafka — an event streaming platform now used almost everywhere — was being open-sourced. Watching a serious distributed system being built in the open, by people who had to make every tradeoff explicit, gave him a unifying mental model: most data systems are variations on a small set of primitives (logs, indexes, replication, partitioning, transactions). Once you see that, you stop treating Postgres, Kafka, Elasticsearch, and Redis as alien species and start seeing the family resemblance. That's the lens the book teaches.
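As a taste of that family resemblance, here is a minimal sketch (mine, not from the book) of two of those primitives, an append-only log and a hash index over it; combined and varied, these are the bones of Kafka topics, database storage engines, and key-value stores alike.

```python
# A toy append-only log with a hash index: two of the primitives DDIA keeps
# returning to. Names and structure here are illustrative, not from the book.

class LogWithIndex:
    def __init__(self):
        self.log = []      # append-only sequence of (key, value) records
        self.index = {}    # key -> offset of the latest record for that key

    def put(self, key, value):
        offset = len(self.log)
        self.log.append((key, value))   # writes only ever append (Kafka-style)
        self.index[key] = offset        # the index makes point reads cheap

    def get(self, key):
        offset = self.index.get(key)
        return None if offset is None else self.log[offset][1]

store = LogWithIndex()
store.put("user:1", {"name": "Ada"})
store.put("user:1", {"name": "Ada Lovelace"})   # newer record wins on read
print(store.get("user:1"))                      # {'name': 'Ada Lovelace'}
print(len(store.log))                           # 2 -- the log keeps the history
```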

Reliability, scalability, maintainability — and the cloud's effect

DDIA frames system quality around three pillars: reliability (works correctly even when things go wrong), scalability (copes with growth), and maintainability (people can keep working on it). The second edition's biggest shifts come from how the cloud has rewritten the scaling chapter:

  • Replication for fault tolerance is still relevant at every scale. If a machine dies, you want another one with your data. This never went away.
  • Sharding has become a specialist concern. Single machines are now huge — terabytes of RAM, hundreds of cores — and managed services hide a lot of partitioning. For most teams, manual sharding across machines is no longer a daily worry. The book still has a chapter on it because when you do need it, the consequences are severe, but the framing is "you probably don't need this yet." (A toy hash-partitioning sketch follows this list.)
  • Multi-region and multi-cloud are not best practices. Martin is firm on this: they are a risk-vs-cost trade-off, and ultimately a business decision. The job of an engineer is to articulate the tradeoff (extra latency, extra ops complexity, extra cost vs. resilience to a regional outage or a vendor going bad) — not to dictate the answer. DDIA gives you the vocabulary, not the verdict.
  • Scaling down is as interesting as scaling up. Most engineers think "scale" means handling more traffic. But efficiently shrinking when traffic drops — paying nothing at 3am — is its own hard problem. Serverless is one of the building blocks that makes scaling-down practical, and the new edition takes it more seriously.
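To make the sharding point concrete, here is a toy hash-partitioning sketch (my own, with made-up numbers) that also shows why resharding hurts: change the shard count and most keys have to move.

```python
# Toy hash sharding: route each key to one of N shards. Illustrative only.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    # Use a stable hash (not Python's built-in hash(), which varies per process).
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

keys = [f"user:{i}" for i in range(10_000)]
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}   # add just one shard...

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys change shards")   # roughly 80% move
```

That rebalancing cost (and the operational care needed to do it online) is a big part of why manual sharding is worth deferring until you actually need it.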

What got cut: MapReduce is mostly gone

The first edition had heavy MapReduce coverage. In 2026, almost nobody runs MapReduce directly — Spark, Flink, and friends took over. The new edition keeps just enough MapReduce as a teaching tool: it's still the cleanest way to build intuition about partitioned batch computation (split data, process in parallel, shuffle, combine). But it's no longer presented as a tool you'd reach for.
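For intuition, here is a minimal single-process sketch (mine, not from the book) of that split/map/shuffle/combine shape, using word count, the canonical example.

```python
# Word count in the MapReduce shape: map -> shuffle (group by key) -> reduce.
# A single-process sketch of the idea; real systems run each phase in parallel
# across partitions of the input.
from collections import defaultdict

def map_phase(document: str):
    for word in document.split():
        yield (word.lower(), 1)            # emit (key, value) pairs

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)         # bring all values for a key together
    return grouped

def reduce_phase(key, values):
    return key, sum(values)                # combine per key

documents = ["the log is the database", "the database is a log"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)   # {'the': 3, 'log': 2, 'is': 2, 'database': 2, 'a': 1}
```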

Why distributed systems theory looks paranoid

A section of the conversation is dedicated to why distributed systems papers feel absurdly pessimistic. The standard model assumes there is no upper bound on how long a network message might take — it could arrive in 100 microseconds, or 10 years, or never. Clocks can drift arbitrarily. Processes can pause for minutes (a long GC, a VM migration). Crashes can happen at the worst possible instant.

This is on purpose. If your correctness argument depends on "the network is usually fast" or "clocks are usually right," then the one time those assumptions break — and at scale, they will break — you get silent data corruption or split-brain. Reasoning under worst-case assumptions is the price of building systems that don't lie to users when something rare happens. Martin's point: those weird edge cases in textbooks are not academic curiosities; occasionally reality really does hit those extremes, and that's when companies make the news.
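One concrete way those assumptions bite, sketched here with made-up numbers: last-write-wins conflict resolution trusts wall clocks, so a replica whose clock runs slightly slow can silently lose the genuinely newer write.

```python
# Why "clocks are usually right" is a dangerous correctness assumption:
# last-write-wins (LWW) keeps whichever write carries the larger timestamp.
# The numbers below are invented purely to show the failure mode.

def lww_merge(a, b):
    """Each write is a (timestamp_ms, value) pair; the larger timestamp wins."""
    return a if a[0] >= b[0] else b

# Replica A's clock is accurate; replica B's clock runs 500 ms slow.
write_old = (1_700_000_000_000, "old value")        # written first, on A
write_new = (1_700_000_000_000 - 200, "new value")  # written 300 ms later on B,
                                                    # but stamped earlier due to skew

print(lww_merge(write_old, write_new))  # keeps "old value": the newer write is lost
```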

Engineering ethics: surfacing risk, not just shipping features

Martin argues an engineer's role is increasingly about making tradeoffs legible to decision-makers — and the relevant tradeoffs now include reputational, legal, and societal risk, not only latency and cost. If you're the only person in the room who understands what a feature actually does to users' data, you have an obligation to name the risk clearly enough that a non-engineer can decide. This is a recurring theme of the new edition.

Formal verification: from too-expensive to maybe-mainstream

Formal verification means proving a piece of code mathematically correct against a specification, using tools like TLA+, Isabelle, Rocq (formerly Coq), Lean, or FizzBee. Martin says he never used it in industry — it was simply too slow and too expensive for production timelines.

He thinks two AI-era trends might flip those economics:

  1. LLMs produce far more code than humans can carefully review. Human review becomes the bottleneck, and skimming large volumes of unfamiliar code is exactly where bugs slip through.
  2. LLMs are getting good at writing formal proofs, which were the other expensive part of formal methods.

Combine them and a workflow emerges: the LLM writes the code and a machine-checkable proof that it satisfies the spec. The proof checker — not the human reviewer — becomes the trust boundary. He wrote about this in detail in his "AI will make formal verification go mainstream" essay, referenced in the episode.
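To make "machine-checkable proof" concrete, here is a toy Lean 4 sketch of the shape of that workflow: a small function as the "code", a theorem as the "spec", and Lean's kernel as the checker. The function and theorem names are mine, purely for illustration; Lean is one of the tools he names above.

```lean
-- A toy of the code-plus-proof workflow: `total` is the "code", the theorem is
-- the "spec", and Lean's kernel (not a human reviewer) checks the proof.
-- Names are illustrative, not from the episode.

def total : List Nat → Nat
  | []      => 0
  | x :: xs => x + total xs

-- Spec: summing a concatenation equals summing the parts.
theorem total_append (xs ys : List Nat) :
    total (xs ++ ys) = total xs + total ys := by
  induction xs with
  | nil => simp [total]
  | cons x xs ih => simp [total, ih, Nat.add_assoc]
```

If the proof does not go through, the checker rejects it; the human's job shifts from reading every line of code to agreeing that the spec says the right thing.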

Local-first software: harder than it sounds

Local-first software is software where your data lives primarily on your own devices, syncs peer-to-peer, and works offline — with the cloud as an optional convenience rather than the source of truth. Martin co-authored the foundational paper on this and is researching it now.

The hard part isn't sync — CRDTs (conflict-free replicated data types) handle merging concurrent edits. The hard part is access control without a central server. A simple example he gives: imagine you revoke a user's access to a document. While the revocation is propagating, that user, on a plane, makes an edit. Their device thinks they still have access; yours thinks they don't. When the devices reconnect, they disagree about whether the edit "really happened." With a central server you just ask the server. Without one, you need a cryptographic protocol that produces the same answer on every device, even when they saw events in different orders. That's an open research problem, and it's where his current work lives.
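For contrast with that access-control problem, here is a minimal sketch (mine, not from his paper) of why sync is the more tractable part: a grow-only counter, one of the simplest CRDTs, converges to the same value on every device no matter the order in which replicas exchange state.

```python
# A G-Counter (grow-only counter), one of the simplest CRDTs: each replica
# only increments its own slot, and merge takes the per-replica maximum.
# Merge is commutative, associative, and idempotent, so every device converges
# to the same value regardless of the order in which it sees updates.

class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter"):
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())

laptop, phone = GCounter("laptop"), GCounter("phone")
laptop.increment(3)            # edits made offline on each device
phone.increment(2)
laptop.merge(phone)            # sync in either order, any number of times...
phone.merge(laptop)
print(laptop.value(), phone.value())   # 5 5 -- both devices converge
```

There is no analogously simple merge rule for "was this user allowed to make that edit?", which is why decentralized access control is the open problem.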

Industry vs academia: each side underestimates the other

Martin has lived in both worlds and dislikes the mutual dismissal:

  • Industry tends to call academic work "theoretical" and miss research that would directly help (he points at things like CRDTs, which started in academia and now run inside collaborative editors used by millions).
  • Academia tends to call industry work "just engineering" and miss the genuinely hard problems being solved at scale.

His practical observation: the best PhD students he supervises are the ones who spent a few years as engineers first. They know which problems matter, they have intuition for what "production" means, and they don't waste years on elegant solutions to non-problems. He'd like to see more flow in both directions.

What he's working on now

Beyond local-first, Martin's research is using cryptography to bring transparency to supply chains without exposing sensitive data — essentially, letting parties prove things about goods (origin, certifications, chain of custody) without revealing competitive information like prices or suppliers. It's a concrete application of zero-knowledge-style techniques to a non-crypto-currency problem.


The 12 takeaways at a glance

For quick reference, the post itself organizes Martin's points as a numbered list. In compressed form:

  1. Watching Kafka being built at LinkedIn shaped DDIA's unifying mental model of data systems.
  2. He wrote DDIA because at Rapportive his team was "drowning" in database problems with no foundations.
  3. Knowing system internals is a superpower for application developers — the book is for them, not for database authors.
  4. Multi-region / multi-cloud are risk-vs-cost trade-offs, not best practices. Engineers articulate; business decides.
  5. Scaling down (e.g., serverless) is as interesting and challenging as scaling up.
  6. Replication for fault tolerance still matters everywhere; manual sharding has become a specialist concern.
  7. MapReduce is effectively dead in production, but kept as a teaching tool for partitioned batch computation.
  8. Distributed-systems theory's worst-case assumptions (unbounded delays, bad clocks, arbitrary crashes) are deliberate — reality hits those extremes occasionally.
  9. Engineers must surface risks — including societal ones — to decision-makers.
  10. Formal verification was too costly for industry; LLMs (more code + proof generation) may finally make it mainstream.
  11. Local-first software is hard mostly because of decentralized access control, not sync.
  12. The mutual dismissal between industry and academia hurts both; the best PhD students have prior engineering experience.

Episode chapters (timestamps)

A rough map for listeners: early career (00:00) · Rapportive (05:46) · LinkedIn (10:47) · writing DDIA (14:09) · reliability/scalability/maintainability (23:00) · DDIA second edition (26:24) · cloud tradeoffs (30:50) · how the cloud changed scaling (39:02) · the trouble with distributed systems (42:53) · ethics (49:02) · formal verification (52:45) · academia vs industry (1:00:12) · local-first (1:03:50) · CS education (1:09:50) · current research and advice (1:18:32).

Where to go deeper

Note: this newsletter post is the show-notes companion to a podcast episode. The deepest material is in the audio/video and the linked book; the post (and this summary) capture the episode's framework and key arguments.
