System Design Classroom

Event-Driven Systems Are Easy to Build but Hard to Keep Correct


Raul Junco

Apr 25, 2026

11 min read



The lie at the center of event-driven systems

The biggest lie event-driven systems tell is that working infrastructure means a working system. It doesn't.

The broker is healthy. Consumers are online. Messages are flowing. Dashboards are green. And yet the business is still getting the wrong result:

  • A customer gets charged twice.
  • A refund lands before the original payment.
  • An order exists in the database, but no downstream service knows about it.
  • A producer ships a "harmless" schema change and quietly breaks consumers.

That is the real difficulty in event-driven systems. They rarely fail as loud availability incidents. They fail as quiet correctness bugs that spread across retries, ordering gaps, dual writes, and schema drift.

This piece walks through the four most common ways those quiet failures show up — and what to design for instead of around.


1) Duplication breaks correctness long before it breaks infrastructure

Most teams know duplicate events can happen. Fewer teams build like they actually believe it.

In a distributed system, duplicate delivery is normal, not exotic. It happens whenever:

  • A consumer processes an event, crashes before acknowledging it, and receives the same event again on restart.
  • A producer times out while publishing, retries, and accidentally sends a second copy.
  • A broker redelivers during failover or partition rebalance.

None of this requires anything unusual. It's the standard behavior of networks + retries + failures.

The trouble starts when the business operation isn't safe to repeat:

  • PaymentCaptured processed twice → customer charged twice.
  • InventoryReserved processed twice → inventory overshoots, blocking legitimate orders.
  • OrderConfirmed processed twice → duplicate shipments, emails, loyalty points.

Infrastructure still looks healthy. The system has already started to decay.

Idempotency is part of the contract

An idempotent consumer is one that can receive the same event more than once and still leave the system in the same final state. The standard pattern:

  1. Every event carries a stable, unique ID (event_id).
  2. Before applying side effects, the consumer checks whether it has already processed that ID.
  3. If yes → skip. If no → process the event and durably record that it has been processed.

[Figure: Idempotent consumer with event_id dedup table]

The "durable" part matters a lot. If you process the side effect first and record the event ID afterwards, a crash between those two steps puts you right back where you started. The safest version makes the business write and the dedup record happen atomically in the same database transaction. A UNIQUE constraint on event_id is your friend — let the storage layer enforce what is easy to get wrong in code.

The boundary problem

That protection stops at the database boundary. If the consumer also:

  • Sends an email,
  • Calls a payment gateway,
  • Triggers a webhook,

…then you need an idempotency key the external service understands. Otherwise retries can still duplicate the side effect even if your local DB is safe.
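
A common convention here (popularized by payment APIs such as Stripe) is to forward the event ID as an idempotency key header. A sketch, with a hypothetical gateway endpoint:

import requests

def capture_payment(event):
    resp = requests.post(
        "https://payments.example.com/v1/charges",       # hypothetical gateway endpoint
        json={"order_id": event["order_id"], "amount_cents": event["amount_cents"]},
        headers={"Idempotency-Key": event["event_id"]},  # gateway dedups retried calls
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()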

Shape operations to be naturally idempotent

You can lower the risk by writing operations that are safe to repeat by design. Setting a target state is safer than applying an incremental delta:

  • status = SHIPPED is safer than shipped_count += 1.
  • Replacing a projection row is safer than appending blindly.

Small choices compound in distributed systems.

The trade-off: extra storage, retention decisions, and a more complex consumer path. But that cost is predictable. The cost of duplicate side effects is not.

Once you accept that events can show up more than once, the next uncomfortable truth is: even if every event arrives only once, it might not arrive when you expect it to.


2) Ordering assumptions collapse as soon as the system gets real traffic

A lot of event-driven workflows look perfect in local testing for one reason: local testing is too polite. Messages arrive one at a time. The consumer is fast. Nothing retries. Nothing competes for the same partition.

Production isn't polite. Events arrive out of order because:

  • A message gets delayed in the network while a later one moves through faster.
  • A retry brings back an older event after a newer one has already succeeded.
  • Parallel consumers finish work in a different sequence than messages were produced.

The workflow still runs, but the sequence stops matching reality:

  • A refund is processed before the original payment.
  • An account is closed before a final balance adjustment arrives.
  • A "delivered" event reaches a read model before the "shipped" event does.

The system isn't down. It's just wrong.

"Kafka preserves order" is half-true

Many engineers reach for a broker feature and stop thinking. Kafka and similar systems preserve order within a partition — that's useful, but it is not global ordering across the whole system. That difference matters.

[Figure: Partitioning by entity key for local ordering]

In most real systems, order only matters within a business entity: one order, one account, one cart, one user. That gives you the right mental model:

Partition by the entity whose timeline must remain consistent.

If every event for order-123 lands in the same partition, you get a realistic path to local ordering where it actually matters. That is usually enough.
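
A minimal sketch of that keying, assuming the kafka-python client (any producer with key-based partitioning behaves the same); the topic name and payloads are illustrative:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish_order_event(event_type, order_id, payload):
    # Same key -> same partition -> the broker preserves per-order sequence.
    producer.send("orders", key=order_id, value={"type": event_type, **payload})

publish_order_event("OrderShipped", "order-123", {"carrier": "UPS"})
publish_order_event("OrderDelivered", "order-123", {})  # lands behind OrderShipped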

Even partitioned ordering has gotchas

During consumer-group rebalancing, one consumer can lose a partition while another takes over. Sloppy offset handling, or side effects still in flight, can produce behavior that looks out-of-order from the business point of view, even though Kafka technically preserved partition order.

Detect out-of-order, don't just hope

A strong design carries enough information to detect stale or missing events:

  • Sequence numbers per entity.
  • Aggregate versions (e.g., the order is at version 7).
  • State-machine validation: SHIPPED → DELIVERED is valid, but PLACED → DELIVERED is not.

If the consumer receives event 12 before event 11, it should know something is off. If it receives an old transition that no longer makes sense, it should reject it instead of applying it.
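
A sketch of both checks together, with illustrative field names and transitions:

class StaleOrGapError(Exception): pass
class InvalidTransitionError(Exception): pass

# Illustrative transitions: SHIPPED -> DELIVERED is valid, PLACED -> DELIVERED is not.
VALID_TRANSITIONS = {
    "PLACED": {"SHIPPED", "CANCELLED"},
    "SHIPPED": {"DELIVERED"},
}

def apply_event(order, event):
    # Sequence check: receiving event 12 while at version 10 means 11 is missing.
    if event["sequence"] != order["version"] + 1:
        raise StaleOrGapError(
            f"expected seq {order['version'] + 1}, got {event['sequence']}")
    # State-machine check: reject transitions that no longer make sense.
    if event["new_status"] not in VALID_TRANSITIONS.get(order["status"], set()):
        raise InvalidTransitionError(f"{order['status']} -> {event['new_status']}")
    order["status"] = event["new_status"]
    order["version"] = event["sequence"]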

The lesson: strict ordering is expensive, fragile, and easy to lose at scale. Good systems use ordering only where necessary and don't bet the whole workflow on the hope that events will always arrive in the right sequence.

And once you stop trusting sequence, another flaw becomes impossible to ignore: what happens when one action needs to update the database and publish an event at the same time?


3) The dual write problem creates inconsistencies that look invisible at first

This is one of the easiest mistakes to write — and one of the hardest to clean up.

saveOrder(order)        // writes to the database
publish(OrderPlaced)    // sends to the broker

Two lines. Two successful API calls in the happy path. One serious correctness problem.

Those two writes don't share a transaction boundary. Either side can fail independently:

  • DB commits, broker publish fails → the order exists, but inventory, billing, shipping, and analytics never hear about it. Source of truth says it's real; everyone else acts as if it isn't.
  • Broker publish succeeds, DB transaction rolls back → downstream services react to an order that never truly existed. Phantom work spreads through systems that trust events as facts.

The dual write problem is dangerous because it doesn't usually cause an outage. It causes silent divergence.

Exactly-once doesn't save you here

Kafka's "exactly-once semantics" help with atomicity and dedup inside Kafka workflows. They do not make your database write and your broker publish one atomic step — those still live in different systems.

The transactional outbox

The most practical default fix:

[Figure: Transactional outbox pattern]

Instead of writing the business data and publishing in the same request, the service writes two records in one local DB transaction:

  1. The business change (the order row).
  2. An outbox row representing the event to be published.

If the transaction commits, both persist together. If it fails, neither does. A separate publisher process later reads the outbox table and sends the events to the broker.

Why this is stronger: the atomic boundary moves to the database, which you already control. You stop pretending the broker and database can behave like one atomic write — they can't, because they live in different systems. The database becomes authoritative, and you publish from durable intent.
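
A minimal sketch, again using sqlite3 for self-containment; publish_to_broker stands in for whatever producer you actually use:

import json, sqlite3

conn = sqlite3.connect("orders.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT PRIMARY KEY, total_cents INTEGER)")
conn.execute("""CREATE TABLE IF NOT EXISTS outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)""")

def place_order(order_id, total_cents):
    with conn:  # one local transaction: both rows commit together, or neither does
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                     ("OrderPlaced", json.dumps({"order_id": order_id})))

def relay_outbox(publish_to_broker):
    # The separate publisher: read unpublished rows in insertion order.
    rows = conn.execute("SELECT id, event_type, payload FROM outbox "
                        "WHERE published = 0 ORDER BY id").fetchall()
    for row_id, event_type, payload in rows:
        publish_to_broker(event_type, json.loads(payload))  # retries can duplicate; see below
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))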

What the outbox doesn't fix

  • The publisher still retries → events can still be published more than once → consumers still need idempotency (see section 1).
  • The publisher needs ordering discipline. If parallel workers or retry logic publish outbox rows out of sequence, you can reintroduce ordering bugs in the very layer that was meant to make delivery safer.

CDC: a sibling approach

Log-based Change Data Capture solves the same problem from another angle. Instead of explicitly writing an outbox row, CDC captures committed changes from the database transaction log and turns them into events.

It works well when a team wants the DB log to drive integration, but it adds operational overhead: connectors, lag, replay concerns, schema mapping, and debugging through infrastructure outside your application code. Outbox is usually a simpler starting point.

If writing data and publishing events can split your truth, then changing the shape of those events can split your meaning.


4) Schema changes break systems quietly because contracts fail quietly

A producer changes an event schema. The deployment goes through cleanly. That's often the moment the trouble starts.

Possible "small" changes:

  • A field is renamed.
  • A required field is removed.
  • A number becomes a string.
  • A nested object is restructured.
  • The meaning of a field changes even though the name stays the same.

From the producer's side, it looks harmless: their service compiles, tests pass, deployment is green. But downstream consumers may still depend on the old shape or semantics. Some fail loudly. Others keep running and do the wrong thing quietly — which is usually worse.

This happens because teams treat events like implementation details. They are not. Once another service depends on an event, it's a contract — and contracts need change discipline.

Compatibility rules

  • Backward-compatible changes by default. Adding an optional field is usually safe.
  • Removing a required field is usually not.
  • Renaming a field looks tiny in the producer's codebase but behaves like a breaking API change to every consumer (see the sketch after this list).
  • Changing semantics without changing shape is the most dangerous of all — it preserves technical compatibility while breaking business meaning.
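
To make the middle two rules concrete, here are illustrative payloads (not from any real schema):

v1          = {"order_id": "order-123", "amount": 4200}
v2_safe     = {"order_id": "order-123", "amount": 4200, "currency": "USD"}  # added optional field
v2_breaking = {"order_id": "order-123", "total_amount": 4200}               # renamed field

def legacy_consumer(event):
    return event["amount"]

legacy_consumer(v2_safe)  # fine: old code ignores the field it doesn't know
# legacy_consumer(v2_breaking) would raise KeyError -- the loud version of the failure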

Tools that make this less painful

  • Versioning makes change visible.
  • Schema-aware formats like Avro or Protobuf force teams to think about compatibility instead of drifting casually.
  • CI compatibility checks catch bad changes before they spread.
  • A schema registry in production enforces compatibility rules before a breaking change reaches consumers.

But even with all of that, sometimes a producer must move to a new schema before every consumer is ready.

Adapters earn their place

[Figure: Adapter translating new schema to old]

An adapter translates the new event shape into an older one so legacy consumers keep working while the system migrates gradually; a short sketch follows the list below. It's not glamorous, but it:

  • Reduces synchronized deployments.
  • Avoids forcing five teams to coordinate one rollout window.
  • Lets the system evolve without turning every schema change into an organizational incident.
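
Continuing the hypothetical v1/v2 payloads from the previous sketch, the adapter can be as small as a translation function:

def adapt_v2_to_v1(event_v2):
    # Undo the rename so consumers written against v1 keep working unchanged.
    return {"order_id": event_v2["order_id"], "amount": event_v2["total_amount"]}

legacy_consumer(adapt_v2_to_v1(v2_breaking))  # returns 4200, no consumer redeploy needed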

The deeper point: schema evolution isn't just a serialization concern. It's a coordination concern. Shared events mean shared responsibility.


The real problem is not events. It is designing only for the happy path

Duplication, reordering, dual writes, and schema drift look like four separate issues — but they come from the same mistake.

Teams design the happy path in detail and treat failure behavior as cleanup work. They spend their time on:

  • What events exist
  • Who publishes them
  • Who subscribes to them
  • How the workflow behaves when everything goes right

…and much less time on redelivery, replay, stale data, partial failure, delayed delivery, and contract evolution. That's exactly why so many event-driven systems look elegant on a whiteboard and turn messy in production.

Don't judge an event-driven system by how clean its event flow looks on a diagram. Judge it by what happens when the same message arrives twice, when event 8 shows up before event 7, when the broker is healthy but the publish still fails, and when one team changes an event contract without fully understanding who depends on it.

The right takeaway is not "avoid events." It is to design them with the same seriousness you'd apply to a database schema or a public API.


What to internalize

If you build event-driven systems, assume four things from day one:

  1. Duplicates will happen → consumers must be idempotent.
  2. Order will break somewhere → don't depend on global ordering; partition by entity, version your aggregates, validate transitions.
  3. Two-step writes will eventually diverge → use the transactional outbox (or CDC) to make the database the atomic boundary.
  4. Schemas will change before everyone is ready → treat events as contracts, use schema-aware formats and registries, and reach for adapters during migrations.

If your design doesn't account for these, it isn't finished.

That doesn't make event-driven architecture bad — it makes it honest. Async systems can decouple teams, absorb load well, and enable powerful integration patterns. But they don't remove complexity; they relocate it into retries, consistency boundaries, replay, and contracts. You still pay the bill — you just pay it in different places.

That is why event-driven systems are easy to build and hard to keep correct.

