I’ve recently written about navigating the uncertainties of a complex integration stack. One of the biggest recurring issues we faced was duplicates: bursts of repeated updates that caused major incidents for our customers.
Our core operations were not idempotent – the same instruction, issued multiple times, executed its side effects as many times as it was issued.
In our domain, this caused several problems:
- Heavy, expensive reprocessing on both our side and our customers’ side.
- Long-running computations that could take hours to complete.
- In extreme cases, hundreds of duplicate instructions piling up and blocking time-sensitive work.
This was unsustainable. We needed a deterministic, systemic fix.
This post walks through:
- The primary cause of non-idempotent behaviour in our pipeline
- Why idempotency was essential to preserving business outcomes
- How we redesigned the system to eliminate duplicates without altering semantics.
The Problem: Duplicates After Projection
Designing an idempotent distributed system often means handling duplicates that arise beyond transport-level guarantees such as at-least-once delivery.
Our pipeline ingested and merged data from multiple sources, then produced several different downstream projections. These projections reflected our business domain: each customer required a different subset of the combined data.
Since each projection was a view of the merged data, distinct source updates could collapse into the same customer-facing representation – a duplicate.
To illustrate, consider this example:
Combined Data:
{
"banana_id": "B-007",
"name": "Mr. Banana",
"ripeness_level": "Mostly Yellow"
}
Sent to Adapter/Customer 1:
{
"banana_id": "B-007",
"ripeness_level": "Mostly Yellow"
}
Sent to Adapter/Customer 2:
{
"banana_id": "B-007"
}
Now Mr. Banana’s ripeness_level changes:
Combined Data (Updated):
{
"banana_id": "B-007",
"name": "Mr. Banana",
"ripeness_level": "Perfectly Yellow"
}
Sent to Adapter/Customer 1:
{
"banana_id": "B-007",
"ripeness_level": "Perfectly Yellow"
}
Sent to Adapter/Customer 2:
{
"banana_id": "B-007"
}
Adapter 1 receives two distinct updates.
Adapter 2 receives the exact same message twice.
Combined Data           → Adapter 1 → OK
                        → Adapter 2 → OK
Combined Data (Updated) → Adapter 1 → OK
                        → Adapter 2 → DUPLICATE! 🍌
The core issue: the underlying entity changed, but Adapter 2’s projection did not, so it ended up with a duplicate. Any adapter whose projection omits the updated field is affected in exactly the same way.
This type of deduplication must happen after projections, at the edge of the pipeline, on a per-customer adapter basis.
Deduplication sat between projection generation and adapter-specific processing, acting as a dam before downstream side effects occurred.
When a duplicate was detected, adapters skipped processing and tracked the occurrence. Deduplication errors or timeouts were handled like any other pipeline failure: through replays rather than best-effort processing.
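In adapter terms, the flow was conceptually simple. A simplified sketch (the names are illustrative, not our actual interfaces):

def handle_projection(projection: dict, dedup_client, adapter, metrics) -> None:
    # If the deduplication call itself fails or times out, let the exception
    # propagate: the pipeline replays the event instead of processing best-effort.
    if dedup_client.is_duplicate(projection):
        # Duplicate: skip downstream side effects, but track the occurrence.
        metrics.increment("dedup.duplicates_skipped")
        return
    # Not a duplicate: continue with adapter-specific processing.
    adapter.process(projection)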
The First Solution: Last Seen Deduplication
When the first serious duplicate-related incident hit us, a colleague proposed a straightforward and effective approach:
- Serialise the message to a deterministic JSON representation
- Hash the payload
- Retrieve the previously stored hash by ID
- Compare the previously stored hash with the new one
- If identical -> return duplicate
- If different -> return not duplicate and store the new hash.
This gave us last-seen-message deduplication. For example:
A -> B -> A -> B -> A : no duplicates
A -> A -> A -> A -> B : only the first A and B are non-duplicates
We needed this to work immediately, and we were already late in the day (we wanted a fix before going home). A key/value store seemed like the obvious, modern choice: O(1) lookups, no schema changes, and Redis instances already running. We could implement the fix within minutes.
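In simplified Python, that first version looked roughly like this (illustrative, not the exact code we ran):

import hashlib
import json

import redis

store = redis.Redis()  # the already-running Redis instance

def is_duplicate(message_id: str, payload: dict) -> bool:
    # Serialise the payload and hash it. Serialising "as received" is only
    # deterministic while field order never changes -- the flaw that surfaced later.
    digest = hashlib.sha256(json.dumps(payload).encode("utf-8")).hexdigest()
    previous = store.get(message_id)
    if previous is not None and previous.decode("utf-8") == digest:
        return True  # same hash as the last seen message -> duplicate
    store.set(message_id, digest)  # different (or first) -> store the new hash
    return False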
The dam was built. Problem solved.
How the Dam Cracked
The first failure mode was subtle. It occasionally returned not duplicate for the same message:
Sent to Service 1 (First):
{
"banana_id": "B-007",
"ripeness_level": "Perfectly Yellow"
}
Sent to Service 1 (Second):
{
"ripeness_level": "Perfectly Yellow",
"banana_id": "B-007"
}
A change in field order was enough to bypass the deduplication logic.
We needed deduplication based on structural equivalence instead of the serialised payload. Two messages should be considered equal if their JSON trees have the same fields, values and nesting, regardless of field order.
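In Python terms, the difference is easy to see (illustrative):

import json

first = '{"banana_id": "B-007", "ripeness_level": "Perfectly Yellow"}'
second = '{"ripeness_level": "Perfectly Yellow", "banana_id": "B-007"}'

# Comparing raw payloads (or hashes of them) is order-sensitive:
assert first != second

# Comparing the parsed trees is not: two objects are equal when they have the
# same fields, values and nesting, regardless of field order.
assert json.loads(first) == json.loads(second)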
Then came the second, more serious failure, making the dam collapse: the Redis instances had no persistent storage. One day they vanished along with the entire deduplication state.
Time for a Sustainable Strategy
We considered a few alternatives before deciding how to approach deduplication across the pipeline.
- Idempotency keys / versioning: These require producers to understand semantics that only emerge downstream, at the adapter level.
- Kafka log compaction: Compaction also operates before projection. More importantly, it does not guarantee that consumers see only the latest version of an event.
Neither approach addressed the real problem. We decided to build a dedicated deduplicator service.
The New Deduplicator
We kept the core idea – hash-based, last-seen-message deduplication – incorporated the lessons we had learned, and extracted it into a dedicated service.
That would offer us:
- Consistent Behaviour: The same deduplication semantics everywhere make the system easier to reason about.
- Clear Properties & Policies: Transparency around performance, failures and retries.
- Consistent Tooling: Deduplication needs operational tooling for analysis, diagnosis and reprocessing.
We envisioned a growing number of services relying on this new deduplication service.
It needed to support:
- Synchronous Check: Downstream services decide whether to continue or skip processing.
- Moderate Throughput: an average target below 100 RPS, enough to process millions of requests per day, with capacity headroom for bursts.
- Bounded Latency: 99th percentile target below 250 ms; rare worst-case slow paths below 2 s tolerated without triggering client timeouts.
- Strong Consistency: Linearizable dedup decisions – no false positives/negatives, even under concurrency.
- Domain Agnostic: Perform structural comparisons on deterministic representations defined by the clients. Avoid JSON canonicalisation or semantic equivalence, which would require domain interpretation and restrict flexibility of usage.
- Application-level Retries: Retries must restart from the beginning of the flow to prevent processing stale data.
Core Invariants:
Safe and Deterministic
Each deduplication request was identified by a composite key:
- messageId: Corresponded to the underlying entityId; globally unique and stable across the business.
- serviceId: Mapped to a projection of that entity, preventing cross-service interference.
This ensured isolated and deterministic comparisons.
Preserving Order Guarantees
Events within the pipeline were partitioned by entity identifier (the messageId received by the deduplicator). Kafka, combined with a sound partition strategy, preserved ordering within each partition.
Deduplication was a synchronous step in the pipeline, so requests arrived in the same order as events.
For any given (serviceId, messageId), deduplication decisions were strictly ordered.
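For illustration, this per-partition ordering falls out of keying messages by the entity identifier (a sketch using kafka-python; the topic name and values are made up):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# The default partitioner hashes the key, so every update for the same entity
# lands on the same partition and keeps its order.
producer.send(
    "banana-updates",  # hypothetical topic
    key=b"B-007",      # entity id -- the messageId received by the deduplicator
    value=b'{"banana_id": "B-007", "ripeness_level": "Perfectly Yellow"}',
)
producer.flush()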
Safe Retries
The pipeline ensured that only the latest version of an event could be replayed, preventing stale data during reprocessing.
If processing failed after deduplication, we used operational tooling to reset the dedup state for a specific (messageId, serviceId) before replaying the event.
This path was rare and manual by design.
Hash Collisions: Theoretical, Not Practical
We used a cryptographic hash of the parsed JSON tree to detect changes.
Hash collisions are theoretically possible, but with the last-seen deduplication approach they would require two distinct, successive JSON structures for the same key to hash identically. Given non-adversarial inputs and strong hash functions, we treated collisions as practically impossible.
Guaranteeing absolute collision freedom would have added significant complexity for negligible benefit.
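The exact scheme matters less than its stability: conceptually, it is just a deterministic digest over the parsed tree. For example (illustrative sketch, not our actual implementation):

import hashlib
import json

def tree_hash(tree) -> str:
    # Re-serialise the parsed tree with sorted keys so structurally equivalent
    # objects produce identical bytes, then hash those bytes.
    canonical = json.dumps(tree, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()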
Bounded State
The service only stored keys and the associated latest hashed message.
Thus, the state was bounded by:
number of entities * number of projections
The number of entities was large, but finite and well-understood.
Correctness over Availability
Wrong deduplication decisions were unacceptable in our domain:
- False negatives => expensive repeated work.
- False positives => missed deliveries.
We needed linearizable behaviour: requests for a given (messageId, serviceId) pair had to be processed one by one, preserving ordering per key.
This ruled out caching: stale reads would have broken correctness.
The Database Layer
We initially implemented the service using PostgreSQL with SERIALIZABLE isolation to guarantee linearizability. The goal was to validate the approach, build new integrations on top of it and eliminate recurring operational incidents.
This version of the service ran in production for several years and sustained the required throughput.
It worked in practice because contention was limited:
- There was no contention between deduplication requests triggered by different services.
- Concurrent deduplication requests for the same (messageId, serviceId) pair were rare.
- Clear retry policies ensured correctness when contention did occur.
Performance Improvements
The initial version traded latency for certainty and speed of development. As adoption grew, traffic patterns evolved. Bursts became more frequent, and tail latency (p99) occasionally climbed to seconds.
Two hot paths were tightened to address this without changing semantics or invariants.
Atomic Compare-and-Swap
The core deduplication logic already followed a compare-and-swap pattern:
- Update the stored hash only if it differs from the previous value.
- Compare and update atomically.
- Operate on state stored in a single database row.
We reduced it to a single atomic statement:
UPDATE dedup_state
SET hashed_message = :new_hash
WHERE message_id = :message_id
AND service_id = :service_id
AND hashed_message IS DISTINCT FROM :new_hash
RETURNING 1;
Thus, in this case, the same correctness guarantee could be achieved with the READ COMMITTED isolation level.
Replacing SERIALIZABLE improved latency and throughput by decreasing overhead, lowering contention and eliminating retries caused by transaction conflicts.
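On the application side, interpreting the statement is straightforward. A simplified sketch, assuming a DB-API connection such as psycopg2's (the insert path for a brand-new (messageId, serviceId) key is omitted here):

CAS_SQL = """
    UPDATE dedup_state
       SET hashed_message = %(new_hash)s
     WHERE message_id = %(message_id)s
       AND service_id = %(service_id)s
       AND hashed_message IS DISTINCT FROM %(new_hash)s
"""

def is_duplicate(conn, message_id: str, service_id: str, new_hash: str) -> bool:
    # Under READ COMMITTED the UPDATE takes a row-level lock, so concurrent
    # requests for the same key serialise on that row: the compare and the swap
    # stay atomic.
    with conn.cursor() as cur:
        cur.execute(CAS_SQL, {"message_id": message_id,
                              "service_id": service_id,
                              "new_hash": new_hash})
        # No row updated -> the stored hash already equals new_hash -> duplicate.
        # (A key with no row yet also matches nothing and needs a separate insert.)
        return cur.rowcount == 0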
Faster Structural Comparisons
JSON comparisons were on the hot path: every request depended on them, so improving them would have a system-wide impact.
The initial comparison was built with an abstraction-rich, allocation-heavy implementation.
A memory-efficient recursive tree walk reduced memory allocation, garbage collection pressure, GC-induced stalls, and, consequently, tail latency.
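The details are stack-specific, but the shape is roughly this: a single recursive walk that feeds the digest incrementally instead of materialising intermediate representations of the whole tree (illustrative sketch):

import hashlib
import json

def _walk(value, h) -> None:
    # Visit the parsed tree in place, streaming bytes into the hash as we go.
    if isinstance(value, dict):
        h.update(b"{")
        for key in sorted(value):  # field order must not matter
            h.update(json.dumps(key).encode("utf-8"))
            h.update(b":")
            _walk(value[key], h)
            h.update(b",")
        h.update(b"}")
    elif isinstance(value, list):
        h.update(b"[")
        for item in value:
            _walk(item, h)
            h.update(b",")
        h.update(b"]")
    else:
        h.update(json.dumps(value).encode("utf-8"))

def streaming_tree_hash(tree) -> str:
    h = hashlib.sha256()
    _walk(tree, h)
    return h.hexdigest()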
Infrastructure Overview
This is how the pieces fit together:
- Horizontally scalable API layer behind a load balancer.
- Stateless application nodes, with state stored in the database, allowing ephemeral, replaceable web servers.
- Graceful degradation under load: unbounded request queues meant slower responses under heavy load instead of dropped requests.
- Single-region deployment aligned with our business constraints. Global linearizability in a multi-region setup would have required a fundamentally different approach (e.g. distributed SQL database).
Observability
We used Prometheus and Grafana to track:
- Latency, globally and per service
- Request volumes, globally and per service
- Duplicate rates
- System-level metrics.
This visibility helped us to identify the biggest sources of duplicates and eliminate unnecessary upstream noise.
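The per-service breakdown is just a matter of labels. For illustration (using the Python Prometheus client; metric names are made up):

from prometheus_client import Counter, Histogram

DEDUP_REQUESTS = Counter(
    "dedup_requests_total", "Deduplication requests", ["service_id", "result"]
)
DEDUP_LATENCY = Histogram(
    "dedup_request_seconds", "Deduplication request latency", ["service_id"]
)

def record(service_id: str, duplicate: bool, seconds: float) -> None:
    # Labelling by service shows which integrations generate the most traffic
    # and the most duplicates.
    result = "duplicate" if duplicate else "unique"
    DEDUP_REQUESTS.labels(service_id=service_id, result=result).inc()
    DEDUP_LATENCY.labels(service_id=service_id).observe(seconds)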
Single Point of Failure (SPOF)
As a consequence of our architectural decisions, the deduplicator and its PostgreSQL database became a single point of failure.
In case of an outage, the correct behaviour was to halt processing instead of allowing the system to continue with incorrect deduplication.
In such scenarios, alerting and well-defined response procedures are essential for timely recovery and for turning a potentially stressful incident into a clear operational task.
Conclusion
Over several years of operation, the deduplication layer never failed us.
We had concrete results:
- Fewer major incidents
- Improved customer trust
- Lower operational costs
- Stronger delivery guarantees.
Serendipity
Along with the new pipeline and the tooling we built around it, we discovered an additional benefit: the ability to filter unnecessary events during ingestion, before they entered the system.
Many events differed only in fields irrelevant to our domain, such as timestamps. By stripping these out before deduplicating projections, we discarded thousands of redundant updates every day.
Despite the fan-out increasing beyond 20, the number of daily events flowing through the pipeline decreased significantly.
By eliminating redundant updates early, we kept the load on the deduplicator itself under control.
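Conceptually, the ingestion-time filter is tiny (sketch; the field names are made up):

IGNORED_FIELDS = {"last_modified_at", "ingested_at"}  # e.g. timestamps irrelevant to the domain

def strip_irrelevant(event: dict) -> dict:
    # Drop fields that never influence downstream projections, so events
    # differing only in those fields collapse into a single update.
    return {key: value for key, value in event.items() if key not in IGNORED_FIELDS}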
TL;DR
In this pipeline, idempotency wasn’t a transport-level concern but a business one.
Treating it that way shaped how we designed the system and the tradeoffs we were willing to make.

