I was once part of a team working on an integration layer responsible for collecting data from many upstream services and processing it into content made available to multiple external customers according to their requirements.
Our job was to make upstream and external systems work as a single coherent whole, despite having no control over either side. This is a classic setting for emergent chaos, and I think of it as “a place between places”.
The first few integrations we released seemed fine: there was silence, which is usually a sign of success in this kind of work. At the time, we were focused on building new integrations.
But we were about to learn just how much noise errors can make.
The Chaos (as in Deterministic Non-Linear Systems)
After releasing one of our early integrations, we started seeing recurring operational issues and growing frustration. Our services were behaving unpredictably and breaching key expectations daily. We had no alerts and no sense of the severity or frequency of errors. We learned about problems only when customers reported them.
In environments like this, panic can spread quickly. People reach for simple explanations or “obvious” culprits in an attempt to regain a sense of control. There’s usually only one way out of this situation: stay calm, put a tactical solution in place to stop the bleeding, and buy time to find the real cause. Then create a plan and explain the situation to key stakeholders.
This particular integration was more complex than the earlier ones, and small data discrepancies (introduced upstream for “convenience”) had unpredictable and severe consequences. Hundreds of pieces of content were becoming unavailable to customers every day – always exactly ninety days after they were published. We also began seeing severe DoS-like incidents due to missing idempotency across the board. Until then, we had no idea our own services could generate duplicates.
Not only did we have little control over the systems around us, but our own stack had become opaque to us.
The original plan was to fix the bugs we had discovered and continue building new integrations to meet the deadline. But was that even possible?
The New Plan
At a high level, there was a single root problem:
Our stack did not provide the right level of abstraction.
Abstract too little, and your services become another layer of indirection with extra cost and no value. Abstract too much, and it won’t be possible to meet all the diverse requirements of downstream systems.
So we landed on a “simple” solution:
Provide a unified interface that shields developers from upstream complexity and allows us to focus on external integrations.
I spent many days thinking about this interface: a beautiful, coherent API with clear semantics paired with proper update events, enabling us to build scalable, reliable and predictable integrations.
How you pursue a goal like this depends heavily on your organisation and technical landscape. In our case, we realised that the “simple plan” required re-architecting the entire integration pipeline, something no stakeholder wants to hear. We needed a platform.
The New Integration Stack
Once we had the green light for the redesign, we split the new integration stack into two main components: the Platform and the Adapters.
External services never interacted with the platform directly. Instead, each adapter subscribed to the Platform’s new-data-available notification, consumed only the data required for its particular integration, and transformed it according to the rules and constraints of a specific external system.
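To make this concrete, here is a minimal sketch of the adapter contract, assuming the notification carries only an identifier (a signal, not a payload); the names and clients are illustrative, not our actual code:

```python
# Sketch of the adapter contract. The notification is a signal, not a payload;
# names (NewDataAvailable, fetch_view, publish) are illustrative.
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class NewDataAvailable:
    entity_id: str     # which entity changed upstream
    occurred_at: str   # when the Platform observed the change


class Adapter(Protocol):
    def handle(self, notification: NewDataAvailable) -> None:
        """Fetch the view this integration needs and push it downstream."""


class ExampleExternalSystemAdapter:
    def __init__(self, platform_client, external_client):
        self.platform = platform_client   # queries the Platform's API
        self.external = external_client   # talks to one external system

    def handle(self, notification: NewDataAvailable) -> None:
        # Consume only the data required for this particular integration...
        view = self.platform.fetch_view(notification.entity_id)
        # ...and transform it according to that system's rules and constraints.
        self.external.publish(self.transform(view))

    def transform(self, view: dict) -> dict:
        return {"id": view["id"], "title": view["title"]}
```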
This setup brought several benefits:
- Focus: Devs could concentrate entirely on the requirements of a single external system.
- Isolation: Integration logic was fully contained within the service responsible for that system.
- Scalability: Each adapter could scale up or down independently according to the throughput demands of the system it supported.
- Adaptability: As a consequence of independent scalability, we could limit the throughput of specific integrations to avoid overwhelming downstream services.
- Operational Tooling: Each adapter became the natural place to build tools to address that system’s recurring operational issues.
- Speed of Development: More developers could work in parallel on new integrations without stepping on each other’s toes.
The main challenge with this approach was balancing innovation with consistency across adapters. A small group of developers maintained more than twenty services, so consistency in design and implementation was essential. We could not afford to let teams reinvent solutions in isolated ways that ignored valuable lessons learned elsewhere in the system.
Adapters required a reliable central source of all the data they depended on, delivered with reasonable latency, correct new-data-available notifications (signals, not payloads), and consistent behaviour. The Platform offered that and more.
Enter the Platform
Chaos had become routine. Missed deliveries, costly duplicates, a frozen stack: all silently building up from the unlucky combination of tiny, isolated events compounding over time, amplified by a total absence of observability.
The upstream services were owned by different teams and reflected different business domains. Each was part of a non-trivial pipeline and had its own upstream dependencies. They behaved unpredictably at times, produced conflicting updates, and generated thousands of duplicate events every day. An environment like this is not just noisy; it’s adversarial.
To bring order to this landscape, we needed a layer capable of collecting data from all these disparate sources and exposing a coherent view through a unified, versioned API. Key to achieving those goals was splitting the Platform into two main components: Ingestion and API Layer.
Ingestion:
We needed high-throughput ingestion while preserving event ordering, so we defined a deterministic pipeline:
consume event -> pre-process -> de-duplicate -> decode into domain events -> process -> store
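A minimal sketch of how those stages compose, with in-memory stand-ins for the de-duplication and storage layers (all names are illustrative, not our production code):

```python
# Sketch of the deterministic ingestion pipeline. The dedup set and the store
# are in-memory stand-ins for what would be durable storage in production.
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainEvent:
    source: str
    entity_id: str
    kind: str
    payload: dict


class IngestionPipeline:
    def __init__(self):
        self.seen_event_ids = set()   # de-duplication state
        self.store = {}               # queryable storage keyed by entity

    def consume(self, raw: bytes) -> None:
        event = self.pre_process(raw)
        if event is None:
            return                                   # early anomaly detection
        if event["event_id"] in self.seen_event_ids:
            return                                   # de-duplicate
        self.seen_event_ids.add(event["event_id"])
        domain_event = self.decode(event)            # decode into domain events
        self.process(domain_event)                   # process
        self.store[domain_event.entity_id] = domain_event  # store

    def pre_process(self, raw: bytes):
        # Discard malformed or inconsistent data before it spreads downstream.
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            return None
        required = {"event_id", "source", "entity_id", "kind", "payload"}
        return event if required <= event.keys() else None

    def decode(self, event: dict) -> DomainEvent:
        return DomainEvent(event["source"], event["entity_id"],
                           event["kind"], event["payload"])

    def process(self, event: DomainEvent) -> None:
        # Domain-specific handling (merging partial updates, etc.) would go here.
        pass
```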
This structure unlocked several capabilities:
- Early anomaly detection: Discard malformed or inconsistent data before it entered the Integration stack.
- Coherent Integration view: Decode data into a unified representation of the broader business domain, forming a clear contract between Upstream -> Integration -> External Systems.
- Stable downstream consumption: Data was stored in queryable form, protecting adapters from upstream instabilities.
- Source-level insights: Track and analyse events by source to understand upstream behaviour.
- Extended diagnostic tooling: Persist raw events and build tools around them to support analysis and troubleshooting.
- Flow-event reduction: Use heuristics to suppress low-value updates and avoid unnecessary cascades through the platform.
We used Kafka – combined with a carefully chosen partitioning strategy – to achieve high-throughput parallel processing and preserve event ordering for each upstream source. This allowed us to deliver correct, time-sensitive content in a deterministic way.
By configuring each topic with compaction and infinite retention, we enabled log-based replication. This gave us the ability to replay the entire pipeline deterministically, a massive win in terms of reliability.
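For illustration only, a sketch of the kind of topic configuration and keyed production this implies, using the kafka-python client with placeholder broker and topic names, and assuming the message key encodes the upstream source:

```python
# Sketch: a compacted, never-expiring topic plus keyed production, so events
# from one upstream source hash to the same partition and keep their order.
# Broker addresses, topic name and key are placeholders.
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="upstream.catalogue.events",
        num_partitions=12,
        replication_factor=3,
        topic_configs={
            "cleanup.policy": "compact",  # compaction keeps the latest value per key
            "retention.ms": "-1",         # never expire segments by time
        },
    )
])

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=str.encode,
)
# The default partitioner hashes the key, so all events for "source-42"
# land on a single partition and are consumed in order.
producer.send("upstream.catalogue.events", key="source-42", value='{"entity_id": "abc"}')
producer.flush()
```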
API Layer:
Downstream teams interacted with the Platform through a client library backed by a few GraphQL endpoints. They defined a query describing the data their adapter needed in order to produce content in accordance with system contracts.
The client library brought several benefits: type-safety, auto-completion and versioning.
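As a rough illustration of the model (the endpoint, schema and field names here are hypothetical, and the example uses the Python gql client rather than our actual library):

```python
# Sketch of an adapter declaring exactly the data it needs from the Platform.
# Endpoint, schema and fields are hypothetical.
from gql import Client, gql
from gql.transport.requests import RequestsHTTPTransport

transport = RequestsHTTPTransport(url="https://platform.internal/graphql")
client = Client(transport=transport, fetch_schema_from_transport=True)

CONTENT_VIEW = gql("""
    query ContentView($entityId: ID!) {
      content(id: $entityId) {
        id
        title
        availabilityWindow {
          start
          end
        }
      }
    }
""")

# Triggered by a new-data-available notification for a given entity.
view = client.execute(CONTENT_VIEW, variable_values={"entityId": "abc-123"})
```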
This API Layer model was heavily challenged, and I used the lens of category theory morphisms to help me clarify our approach: building a structure-preserving projection from the full platform domain model to a minimal, integration-specific view required by each external system.
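A toy version of that idea, with invented types: the projection keeps only the fields one integration needs, and the same platform entity always maps to the same view.

```python
# Toy sketch of a structure-preserving projection from the full platform
# domain model to one integration-specific view. Types and fields are invented.
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformContent:            # the full domain model held by the Platform
    id: str
    title: str
    synopsis: str
    territories: tuple[str, ...]
    internal_flags: tuple[str, ...]


@dataclass(frozen=True)
class ExternalSystemAView:        # the minimal view one external system needs
    id: str
    title: str
    territories: tuple[str, ...]


def project(content: PlatformContent) -> ExternalSystemAView:
    # Deterministic: the same PlatformContent always yields the same view,
    # so re-sending after a failure cannot produce a different payload.
    return ExternalSystemAView(
        id=content.id,
        title=content.title,
        territories=content.territories,
    )
```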
The benefits:
- Adapters received exactly the data they needed and nothing more.
- A lossless, deterministic view of the model, enabling safe retries.
- Multiple views of the domain without fragmenting the ingestion model.
The tradeoff with this approach was a decent, but not “light speed”, latency. Beyond the network call, other factors added to the overhead: operational tasks embedded in the library, the GraphQL layer translating queries to SQL, and PostgreSQL executing them. This was acceptable in our case: throughput mattered far more than single-request latency.
Idempotency
The Platform also integrated well with our new deduplication layer. A single duplicate event (a “spark”) could trigger costly downstream work with potentially exponential consequences (“fire”).
The challenge was that duplicates were semantic, not transport-level. Different external systems (via adapters) consumed different projections of the same underlying entity, meaning that different source updates could collapse into the same business update downstream.
We needed idempotency, not as a transport concern, but as a business outcome.
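One way to make that concrete (a sketch, not our exact mechanism) is to derive the idempotency key from the projected business payload itself, so that different upstream updates which collapse into the same downstream payload are recognised as duplicates:

```python
# Sketch of business-level idempotency: hash the canonical form of the
# projected payload rather than relying on any transport-level message id.
import hashlib
import json


def idempotency_key(integration: str, payload: dict) -> str:
    # Canonicalise so that key ordering and whitespace cannot change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{integration}:{canonical}".encode()).hexdigest()


class Deduplicator:
    def __init__(self):
        self.delivered = set()    # in production: a durable store, likely with a TTL

    def should_send(self, integration: str, payload: dict) -> bool:
        key = idempotency_key(integration, payload)
        if key in self.delivered:
            return False          # same business update already delivered
        self.delivered.add(key)
        return True
```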
Low-Value Updates
The Platform also revealed another class of issues: low-value events that looked like genuine updates to the integration stack and bypassed the deduplication entirely. This caused expensive operations or even downstream disruptions.
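The suppression heuristic can be sketched in the same spirit (again illustrative, not our exact rules): compare the view produced by the new event with the last view actually delivered, and drop the event when nothing a downstream system cares about has changed.

```python
# Sketch of low-value update suppression: an update only propagates if it
# changes the view a downstream integration would actually receive.
class UpdateFilter:
    def __init__(self, project):
        self.project = project            # integration-specific projection
        self.last_delivered = {}          # entity_id -> last view sent downstream

    def is_meaningful(self, entity_id: str, full_content) -> bool:
        new_view = self.project(full_content)
        if self.last_delivered.get(entity_id) == new_view:
            return False                  # looks like an update, changes nothing
        self.last_delivered[entity_id] = new_view
        return True
```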
The Success and The Day After
New adapters powered by the Platform were released one after another, from relatively simple to profoundly complex. We had built more than a stable platform; we had created a recipe to plug new systems into our stack. Aside from the occasional celebration after a successful launch, there was mostly silence: the sound of “nothing is broken”.
It was a success.
Then came a new flagship product. New services, new people, new stakeholders. I assumed this one would be more straightforward: we had proven we could navigate uncertainties and carve order out of chaos. I was wrong. The new voices carried none of the scars of the long fire we had walked through, yet they spoke confidently. And back to the frontline.
We were a team built to succeed in a bounded context. We never really tame chaos. We only address it. You build a temporary shelter, layer by layer, one challenge after the other. And when the storm passes, another will eventually come. That’s the work.