From Accidental to Intentional Architecture

The momentum of building a product, shaping abstract ideas into something concrete, comes with compromises and limitations. We quickly connect components and take shortcuts to speed up deliveries. This sets the scene for unpredictability.

Even a prototype is composed of several interacting components with growing complexity. That complexity leads to emergent behaviour:

A system-level property that arises from the interaction of simpler components and is not defined by any single component.

Emergent behaviour can be neutral or positive. Here, I’m specifically interested in emergent failure modes – undefined or hard to enforce invariants, scalability issues, and unexpected runtime errors.

Architectural pressure is the force cracking the design in a specific place leading to failure modes. By recognising such pressure, we feel the drive to evolve the architecture.

It signals that requirements have outgrown current design and highlights the necessary next steps. We discover these cracks through incidents or by deliberately pushing on areas that seem sensitive.

This essay uses the initial design of a web crawler to demonstrate emergent failure modes as a concrete phenomenon, and how recognising architectural pressure can significantly expand capabilities.

Part 1. Emergent Behaviour

Meet the Crawler

I need a crawler to search for signals through changes in specific web pages. For example, new trends in software engineering.

The first version relied on an Orchestrator to drive the entire flow: initiates crawling through a seed URL, fetches the web page, process the page URLs, and only enqueue supported, normalised, unseen URLs.

Pages are processed concurrently. Each crawled page is processed sequentially as described in the diagram above.

The process function implements the pipeline above.

▶ Orchestrator#process (click to expand for simplified Scala code)

			
private def process(
  queue: Queue[F, Uri], inflightWork: Ref[F, Long]
)(url: Uri): F[Unit] =
  // enqueue URL and track in-flight work
  def enqueue(url: Uri) =
    inflightWork.update(_ + 1) *> queue.offer(url)
  (for
    // fetch webpage
    htmlPage <- fetcher.fetch(url)
    // extract urls from the page
    pageLinks <- linkExtractor.extract(htmlPage)
    // emit all page urls to stdout
    _ <- emitter.emit(pageLinks)
    acceptedLinks = pageLinks.links.filter(urlFilter.accept)
    // canonicalise urls
    canonicalisedLinks = acceptedLinks.map(UrlNormaliser.canonicalise)
    // deduplicate urls
    deduplicatedLinks <- canonicalisedLinks.toVector.filterA(link =>
      deduplicator.hasSeen(link).map(!_)
    )
    // add all valid newly discovered urls to the queue
    _ <- deduplicatedLinks.traverse_(
      enqueue
    )
  yield ())
    .guarantee {
      // decrease in-flight work counter
      inflightWork.update(_ - 1)
    }

		

There are 3 fundamental components powering the crawler:

A bounded URL queue preventing OOM
Fibers supporting cooperative multi-tasking
Orchestrator defining the data flow and key invariants such as termination.

Emergent Design Pressure

All components work as expected locally:

The queue blocks operations when full
Fibers enable concurrent page processing without blocking threads
Orchestrator composes the dataflow and makes sure termination is correct.

Observe the failure mode:

Orchestrator discoveries many URLs for each URL it processes
The bounded queue reaches capacity
Each Orchestrator concurrent task finishes processing the page
Since the queue is full, in-flight tasks are suspended when enqueuing newly discovered URLs
Since in-flight Orchestrator tasks are suspended, they cannot dequeue URLs
Termination depends on completing all in-flight work
Result: deadlock. The crawler is unable to make progress or terminate.

Components are locally sound. Their interaction leads to an undesirable global property: the application may stall with very high page URLs fan-out.

There is a missing invariant: the crawler must always be able to make progress regardless of the load.

The liveness issue is an accidental global property. It is an architectural pressure: a signal that the crawler design doesn’t meet requirements. It needs immediate attention.

The Source of Tension

That tension originates from the initial, deliberate coupling between control-flow and data-flow in the Orchestrator.

That coupling also makes it very difficult to implement politeness and fairness.

These limitations were accepted for a controlled prototype. The deadlock was not – I stumbled upon the liveness issue.

Part 2. Conscious Architectural Evolution (From Emergence to Intent)

Planning Next Steps Guided By Pressure

The liveness issue and the difficulty in implementing key features are direct consequences of the current design: the Orchestrator couples work admission, control and data processing in a single component.

These pressures point to the next architectural evolutionary steps:

Convert the Orchestrator into a Worker, responsible only for processing URLs
Introduce Ingress to admit URLs into the system
Admission must never stall Workers (it may block on I/O)
Control URL dispatch with a Scheduler.

▶ Scheduler (algebra)

			
trait Scheduler[F[_]]:
  def dispatch: fs2.Stream[F, Uri]

▶ Ingress (algebra)

			
trait Ingress[F[_], T]:
  def publish(events: T): F[Unit]
  def publish(events: Vector[T]): F[Unit]
  def stream: fs2.Stream[F, T]

The initial control policy is simple: the Scheduler dispatches URLs to workers at a steady rate.

The Ingress sole responsibility is transport: it’s the entry point in the system and has well-defined semantics:

Idempotently admit URLs into the system
Expose a stream where each admitted URLs is emitted exactly once.

The diagram below demonstrates the revised architecture:

▶ FifoScheduler#dispatch

			
class FifoScheduler[F[_]: Temporal] private (
  source: fs2.Stream[F, Uri], // from Ingress
  dispatchInterval: FiniteDuration // rate limiting work admission
) extends Scheduler[F]:
  def dispatch: fs2.Stream[F, Uri] =
    source.metered(dispatchInterval)

		

▶ Ingress.makeQueuedIngress (in-memory)

			
object Ingress:
  def makeQueuedIngress[F[_]: Temporal, T](
    queue: Queue[F, T] // unbounded queue
  ): Ingress[F, T] =
    new Ingress[F, T]:
      override def publish(event: T): F[Unit] =
        queue.offer(event)
      override def publish(events: Vector[T]): F[Unit] =
        events.traverse_(queue.offer)
      override def stream: fs2.Stream[F, T] =
        fs2.Stream.fromQueueUnterminated(queue)

		

▶ Worker#process

			
// Unbounded `targets` stream replaces queue
class Worker[F[_]: Async](targets: fs2.Stream[F, Uri], . . .):
  def run = targets.parEvalMap(maxConcurrency)(process)
  private def process(url: Uri): F[Unit] =
    def publish(newlyDiscovered: Vector[Uri]) =
      tracker.track(newlyDiscovered.size) *>
        ingress.publish(newlyDiscovered)
    (for
      htmlPage <- fetcher.fetch(url)
      pageLinks <- linkExtractor.extract(htmlPage)
      _ <- emitter.emit(pageLinks)
      acceptedLinks = pageLinks.links.filter(urlFilter.accept)
      canonicalisedLinks = acceptedLinks.map(UrlNormaliser.canonicalise)
      deduplicatedLinks <- canonicalisedLinks.toVector.filterA(link => deduplicator.hasSeen(link).map(!_))
      // Publish new links to Ingress instead of offering to queue directly
      _ <- publish(deduplicatedLinks)
    yield ())
      .guarantee {
        tracker.completed()
      }

		

These changes fix the liveness issue and unlock new options. They do not provide fairness or politeness yet, but introduce a central point where these new policies can be defined.

They also introduce a new pressure.

Pressure Shifts

In the original design, the Orchestrator both produced and buffered URLs using a single bounded queue.

Now, Workers immediately publish discovered URLs. The Ingress emits a stream of admitted URLs, and the Scheduler dispatches them to Workers at a controlled frequency. Because most webpages contain many links, Workers produce URLs far faster than they can process.

The in-memory Ingress implementation relies on an unbounded queue to hold pending URLs. This removes the deadlock, but shifts the pressure elsewhere: newly-discovered URLs grow without bounds, potentially leading to OOM.

The redesign does not eliminate the risks created by the in-memory queue. It isolates the risks.

If memory-growth becomes a problem, the in-memory Ingress can be swapped for a persistent implementation.

▶ PostgresIngress (durable)

			
object PostgresIngress:
  def make[F[_]](): Resource[F, Ingress[ConnectionIO, Uri]] =
    Resource.pure(new PostgresIngress())
private class PostgresIngress private () extends Ingress[ConnectionIO, Uri]:
  override def publish(url: Uri): ConnectionIO[Unit] =
    Statements.insertUrl(url)
  override def publish(urls: Vector[Uri]): ConnectionIO[Unit] =
    Statements.insertUrls(urls)
  override def stream: fs2.Stream[ConnectionIO, Uri] =
    fs2.Stream.repeatEval(Statements.dequeueUrl).unNone

		

Since the architecture is defined by contracts, the other parts of the system remain unchanged.

The in-memory Ingress still provides several benefits:

Extremely high throughput
No external dependencies
Fully non-blocking operations.

Keeping both implementations provides options for different execution environments: simple, fast, and riskier vs. safer, slightly slower and operationally heavier.

The crawler evolution thus far:

			
Prototype
   ↓
Deadlock
   ↓
Admission (In-memory ingress)
   ↓
Control / URL processing separation
   ↓
Optional durable ingress.

		

Part 3. Politeness & Fairness (A Gentle and Effective Crawler)

As it stands, the crawler can only be used under tight control, for example by restricting it to a single domain. Otherwise it risks behaving as a DoS tool.

To unlock its full potential – extracting meaningful signal from a diverse set of webpages – it must implement politeness and fairness.

Politeness requires an interval between requests to the same domain.

Fairness preserves throughput. While waiting for the cooldown period of one domain, the crawler can continue to make progress by fetching URLs from others.

With a clear separation between admission, control and processing, supporting the new control policies can be achieved by swapping the current FifoScheduler for a more involved Coordinator. The other parts of the crawler remain intact.

Enter the Coordinator

The Coordinator uses a priority queue of DomainQueues to find the next eligible URL
A DomainQueue is initialised during admission, when the first URL for a domain arrives
The priority queue keeps the DomainQueue with the earliest next-eligible URL at the head, from which the next URL is taken
After dispatching a URL, the Coordinator advances the domain’s eligibility time by the cooldown period.

Since each domain has its own queue, hot domains do not starve lower-traffic ones.

Internally, a TreeSet ordered by (nextEligibilityAt, domain) provides priority queue semantics – preferred over a binary heap since reordering requires efficient removal, which TreeSet can handle in O(log n).

The Coordinator – the new Scheduler implementation – is more involved than the previous snippets. Both the source and the spec are available on GitHub.

The pressure that initially signaled a broken design has, in the end, helped shape a better one.

With a few remaining features in place, the crawler will reach a point where I can shift the focus from building a crawler to extracting signal from the vast web.

Reflections

In practice, the most careful and competent engineer will be surprised and think “How haven’t I seen this before?” at some point. And that is the work.

Intelligence, creativity and sophistication often compound the problem, particularly at a system-level. Early adoption of technologies, elaborate interactions between components and abstractions lead to unjustified complexity when not strongly grounded on actual necessity. This often translates into fragility. Boring and battle-tested is good.

A simple heuristic is to start simple and address emergent failure modes as they appear, identifying and isolating the architectural pressure as early as possible, using it to inform deliberate architectural evolution.

Missing invariants may be inevitable. The architectural pressure will surface them. Let it guide the next evolutionary step.

Transaction Boundaries Are a Business Concept

In most systems, use cases live in the application layer, where services orchestrate domain operations and side-effects such as persistence, messaging, or external calls.

It’s also common for the business to require locally atomic operations: they must succeed or fail together, preserving state invariants in the presence of failure.

How to implement transactions is an infrastructure concern. Defining transaction boundaries is a business requirement.

In the following sections, I will present a technique that leverages the type system to clearly define and make transaction boundaries explicit in code. Concerns such as isolation and durability remain infrastructure responsibilities.

This results in:

Compositionality: Build complex transactions from simpler ones.
Testability: Verify atomicity without relying on infrastructure.
Compile-time enforcement: Transactional programs cannot run unless explicitly committed.
Safe Refactoring: Reduce the chances of incorrectly assuming operations are atomic.

Together, these properties lead to clearer, refactoring-safe, and better-tested codebases.

This approach is well suited when the ambiguities of complex business requirements and failure modes outweigh the conveniences of mainstream, battle-tested practices, such as annotating types and methods with @Transactional.

The examples in this article are implemented in Scala, but the approach itself is not Scala-specific. The key is a strong type system and the ability to describe programs as values, separating the description of a program from its execution.

The code snippets presented here are intentionally simplified to highlight the transactional boundaries and failure modes. They omit part of the testing setup and simplify use cases and assertions. Actual implementations for readers who want to dive deeper are linked at the end.

Where is the transaction boundary?

Suppose that we need to implement a ‘create account’ use case consisting of:

generating credentials
persisting the new account in the storage
granting the user access.

From a business perspective, these steps must be atomic. We could start defining an API like:

trait UsersManager[F[_]]:
  def createAccount(username: Username, password: PlainPassword): F[UserId]

F[_] allows us to abstract over the effect type such as side-effect, state, or failure. In practice, it will be replaced by IO or equivalent in production. And IO[UserId] is a description of a computation that produces side-effects: storing data, in this case.

One possible implementation is to sequentially store the new account and user permissions after generating credentials.

      override def createAccount(username: Username, password: PlainPassword): F[UserId] =
        val credentials = . . . // pure domain logic

        val setUpAccount = for
          userId <- store.createAccount(credentials)
          _ <- accessControl.grantAccess(userId, userId, ContextualRole.Owner)
        yield userId

The code is sequential and easy to follow. But where is the transaction?

Perhaps store.createAccount starts its own transaction?
Perhaps accessControl.grantAccess does the same?
Or maybe there is an implicit transaction scope that will be executed at runtime with a framework?
Or there is no transaction at all?

The conveniences of implicit transaction boundaries via frameworks come with tradeoffs:

Business transaction boundaries don’t compose well. This often leads to an explosion of specialised Service or Repository functions, one for each atomic use case.
Refactoring can silently break atomicity.
Verifying transactional guarantees depends on infrastructure (integration tests).
Type signature conveys no information about transactional tasks.

Improving Atomicity Correctness Guarantees

The goal is to be unequivocal on where a transaction starts and ends.

Txn defines the operations that must be performed atomically. It conveys the same intent as operations annotated with @Transactional, but it leads to clear, local and enforced by the type system boundaries.

      override def createAccount(username: Username, password: PlainPassword, globalRole: Option[GlobalRole]): F[UserId] =
        // Pre-transaction: pure domain logic
        val credentials = . . .

        // Atomic operations
        val setUpAccount: Txn[UserId] = for
          creds <- tx.lift { credentials } // Embed a pure computation into a transactional context
          userId <- store.createAccount(creds) // Storing capability via Port/Repository
          _ <- accessControl.grantAccess(userId, userId, ContextualRole.Owner) // Same as above
        yield userId

        // Signal a commit, no side-effects yet.
        tx.commit { setUpAccount }

Both F and Txn are descriptions of computations: F is the runtime, the final work performed by the program. Txn describes which operations must be executed in sequence and atomically within a transaction. Both are values that need to be interpreted in order to be executed.

Crucially, commit must be invoked in order to obtain F[UserId] from Txn[UserId]. Otherwise the code won’t compile.

The Core Abstraction

The transactional behaviour is captured by a minimal API:

trait TransactionManager[F[_], Txn[_]]:

  val lift: [A] => F[A] => Txn[A]

  val commit: [A] => Txn[A] => F[A]

The TransactionManager API exposes two key operations:

lift embeds non-transactional operations into a transactional context
commit signals that the transaction must be committed and rolled back in case of an error, transforming it into an executable operation.

Embedding non-transactional code into a transactional context

Any non-transactional operation required to run within a transaction must be explicitly embedded into a Txn context via lift:

tx.lift { clock.realTimeInstant }

This is a key feature:

Developers must acknowledge that such operations need special handling in case the transaction is rolled back.

Verifying Atomicity Under Failure

The happy path is straightforward: when all operations succeed, the state remains consistent by construction. The more interesting case is when there is a failure midway through a transaction.

The test below verifies our business invariant: when granting permissions fails after account creation succeeds, no account is persisted.

test("user account is not created when granting permission fails"):
    forAllF { (username: Username, password: PlainPassword) =>
      for
        // given
        (usersStore, storeRef) <- makeEmptyUsersStore()            // account creation succeeds
        failingAccessControl  = makeUnreliableAccessControl() // granting permission fails
        tx = makeTransactionManager(List(storeRef))                  // test-specific transaction manager
        usersManager = UsersManager.make[IO, IO](usersStore, failingAccessControl, tx, ...)

        // when
        _ <- usersManager.createAccount(username, password).attempt

        // then
        account <- usersStore.fetchAccount(username)
      yield assert(account.isEmpty) // no partial update
    }

We can test atomicity without a real database by using:

An alternative execution strategy for the transactional program (Txn[_])
In-memory implementations of Ports/Repositories.

These in-memory Port components are not mocks or stubs: they preserve the same semantics as production components, differing only in how state is stored (in-memory vs. database).

The Execution Strategy for Testing

In production, commit executes the transactional program using a database-backed transaction manager (e.g. via Doobie).

For unit tests, we can provide a different execution strategy that simulates transaction boundaries using in-memory state.

  def makeTransactionManager(refs: List[TxRef[?]]): TransactionManager[IO, IO] =
    new TransactionManager[IO, IO]:
    
      override val commit: [A] => IO[A] => IO[A] = [A] =>
        (action: IO[A]) =>
          action.attempt.flatMap {
            case Right(a) => refs.traverse_(_.commit) *> IO.pure(a)
            case Left(e) =>   refs.traverse_(_.rollback) *> IO.raiseError(e)
        }
      . . .

Semantically, this version of the TransactionManager:

Stages state changes while the transactional program runs
Commits all staged changes on success
Rolls them back entirely on failure.

What This Simulation Provides (And Doesn’t)

The test execution strategy relies on stateful components (such as TxRef) to support staging, commit and rollback operations.

This is not a full transaction implementation and is intentionally simpler than database-backed transactions. The testing machinery:

Does not provide isolation, concurrency guarantees or durability
Supports independent unit tests that verify atomic business invariants under failure modes.

Composition

Transaction boundaries are explicit and local, emerging from how Txn[A] transaction programs are composed:

Composing multiple Txn values and committing them together results in a single transaction.
Committing Txn values separately results in multiple transactions.
Omitting commit prevents execution and results in compilation error, since Txn must be converted to F via commit.

This compile-time composition model makes runtime transaction propagation intentionally unnecessary.

Tradeoffs and Considerations

The technique discussed here follows a broader design approach: describing programs as values, separating intent from execution.

When applied to business-level atomicity, this approach gives you:

Clear, unambiguous behaviour: Trading framework convenience for explicit intent improves correctness guarantees.
Improved safety: Invalid transactional scopes fail at compilation time and prevent latent runtime inconsistencies.
Deterministic unit testing: Business invariants can be verified without relying on databases or transactions middleware.
Composable, refactoring-safe code: Locality leads to a modular design that is easier to reason about and evolve.

What it costs:

Shifted responsibility: Developers take ownership of concerns often delegated to frameworks (transaction scoping, propagation, rollback semantics).
Steeper learning curve: Teams unfamiliar with effect systems or algebraic APIs may experience increased cognitive load and reduced productivity in the short to mid term.
Additional machinery: Implementing a TransactionManager, in-memory versions of Ports/Repositories, and different execution strategies demands design and maintenance.

When it is a fit:

The domain exhibits multiple known failure modes.
State inconsistencies are business-critical and unacceptable.
Correctness must be preserved as the system evolves.
The “program as values” paradigm is beneficial elsewhere in the system.

When it is not:

Teams heavily rely on framework conventions for productivity.
Transactional requirements are simple and unlikely to evolve.
Strong discipline and extensive testing compensates implicitness.

Tradeoffs may or may not be acceptable. What’s important is awareness, deliberate reasoning, making an informed decision and owning the consequences. Being aware of alternatives is fundamental when choosing an approach that is aligned with the business and the team’s preferences.

Notes on Software Design

Reflections from the field

Tag: Scala