
Why SOPs Break Without Event-Driven Systems

SOPs describe how humans cope with a process. Event-driven systems describe what actually happened. The translation is where most automation projects die.

/published 19 Feb 2026
/read-time 10 min read
/by Forgequbit Engineering

Every automation project has the same opening scene. Someone pastes a Google Doc SOP into a Slack channel. Someone else says "let's automate this." Six months later, the SOP has drifted, the automation has calcified into something neither fully correct nor fully wrong, and the team has three sources of truth arguing quietly with each other in production.

The root cause is almost never the tool. It is a category error about what an SOP is, and what an automation actually encodes.

SOPs are descriptions. Systems are contracts.

An SOP is a description of how humans currently cope with a process. It's written in natural language. It's full of implicit context, assumed judgement, and exceptions that aren't labelled as exceptions because everyone who reads the document already knows them.

A running system, by contrast, is a contract. Every state is named. Every transition is conditional on something the system can actually inspect. Every branch is exhaustive: the system defines what happens at every step, including the steps the author didn't think to document.

The translation gap (and where projects die)

The gap between an SOP and a production-grade system is roughly this: an SOP has one path and three footnotes. A system has one path and forty edges, and every one of those edges produces a concrete event.

```text
# SOP (natural language)
1. Customer places order.
2. Ops checks address.
3. If address is weird, resolve it.
4. Dispatch to carrier.

# Event model (contract)
order.created
  → address.validated            (ok | ambiguous | invalid)
  → address.resolved             (if ambiguous or invalid)
  → carrier.selected
  → carrier.booking.requested
  → carrier.booking.confirmed    (or .failed)
  → dispatch.handed_over
```

Notice what survives the translation. The happy path is the same. But the system also records that the address was ambiguous, that we resolved it, which carrier was selected, whether the booking failed, and what was handed over. Every one of those events is queryable, replayable, and instrumentable.
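
To make "queryable and replayable" concrete, here is a minimal in-memory sketch in Python. The event names come from the model above. Everything else (the `Event` and `EventLog` classes, the payload fields, the order IDs, the carrier name) is assumed for illustration and does not describe any particular event store.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Event:
    order_id: str
    name: str                                  # e.g. "address.validated"
    payload: dict[str, Any] = field(default_factory=dict)


class EventLog:
    """Append-only list of events; a hypothetical stand-in for a real event store."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def query(self, name: str) -> list[Event]:
        """Return every recorded event with the given name."""
        return [e for e in self._events if e.name == name]

    def replay(self, order_id: str) -> list[str]:
        """Return the event names recorded for one order, in order of arrival."""
        return [e.name for e in self._events if e.order_id == order_id]


log = EventLog()
log.append(Event("A-1", "order.created"))
log.append(Event("A-1", "address.validated", {"result": "ambiguous"}))
log.append(Event("A-1", "address.resolved"))
log.append(Event("A-1", "carrier.selected", {"carrier": "example-carrier"}))
log.append(Event("B-2", "order.created"))
log.append(Event("B-2", "address.validated", {"result": "ok"}))

# Which orders hit the "weird address" footnote from the SOP?
needed_resolution = [
    e.order_id for e in log.query("address.validated") if e.payload["result"] != "ok"
]
```

The SOP's third step ("if address is weird, resolve it") leaves no trace when a human handles it. Here it becomes a question the log can answer.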

Encode it as a state machine#

The most reliable translation tool we use on audits is a state machine sketch. Before a line of automation code is written, we ask: what are the states this unit of work can be in, and what are the allowed transitions?

States
Every named, observable condition the work can be in (e.g. received, validating, waiting_for_human, booked, dispatched, closed).
Transitions
The allowed moves between states, each triggered by an event (e.g. address.resolved moves received → validated).
Guards
Conditions on a transition (e.g. can only move to booked if carrier.booking.confirmed fired).
Side-effects
The concrete actions a transition triggers (label printing, customer notification, ledger entry).
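
The four parts above can be sketched in a few dozen lines of Python. This is an illustrative sketch, not a production design. The `Transition` and `StateMachine` classes, the `result` context key, the `print_label` action, and the exact wiring between states are all assumptions made for the example. Only the state and event names are taken from this article.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Context = dict  # facts the system can inspect at transition time


@dataclass(frozen=True)
class Transition:
    source: str
    event: str
    target: str
    guard: Optional[Callable[[Context], bool]] = None        # None means "always allowed"
    side_effect: Optional[Callable[[Context], None]] = None  # None means "no action"


class StateMachine:
    def __init__(self, transitions: list[Transition], initial: str) -> None:
        self.state = initial
        self._table: dict[tuple[str, str], list[Transition]] = {}
        for t in transitions:
            self._table.setdefault((t.source, t.event), []).append(t)

    def handle(self, event: str, ctx: Context) -> str:
        """Apply the first transition for (state, event) whose guard passes."""
        for t in self._table.get((self.state, event), []):
            if t.guard is None or t.guard(ctx):
                if t.side_effect is not None:
                    t.side_effect(ctx)
                self.state = t.target
                return self.state
        raise ValueError(f"{event!r} is not allowed in state {self.state!r}")


actions: list[str] = []  # stands in for real side-effects such as label printing

TRANSITIONS = [
    Transition("received", "address.validated", "validated",
               guard=lambda ctx: ctx["result"] == "ok"),
    Transition("received", "address.validated", "waiting_for_human",
               guard=lambda ctx: ctx["result"] != "ok"),
    Transition("waiting_for_human", "address.resolved", "validated"),
    Transition("validated", "carrier.booking.confirmed", "booked",
               side_effect=lambda ctx: actions.append("print_label")),
    Transition("booked", "dispatch.handed_over", "dispatched"),
]

order = StateMachine(TRANSITIONS, initial="received")
order.handle("address.validated", {"result": "ambiguous"})  # routes to a human
order.handle("address.resolved", {})
order.handle("carrier.booking.confirmed", {})               # runs the side-effect
```

Any event that has no transition from the current state raises an error instead of being silently ignored. That is the "exhaustive branch" property in miniature.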

This is the model the automation encodes. Not the SOP — the state machine. The SOP is then re-generated from the model, which keeps both in sync.
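
One way to picture "re-generated from the model": attach a human-readable instruction to each transition and render the list as numbered steps. The sketch below assumes a deliberately flat model. The `MODEL` rows, the instruction wording, and `render_sop` are hypothetical, and a real generator would also need to handle branches and exceptions.

```python
# Hypothetical model rows: (source state, event, target state, instruction for humans).
MODEL = [
    ("received", "address.validated", "validated",
     "Check the delivery address."),
    ("validated", "carrier.booking.confirmed", "booked",
     "Book the carrier and wait for confirmation."),
    ("booked", "dispatch.handed_over", "dispatched",
     "Hand the parcel over to the carrier."),
]


def render_sop(model: list[tuple[str, str, str, str]]) -> str:
    """Render one numbered SOP step per transition, in model order."""
    lines = []
    for step, (source, event, target, instruction) in enumerate(model, start=1):
        lines.append(f"{step}. {instruction} [{source} -> {target} on {event}]")
    return "\n".join(lines)


sop = render_sop(MODEL)
print(sop)
```

Because the document is derived from the model, editing a transition and re-rendering updates the SOP in the same change.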

Stop automating SOPs. Start modelling them.

If there is one shift that separates the automation projects that hold their value from the ones that silently decay, it is this: the projects that last don't automate the SOP. They model the underlying process, generate the SOP from the model, and leave the model as the source of truth.

The SOP is the README. The event model is the code. You don't deploy READMEs to production.

Every subsequent change flows through the model. New exception? Add a state. Policy change? Update a guard. Compliance requirement? Emit an additional event. The system stays consistent with reality because reality and the system are described in the same language.
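
When changes flow through the model, the model can also be checked mechanically on each change. The sketch below assumes a model held as plain Python data. The `check_model` function, the `booking_failed` state, and the retry edge are hypothetical. The check covers only two things: undeclared states and non-terminal states with no way out.

```python
def check_model(
    states: set[str],
    terminal: set[str],
    transitions: list[tuple[str, str, str]],
) -> list[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    for source, event, target in transitions:
        for s in (source, target):
            if s not in states:
                problems.append(f"undeclared state {s!r} in transition on {event!r}")
    sources = {source for source, _, _ in transitions}
    for s in sorted(states - terminal - sources):
        problems.append(f"state {s!r} has no outgoing transition")
    return problems


states = {"received", "validated", "booked", "dispatched"}
terminal = {"dispatched"}
transitions = [
    ("received", "address.validated", "validated"),
    ("validated", "carrier.booking.confirmed", "booked"),
    ("booked", "dispatch.handed_over", "dispatched"),
]

# New exception: bookings can fail. Add the state first...
states.add("booking_failed")
before = check_model(states, terminal, transitions)

# ...then the edges into and out of it (here: pick another carrier and retry).
transitions += [
    ("validated", "carrier.booking.failed", "booking_failed"),
    ("booking_failed", "carrier.selected", "validated"),
]
after = check_model(states, terminal, transitions)
```

Here `before` reports that `booking_failed` is a dead end, and `after` is empty once the retry edge exists. A half-finished change is caught in the model rather than in production.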

Teams that make this shift stop arguing about whose SOP is current. Instead they argue about which edges of the state machine deserve more attention. The second conversation compounds. The first one doesn't.

/filed-under Automation Engineering · FQ-04