§ Pattern · May 5, 2026 · 11 min

The kill switch pattern, and why your CIO needs to know where it is at 3 AM

A production agent without a kill switch is not a production agent. The architecture pattern, the team protocol, and the failure modes a kill switch must handle.


"You need a kill switch. And you need someone who knows how to use it. The CIO should know where that kill switch is, and multiple people should know where it is if it goes sideways."

That is John Bruggeman, CISO at CBTS, in a recent interview about AI operational risk. He is right. He is also describing a control that almost nobody in the agent ecosystem has actually built.

This is the pattern, end to end.

§01 — What a kill switch actually is

A kill switch is not a button on a dashboard. It is not "ssh into the server and kill the process." It is a runtime control that satisfies four properties.

Atomic. One action stops the agent. Not a sequence of seven steps. Not a Slack message to engineering. One action.

Authoritative. When the kill switch fires, the agent stops. There is no negotiation, no graceful degradation, no "let me finish this turn first." The runtime forcibly terminates the agent's loop and revokes its credentials.

Observable. Everyone who needs to know that the kill switch fired, knows. The on-call rotation, the SRE channel, the audit log, the affected downstream systems. No silent kills.

Reversible at the system level, not the agent level. Pulling the switch should not corrupt the data the agent was writing to. The runtime needs to handle in-flight transactions cleanly, either committing them with a marker or rolling them back to a known-good state.

If your kill switch does not satisfy all four of these, it is not a kill switch. It is a stop button on a vending machine.
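To make the four properties concrete, here is a minimal sketch in Python. Every name in it (KillSwitch, pull, run_agent, the three callbacks) is hypothetical and stands in for whatever your runtime provides. It is a toy under stated assumptions, not a definitive implementation: a real runtime would enforce the stop from outside the agent process rather than rely on the loop checking a flag.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class KillSwitch:
    """Toy kill switch. All names are illustrative assumptions."""
    revoke_credentials: Callable[[], None]   # hypothetical hook into your secrets manager
    rollback_in_flight: Callable[[], None]   # hypothetical hook into your transaction layer
    notify: Callable[[str], None]            # hypothetical hook into paging / chat
    audit_log: List[str] = field(default_factory=list)
    fired: bool = False

    def pull(self, who: str, reason: str) -> None:
        """Atomic: one call performs every step below."""
        if self.fired:
            return  # a second pull is a no-op, so panicked double-clicks are safe
        self.fired = True              # authoritative: the loop stops at its next check
        self.revoke_credentials()      # authoritative: the agent can no longer act
        self.rollback_in_flight()      # reversible at the system level
        message = f"kill switch pulled by {who}: {reason}"
        self.audit_log.append(message)  # observable: durable record
        self.notify(message)            # observable: humans are told


def run_agent(switch: KillSwitch, steps: List[Callable[[], object]]) -> List[object]:
    """Agent loop that checks the switch before every step."""
    results = []
    for step in steps:
        if switch.fired:
            break  # no "let me finish this turn first"
        results.append(step())
    return results
```

The design choice worth noticing is that pull takes no options. Anything that makes the operator choose between modes at 3 AM works against the atomic property.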

§02 — The three failure modes a kill switch must handle

A kill switch is not just a response to "the agent is doing something I do not like." It is a response to one of three specific failure modes, each with different signals.

Mode 1: goal drift. The agent is technically doing what you asked, but the cumulative effect is no longer what you wanted. The Elloe AI Lab paper this week put hard numbers on this: nearly 90% of tested agents showed measurable goal drift after about 30 steps. The signal is usually a slow degradation in some downstream metric, visible only if you are looking at trend lines rather than individual actions.

Mode 2: tool surface compromise. The agent has been redirected by adversarial input — a memory-poisoning attack, a prompt injection in retrieved content, a malicious tool response. The signal is anomalous tool call patterns, especially calls to destructive or high-privilege tools that the agent does not normally use. This is the failure mode that 94% of memory-retentive agents are vulnerable to, per the same paper.

Mode 3: cascading economic damage. The agent is doing exactly what it was designed to do, at scale, but the unit economics broke and nobody noticed. The classic example is the beverage manufacturer whose vision system kept ordering production runs because new holiday labels confused it — several hundred thousand excess cans before anyone caught it. The signal is volume, not anomaly. The actions look normal individually.

A single trigger design does not handle all three. Mode 1 needs trend-based triggers. Mode 2 needs anomaly detection on tool calls. Mode 3 needs hard rate limits on the tools themselves. You need all three mechanisms.
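The three detection styles can be sketched as three small functions. This is an illustration only: the function names, the seven-sample window, and the 20% drop threshold are assumptions chosen for the example, not values from any of the systems described here.

```python
from statistics import mean
from typing import Iterable, List, Sequence, Set


def drift_detected(metric: Sequence[float], window: int = 7, max_drop: float = 0.2) -> bool:
    """Mode 1: trend check. Compare the latest window's mean with the window before it."""
    if len(metric) < 2 * window:
        return False  # not enough history to judge a trend
    baseline = mean(metric[-2 * window:-window])
    recent = mean(metric[-window:])
    return baseline > 0 and (baseline - recent) / baseline > max_drop


def anomalous_calls(calls: Iterable[str], usual_tools: Set[str]) -> List[str]:
    """Mode 2: anomaly check. Return tool calls outside the agent's usual set."""
    return [call for call in calls if call not in usual_tools]


def over_cap(call_count: int, hard_cap: int) -> bool:
    """Mode 3: volume check. Individual actions look normal, so only the count matters."""
    return call_count > hard_cap
```

Each function looks at a different kind of evidence: a time series, a set membership, and a counter. That is why one trigger cannot stand in for the other two.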

§03 — The implementation, in three layers

Here is how I build this across three of the layers in our four-layer architecture.

Layer I (Miro): the kill paths are mapped. Every agent diagram includes the kill paths as first-class elements. Not as an afterthought in the corner. The whiteboard answers: who can pull the switch? What signals trigger an automatic pull? What state does the system enter after the pull?

Layer II (Notion): the protocol is documented. A SOUL.md section dedicated to kill switch operation. Names of the on-call humans. Slack channel. Phone numbers. The runbook for what to do in the first five minutes after the switch fires. This sounds like security theater. It is not. At 3 AM, when something is wrong and the on-call engineer is half-asleep, the runbook is the difference between a 15-minute incident and a 4-hour incident.

Layer IV (Hermes): the tools enforce it. A kill endpoint on the agent runtime does the four things §01 requires: credential revocation, in-flight transaction handling, downstream notification, and audit logging. On top of that, every destructive tool gets a rate limit with a hard cap that fires automatically when exceeded. The agent does not need a human to pull a switch if its delete_records tool refuses to execute past 100 calls per hour.
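A hard cap of this kind can be sketched as a sliding-window counter that the tool consults before doing any work. The HardCap class, the RateLimitExceeded exception, and the body of delete_records below are hypothetical; only the tool name and the 100-calls-per-hour figure come from the text above.

```python
import time
from collections import deque
from typing import Callable, Deque, Sequence


class RateLimitExceeded(RuntimeError):
    """Raised when a tool is called past its hard cap."""


class HardCap:
    """Sliding-window cap: at most max_calls within any window_seconds span."""

    def __init__(self, max_calls: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so drills and tests can control time
        self._calls: Deque[float] = deque()

    def check(self) -> None:
        now = self.clock()
        # Forget calls that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            raise RateLimitExceeded(
                f"cap of {self.max_calls} calls per {self.window_seconds}s reached"
            )
        self._calls.append(now)


delete_cap = HardCap(max_calls=100, window_seconds=3600)


def delete_records(record_ids: Sequence[int]) -> int:
    """Hypothetical destructive tool. The cap is checked before any work happens."""
    delete_cap.check()
    return len(record_ids)  # placeholder for the real deletion
```

The cap lives inside the tool, not inside the agent's prompt or planner, so a drifting or compromised agent cannot talk its way past it.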

§04 — The team protocol that makes it work

The technical pattern is the easy part. The team protocol is harder.

Three rules from incidents I have lived through.

Rule 1: at least three people know how to pull the switch, and at least two of them are not on the engineering team. The CIO and the head of operations should be on this list. If only engineering can stop the agent, you have a single point of failure that is also the team most likely to be biased toward "let me try one more fix first."

Rule 2: pulling the switch is never blamed. If the on-call engineer pulls it and it turns out to be a false alarm, that is fine. Better to false-positive on a kill than to false-negative on a real incident. Cultures that punish unnecessary kills get fewer kills, including the necessary ones.

Rule 3: the switch is tested monthly. Not in production. In a staging environment with realistic traffic. Every month. If you have not pulled the switch in the last 30 days, you do not know if it works. Fire drills are not optional.

§05 — What this looks like in practice

For our manufacturing client (the Lead Discovery Agent in §01 of /cases), the kill switch is wired into three triggers. A manual pull from the operations dashboard. An automatic pull when outbound message volume exceeds 3x the seven-day average. An automatic pull when any single contact receives more than two messages in a 24-hour window.
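The two automatic triggers reduce to a few lines of arithmetic. The sketch below is an illustration under assumptions, not the client's actual code: the function name, argument shapes, and return strings are invented, and it assumes the check runs before each send with the pending message already counted.

```python
from collections import Counter
from typing import Optional, Sequence


def automatic_trigger(today_volume: int,
                      daily_volumes_7d: Sequence[int],
                      recipients_24h: Sequence[str]) -> Optional[str]:
    """Return the reason to pull the switch, or None if neither trigger applies.

    recipients_24h lists one entry per message in the last 24 hours,
    including the message that is about to be sent.
    """
    if not daily_volumes_7d:
        raise ValueError("need at least one day of history")
    seven_day_average = sum(daily_volumes_7d) / len(daily_volumes_7d)
    if today_volume > 3 * seven_day_average:
        return "volume"
    for contact, count in Counter(recipients_24h).items():
        if count > 2:
            return f"contact:{contact}"
    return None
```

For example, with a seven-day average of 10 messages a day, a day with 31 outbound messages trips the volume trigger, while a day with exactly 30 does not.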

The third trigger fired once in the last six months. A bug in the deduplication logic was about to send the same lead a fourth follow-up. The switch caught it before it sent. The bug was fixed in 90 minutes. The client never noticed.

That is what a working kill switch looks like. Most of the time, it does nothing. The one time it matters, it saves the engagement.

If your production agent does not have one, you are operating without a seatbelt. The crash you are not having yet is not evidence that the seatbelt is unnecessary. It is evidence that you have been lucky.
