The Alert That Was Missed, and Why It Shouldn’t Have Been

Summary

An AI agent detected a billing anomaly. The alert never triggered an incident. The detection logic was fine, but the pipeline from “something’s wrong” to “someone’s on-call” failed silently. The root issue wasn’t a missing tool — it was missing coordination. The team rebuilt their alert flow using events, gaining control over retries, escalation, and acknowledgement tracking.

The problem: detection isn’t enough

The company used AI agents to monitor for fraud and usage anomalies. During a weekend run, one agent correctly identified suspicious billing activity.

But the alert didn’t turn into action. No Slack pings reached the right person. PagerDuty was never triggered. Nobody knew there was an incident until Monday morning. By then, the issue had resulted in more than $100,000 in unbilled overages.

The alert logic worked. The delivery chain didn’t.

What broke (and where it usually breaks)

The pipeline between “agent detects something” and “on-call gets paged” included several moving parts:

  • Alerts had to pass through intermediate services and message queues
  • Logic for retries and escalation was distributed and inconsistent
  • There was no audit trail showing whether or when a notification had failed
  • The same event was sent to Slack and email, but no one owned the end-to-end flow
  • There was no structured feedback loop for acknowledgements or escalation

Incident tools like PagerDuty work well. But they rely on being called reliably. The failure wasn’t downstream. It was in the layers before the signal ever reached the responder.
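
For illustration, the fragile “before” state often boils down to a fire-and-forget notification call buried in agent code. The sketch below is hypothetical (the endpoint and function names are assumptions, not the team’s actual code), but it shows how an alert can vanish without leaving a trace:

```typescript
// Hypothetical "before" glue code: the agent posts directly to a notification endpoint.
async function notifyOnCall(anomaly: { id: string; summary: string }): Promise<void> {
  try {
    await fetch("https://alerts.example.internal/notify", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ id: anomaly.id, summary: anomaly.summary, source: "billing-agent" }),
    });
    // No status check, no retry, no audit record of whether anyone was actually paged.
  } catch {
    // Network error swallowed: the alert simply disappears.
  }
}
```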

What they tried before

The team added:

  • Redundant notifications across Slack, email, and dashboards
  • Scripts to re-send alerts on failure
  • Cron jobs to check whether incidents had been acknowledged
  • Runbooks for escalation

This added noise, not clarity. The system still lacked a single source of truth for alert delivery. There was no consistent state, no retries with backoff, and no way to know if a critical alert had been lost.

The change: alerts became events with state

They rebuilt their alert pipeline using Sailhouse. The idea was simple: treat alerts as events with full lifecycle visibility.

  • The AI agent emitted a single incident_detected event
  • Sailhouse fanned it out to PagerDuty, Slack, and internal dashboards
  • Each destination had its own retry policy and delivery tracking
  • A scheduled follow-up checked for acknowledgement
  • If no one responded in time, a new event escalated the alert

This gave the team a clear flow. Each alert became observable, traceable, and configurable — not just another message in a queue.
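
Here is a minimal sketch of that flow, assuming a thin HTTP-based event client. The publish helper, topic names, and payload shape are illustrative, not Sailhouse’s actual SDK; the point is that the agent publishes one incident_detected event, while fan-out, per-destination retries, and the acknowledgement follow-up live in platform configuration rather than in agent code.

```typescript
// Hypothetical thin event client: one publish call, everything else is platform config.
interface IncidentDetected {
  incidentId: string;
  severity: "low" | "high" | "critical";
  summary: string;
  detectedAt: string; // ISO 8601 timestamp
}

async function publish(topic: string, event: IncidentDetected): Promise<void> {
  const res = await fetch(`https://events.example.internal/topics/${topic}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(event),
  });
  if (!res.ok) {
    // Fail loudly so the caller knows the event was never accepted.
    throw new Error(`publish to ${topic} failed with status ${res.status}`);
  }
}

// The agent emits exactly one event. PagerDuty, Slack, and the internal dashboards
// are subscriptions on the platform side, each with its own retry policy.
await publish("incident_detected", {
  incidentId: "inc-billing-anomaly-001",
  severity: "critical",
  summary: "Suspicious billing activity: projected unbilled overages",
  detectedAt: new Date().toISOString(),
});
```

The important design choice is that the agent’s only responsibility is to publish the event and fail loudly if it can’t; everything after that point is tracked delivery rather than fire-and-forget glue.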

Not a replacement, a reinforcement

Sailhouse didn’t replace PagerDuty. It made sure PagerDuty got triggered, even if the original request failed. It didn’t compete with dashboards. It just ensured incidents didn’t vanish into glue code or human error.

This gave engineers:

  • Visibility into the entire alert path
  • Declarative retry and escalation policies
  • Lightweight coordination logic without a workflow engine
  • Clear ownership of delivery and acknowledgement handling

The alert pipeline became infrastructure, not just a pile of integrations.
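
As a rough illustration of the declarative retry and escalation policies mentioned above, the per-destination settings and the acknowledgement follow-up might look like the sketch below. Every name here (the policy fields, the escalation_needed topic, the acknowledgement check) is an assumption made for the example, not the team’s actual configuration:

```typescript
// Hypothetical per-destination delivery policies, declared once instead of
// scattered as ad-hoc retry loops in glue code.
const deliveryPolicies = {
  pagerduty: { maxAttempts: 5, backoff: "exponential", initialDelayMs: 1_000 },
  slack: { maxAttempts: 3, backoff: "exponential", initialDelayMs: 2_000 },
  dashboard: { maxAttempts: 2, backoff: "fixed", initialDelayMs: 5_000 },
} as const;

// Hypothetical scheduled follow-up: runs a few minutes after incident_detected.
// If nobody has acknowledged, it emits a new event instead of paging directly,
// so the escalation path gets the same retries and visibility as the original alert.
async function checkAcknowledgement(
  incidentId: string,
  isAcknowledged: (id: string) => Promise<boolean>,
  publish: (topic: string, payload: unknown) => Promise<void>,
): Promise<void> {
  if (await isAcknowledged(incidentId)) {
    return; // Someone is already on it.
  }
  await publish("escalation_needed", {
    incidentId,
    reason: "no_acknowledgement_within_sla",
    escalatedAt: new Date().toISOString(),
  });
}
```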

What changed

  • Zero missed alerts since rollout
  • Mean time to resolution dropped by half
  • On-call engineers responded faster and with more context
  • Alert delivery became observable and configurable
  • Escalation became automatic, not manual

Why it matters

As AI systems detect more problems, reliable delivery matters as much as detection. If a critical signal doesn’t reach a human, it doesn’t matter how smart your monitoring is.

This team didn’t add more tools. They connected the ones they already had. They treated alert delivery as a first-class workflow, not a background job. The result was reliability they could see and reason about — and trust that critical alerts would never be silently dropped again.