Paging alerts start from user-visible symptoms, not self-healed transients

ops-020

Intent

Reduce alert fatigue by paging on degraded user experience rather than raw infrastructure signals or auto-remediated transient events.

Applicability

Applies when the diff adds or changes paging alert definitions.

What to inspect

Alert queries, severity routing, incident policies, and whether the page measures user-visible harm or a self-healed transient.
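Severity routing often decides whether an alert actually pages, so it helps to read the routing tree alongside the alert rule. The fragment below is a minimal sketch that assumes Alertmanager is the router; the receiver names oncall-pager and team-tickets are hypothetical, and a real configuration may live outside the diff.

```yaml
# Sketch only: assumes Alertmanager; receiver names are hypothetical.
route:
  receiver: team-tickets          # default: nonpaging
  routes:
    - matchers:
        - severity="page"         # only this label value reaches the pager
      receiver: oncall-pager
receivers:
  - name: team-tickets
  - name: oncall-pager
```

With routing like this, any rule labeled severity: page is in scope for this check, and rules with other severities are not.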

Pass criteria

Pages are driven by user-visible failures, latency harm, or equivalent service-level degradation, while self-healed transients stay nonpaging.

Fail criteria

The diff pages on raw system counters or on transient events the platform normally remediates automatically.

Do not flag

Nonpaging warnings, dashboards, or business-hours triage alerts.

Confidence guidance

HIGH when the paging query and its severity routing are both directly visible in the diff. MEDIUM when user impact is inferred from naming or linked SLI config. LOW when routing is partly defined outside the diff.

Remediation

Drive pages from user-visible symptoms and downgrade self-healed transients to nonpaging signals.
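One way to apply this to the PodRestarted rule from the fail example is to keep the signal but stop it from paging. This is a sketch under assumptions: the severity value ticket is a hypothetical nonpaging level, and the restart threshold is left as in the original rule.

```yaml
alert: PodRestarted
expr: increase(kube_pod_container_status_restarts_total[5m]) > 0
labels:
  severity: ticket   # hypothetical nonpaging severity; restarts are normally self-healed
```

If restarts do end up harming users, a symptom-based page such as the availability rule in the pass example would be expected to catch that.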

Pass example

alert: CheckoutAvailabilityPage
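# Ratio of windowed rates, so the alert tracks the recent failure ratio rather than the lifetime counter ratio.
expr: sum(rate(slo_bad_events_total{journey="checkout"}[5m])) / sum(rate(slo_eligible_events_total{journey="checkout"}[5m])) > 0.01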
labels:
  severity: page
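
Latency harm can be paged on in a similar way. The rule below is an illustrative sketch, not a recommended threshold: the histogram metric http_request_duration_seconds_bucket, the journey label on it, the 2-second limit, and the 10-minute hold are all assumptions.

```yaml
alert: CheckoutLatencyPage
# p99 latency over the last 5 minutes for the checkout journey (hypothetical metric and label).
expr: |
  histogram_quantile(
    0.99,
    sum by (le) (rate(http_request_duration_seconds_bucket{journey="checkout"}[5m]))
  ) > 2
for: 10m   # must stay above the limit for 10 minutes, so brief spikes that recover do not page
labels:
  severity: page
```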

Fail example

alert: PodRestarted
expr: increase(kube_pod_container_status_restarts_total[5m]) > 0
labels:
  severity: page

Sources

  • Observability Engineering (Charity Majors, Liz Fong-Jones, George Miranda; O'Reilly Media, 2022)