Do not infer disk bottlenecks from iowait or utilization alone

perf-012

Intent

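Prevent misleading disk alerts and wasted tuning work driven by summary metrics that do not capture what application threads actually experience. iowait counts idle CPU time with I/O outstanding, so it drops when other work keeps the CPUs busy, and %util only reports that a device had at least one request in flight, which says little about saturation on devices that serve requests in parallel.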

Applicability

Applies when the diff adds or changes storage performance dashboards, alerts, SLOs, or automated diagnostics. Return unknown if the change is not about storage observability or if no disk-related conclusions are being drawn.

What to inspect

Alert rules, dashboard queries, threshold docs, recorded metrics, and any code that labels a condition as a disk bottleneck.

Pass criteria

The change uses request latency, queue depth, blocked-thread time, I/O pressure metrics, or another metric that directly reflects application impact, optionally alongside %util or iowait.
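
As a rough sketch of what such direct metrics can look like, the queries below assume Prometheus node_exporter with its diskstats and pressure collectors enabled. The metric names come from that exporter, not from this rule, and other exporters will differ.

# Average read latency per device, in seconds per completed read
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])

# Average number of requests queued or in flight per device
rate(node_disk_io_time_weighted_seconds_total[5m])

# Fraction of time at least one task was stalled on I/O (Linux PSI "some")
rate(node_pressure_io_waiting_seconds_total[5m])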

Fail criteria

The diff declares or alerts on a disk bottleneck using only iowait, disk utilization, or similar interval summaries, with no direct latency or blocking metric to validate application impact.

Do not flag

Do not flag coarse host-overview dashboards that merely display iowait or utilization as contextual signals. Do not flag repos where a linked shared dashboard already pairs these summaries with direct latency or queueing metrics.

Confidence guidance

HIGH when the alert or rule keys solely off iowait or %util. MEDIUM when summary metrics dominate but a weak corroborating metric exists. LOW when the actual downstream dashboard content is not visible in the repo.

Remediation

Alert on request latency, queue depth, blocked-thread time, or another direct storage pain metric instead of iowait or utilization alone.
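
One way to follow this, sketched under the assumption that node_exporter disk metrics are available, is to keep utilization as a gate but require a latency condition before firing. The alert name, the 50 ms and 90% thresholds, and the for duration are illustrative only.

alert: DiskReadLatencyHighWhileBusy
expr: (rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) > 0.050) and on (instance, device) (rate(node_disk_io_time_seconds_total[5m]) > 0.90)
for: 10m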

Pass example

alert: DiskLatencyHigh
expr: histogram_quantile(0.99, sum by (le, device) (rate(block_io_latency_seconds_bucket[5m]))) > 0.050

Fail example

alert: DiskBottleneck
expr: avg(rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.30

Sources

  • Brendan Gregg, Systems Performance: Enterprise and the Cloud (book)