Investigate

Produces a root-cause report for a symptom: an alert, service, pattern, or pasted log line. The report gives the onset time, the named cause with supporting evidence, what else moved at the same time, and kubectl / curl / PromQL commands to verify the finding. It handles both sudden spikes (the pager just fired) and slow drift (a pattern worsening for weeks).

Example

"spike on payments-svc — root cause?"

Onset: 14:30 today. Payment_Gateway_Timeout jumped 200/min → 45,000/min.

Cause: CPU spike on db-replica-2 matched the onset (r=0.94). What moved with it: db.replica.cpu, apm.payments.latency, kafka.consumer.lag.

Verify: kubectl describe pod db-replica-2

More to ask

  • "why is Retry_Backoff_Exhausted firing?"
  • "slow drift in checkout-svc, last 30 days"
  • "full environment audit, last 30 days"

Prerequisites

This tool requires the Reporter to be deployed. Slow-drift investigations need continuous historical metrics, which CLI-only mode doesn't produce.

Tool schema (advanced)
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| starting_point | string | yes | | What to investigate, in the user's own words: a pasted log line, a pattern name, a service name, or the literal string environment / all / audit for a sweep. |
| window | string | no | 1h | Analysis window. 1h for acute spikes; 30d for drift. Accepts any PromQL range string. Alias: timeRange. |
| timeRange | string | no | | Alias for window for consistency with the other tools. If both are set, window wins. |
| depth | string | no | normal | shallow = anchor service only. normal = anchor + immediate dependencies. deep = full environment-wide. |
| baseline_offset | string | no | 24h / window | Baseline comparison offset. Defaults to 24h for short windows (acute spikes), or window value for long windows (drift). |
| use_bytes | boolean | no | false | Use byte-based rate instead of event-count. Event-count is strongly preferred. |
| environment | string | no | | Environment nickname; required in multi-env setups. |
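
The defaulting rules in the schema can be sketched as a small argument normalizer. This is only an illustration, not the tool's actual implementation: normalize_args and duration_seconds are hypothetical helpers, the 24h cutoff between "short" and "long" windows is an assumption (the schema does not state one), and only single-unit range strings such as 1h or 30d are handled.

```python
import re

# Seconds per unit for simple PromQL-style durations such as "1h" or "30d".
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

# ASSUMPTION: the schema does not say where "short" ends and "long" begins;
# 24h is a placeholder cutoff chosen for this sketch.
LONG_WINDOW_SECONDS = 24 * 3600


def duration_seconds(text):
    """Convert a single-unit duration like '30d' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def normalize_args(args):
    """Fill in defaults the way the schema table describes them."""
    if not args.get("starting_point"):
        raise ValueError("starting_point is required")
    # window wins over its alias timeRange; fall back to the 1h default.
    window = args.get("window") or args.get("timeRange") or "1h"
    depth = args.get("depth", "normal")
    if depth not in ("shallow", "normal", "deep"):
        raise ValueError(f"unknown depth: {depth!r}")
    baseline = args.get("baseline_offset")
    if baseline is None:
        # Short windows compare against 24h ago; long windows against
        # one full window ago.
        is_long = duration_seconds(window) > LONG_WINDOW_SECONDS
        baseline = window if is_long else "24h"
    out = {
        "starting_point": args["starting_point"],
        "window": window,
        "depth": depth,
        "baseline_offset": baseline,
        "use_bytes": bool(args.get("use_bytes", False)),
    }
    if "environment" in args:
        out["environment"] = args["environment"]
    return out
```

Under these assumptions, a call with only starting_point set resolves to window 1h and baseline_offset 24h, while a 30d drift request resolves baseline_offset to 30d.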

Resolver fixes (2026-04-27). The pattern-exists probe now honors the user's window instead of a hardcoded [5m], so a sparse pattern that fired heavily within a 7d window but has been silent for the last 5 minutes still resolves. When the pattern doesn't exist in the requested window, a 30d wide-probe checks whether it exists at all; if it does, the report tells the SRE to widen the window instead of bouncing them to event lookup. Series missing the message_pattern label are filtered out of env-audit movers, so undefined rows no longer appear.
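
The probe order described above can be sketched roughly as follows. This is a minimal illustration rather than the resolver's real code: resolve_pattern, the count_in callback, and the returned labels are all hypothetical names, and falling through to event lookup when the 30d probe is also empty is an assumption inferred from the wording.

```python
def resolve_pattern(pattern, window, count_in):
    """Decide how to proceed for a pattern name.

    count_in(pattern, range_str) is a hypothetical callback that returns
    how many events matched the pattern over the given range.
    """
    # Probe the window the user asked for, not a fixed 5m slice.
    if count_in(pattern, window) > 0:
        return "resolved"
    # Nothing in the requested window: check whether the pattern exists
    # at all over a wide 30d range.
    if count_in(pattern, "30d") > 0:
        return "widen_window"
    # ASSUMPTION: with no hits even at 30d, fall back to event lookup.
    return "event_lookup"


# Usage with made-up counts: a sparse pattern that fired within 7d
# but not in the last hour.
counts = {("Retry_Backoff_Exhausted", "7d"): 1200,
          ("Retry_Backoff_Exhausted", "30d"): 1200}


def fake_count(pattern, range_str):
    return counts.get((pattern, range_str), 0)


print(resolve_pattern("Retry_Backoff_Exhausted", "7d", fake_count))
print(resolve_pattern("Retry_Backoff_Exhausted", "1h", fake_count))
```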