Investigate
Root-cause report on a symptom — alert, service, pattern, or pasted log line. Returns the onset time, the named cause with supporting evidence, what else moved at the same time, and kubectl / curl / PromQL commands to verify. Works for sudden spikes (pager just fired) and slow drift (pattern worsening for weeks).
Example
"spike on payments-svc — root cause?"
Onset: 14:30 today. `Payment_Gateway_Timeout` jumped 200/min → 45,000/min.
Cause: CPU spike on `db-replica-2` matched the onset (r=0.94).
What moved with it: `db.replica.cpu`, `apm.payments.latency`, `kafka.consumer.lag`.
Verify: `kubectl describe pod db-replica-2`
More to ask
- "why is `Retry_Backoff_Exhausted` firing?"
- "slow drift in checkout-svc, last 30 days"
- "full environment audit, last 30 days"
Prerequisites
This tool requires the Reporter to be deployed. Slow-drift investigations need continuous historical metrics, which CLI-only mode doesn't produce.
Tool schema (advanced)
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `starting_point` | string | yes | — | What to investigate, in the user's own words: a pasted log line, a pattern name, a service name, or the literal string `environment` / `all` / `audit` for a sweep. |
| `window` | string | no | `1h` | Analysis window. `1h` for acute spikes; `30d` for drift. Accepts any PromQL range string. Alias: `timeRange`. |
| `timeRange` | string | no | — | Alias for `window`, for consistency with the other tools. If both are set, `window` wins. |
| `depth` | string | no | `normal` | `shallow` = anchor service only. `normal` = anchor + immediate dependencies. `deep` = full environment-wide. |
| `baseline_offset` | string | no | `24h` / `window` | Baseline comparison offset. Defaults to `24h` for short windows (acute spikes), or to the `window` value for long windows (drift). |
| `use_bytes` | boolean | no | `false` | Use byte-based rate instead of event-count. Event-count is strongly preferred. |
| `environment` | string | no | — | Environment nickname. Required in multi-env setups. |
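The defaulting rules in the schema (the `window` / `timeRange` alias, and the `baseline_offset` default that depends on window length) can be sketched roughly as follows. This is an illustrative sketch, not the tool's actual implementation: `resolve_args`, `duration_seconds`, and `LONG_WINDOW_CUTOFF` are hypothetical names, and the 24h boundary between "short" and "long" windows is an assumption, since the schema does not state where that boundary lies.

```python
import re

# Seconds per PromQL duration unit (Prometheus treats 1y as 365d).
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                 "d": 86400, "w": 604800, "y": 31536000}
_PART = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")

# ASSUMPTION: the schema does not define "short" vs "long" windows;
# 24h is a hypothetical cutoff used only for illustration.
LONG_WINDOW_CUTOFF = "24h"


def duration_seconds(text: str) -> float:
    """Parse a PromQL-style duration such as '1h', '30d', or '1h30m'."""
    parts = _PART.findall(text)
    # Reject input that has leftovers the pattern did not consume.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"not a duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def resolve_args(args: dict) -> dict:
    """Fill in the documented defaults for an Investigate call."""
    if not args.get("starting_point"):
        raise ValueError("starting_point is required")
    # window wins over its alias timeRange; the documented default is 1h.
    window = args.get("window") or args.get("timeRange") or "1h"
    is_long = duration_seconds(window) > duration_seconds(LONG_WINDOW_CUTOFF)
    # Short windows compare against 24h ago; long windows against one window ago.
    baseline = args.get("baseline_offset") or (window if is_long else "24h")
    return {
        "starting_point": args["starting_point"],
        "window": window,
        "depth": args.get("depth", "normal"),
        "baseline_offset": baseline,
        "use_bytes": args.get("use_bytes", False),
        "environment": args.get("environment"),
    }
```

For example, `resolve_args({"starting_point": "checkout-svc", "timeRange": "30d"})` would yield a 30d window with a 30d baseline offset under the assumed cutoff, while a call with only `starting_point` falls back to `1h` and `24h`.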
Resolver fixes (2026-04-27). The pattern-exists probe now honors the user's window instead of a hardcoded [5m], so a sparse pattern that fired heavily in a 7d window but is silent in the last 5 min still resolves. When the pattern doesn't exist in the requested window, a 30d wide-probe checks whether it exists at all; if it does, the report tells the SRE to widen the window instead of bouncing them to event lookup. Series missing the message_pattern label are now filtered out of env-audit movers, so undefined rows no longer appear.
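The probe order described in the resolver fixes can be sketched as below. This is a minimal illustration under stated assumptions, not the resolver's real code: `resolve_pattern`, `drop_unlabeled`, and the `exists` callback are hypothetical names, the series shape (a dict with a `labels` mapping) is assumed, and falling back to event lookup when even the 30d probe finds nothing is inferred from the note rather than stated in it.

```python
from typing import Callable, Iterable

WIDE_PROBE_WINDOW = "30d"  # wide-probe range named in the release note


def resolve_pattern(pattern: str, window: str,
                    exists: Callable[[str, str], bool]) -> str:
    """Return 'resolved', 'widen_window', or 'event_lookup'.

    `exists(pattern, window)` is a hypothetical probe supplied by the
    caller; it reports whether the pattern has any series in that window.
    """
    # Probe the user's requested window, not a fixed 5m range.
    if exists(pattern, window):
        return "resolved"
    # Pattern is absent here; check whether it exists at all.
    if exists(pattern, WIDE_PROBE_WINDOW):
        return "widen_window"
    # ASSUMPTION: event lookup is the fallback when the wide probe is empty.
    return "event_lookup"


def drop_unlabeled(series: Iterable[dict]) -> list:
    """Keep only series that carry a non-empty message_pattern label."""
    return [s for s in series if s.get("labels", {}).get("message_pattern")]
```

Passing the probe in as a callback keeps the sketch independent of any particular metrics backend, so the decision order can be exercised with a stub.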