Use Case 1: Investigate a Critical Alert with Full Context Analysis
User Goal
Receive a critical alert and conduct a comprehensive root cause investigation by analyzing alert details, affected resources, topology relationships, metric patterns, and traces.
When to Use This
Use this workflow when:
- A critical alert fires and you need to understand the underlying cause
- You want a comprehensive investigation that considers multiple data sources
- You need to understand not just what broke, but why it broke
How to Start
Launch Copilot → Switch to Root Cause channel
How to Ask Copilot
You can start broad and let the agent gather context:
- “Investigate alert 121898533 and determine the probable root cause”
- “What are the affected resources and their current state?”
- “Show me the topology around this resource — are there upstream dependencies that might be causing this?”
- “What do the metrics look like before and during the alert window?”
- “Have there been similar alerts on this resource or related components recently?”
What Copilot Provides
- Alert Context: Full alert details including severity, subject, resource, metric threshold
- Resource Analysis: Current state and health of affected resources
- Topology Insights: Infrastructure relationships showing dependencies and potential impact chains
- Metric Correlation: Time-series data showing patterns before/during the alert
- Historical Context: Similar past alerts and their resolutions
- Root Cause Hypothesis: Probable root cause based on all gathered evidence with confidence level
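The before/during metric comparison in the list above can be pictured as a simple baseline check: compare values from the alert window against the pre-alert distribution. This is only an illustrative sketch (the sample values, function name, and z-score threshold are hypothetical, not the product's actual correlation algorithm):

```python
from statistics import mean, stdev

def metric_deviates(samples_before, samples_during, z_threshold=3.0):
    """Flag whether alert-window values deviate from the pre-alert baseline.

    samples_before / samples_during: lists of raw metric values.
    z_threshold: standard deviations that count as anomalous (illustrative).
    """
    baseline_mean = mean(samples_before)
    baseline_sd = stdev(samples_before) or 1e-9  # guard flat baselines
    # Score the alert window's average against the baseline distribution
    z = abs(mean(samples_during) - baseline_mean) / baseline_sd
    return z >= z_threshold

# Hypothetical CPU-utilization samples: steady baseline, spike during the alert
before = [41.0, 43.5, 42.2, 44.1, 42.8]
during = [88.9, 91.3, 90.2]
print(metric_deviates(before, during))  # clear spike -> True
```

A real investigation would weigh many metrics and time windows at once; the point here is only that "patterns before/during the alert window" means a statistical comparison, not a visual one.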
What to Ask Next (If Needed)
- “Can you check if the host this VM is running on also has issues?”
- “Show me other resources in this cluster — are they affected too?”
- “Are there any trace errors correlating with this metric spike?”
Actions You Can Take
- Review the evidence chain and validate hypothesis
- Drill into specific metrics or traces mentioned
- Request investigation of related resources
- Use findings to guide remediation
Outcome
You have a clear understanding of what triggered the alert, which components are involved, and a data-backed hypothesis of the probable root cause to guide remediation.
Use Case 2: Analyze Correlated Alerts (Inference Investigation)
User Goal
Multiple related alerts have been correlated into an inference. Investigate the inference to understand the single underlying issue causing all the alerts.
When to Use This
Use this workflow when:
- Multiple alerts on the same or related resources have been grouped as an inference
- You need to understand the relationship between correlated alerts
- You want to find the single root cause affecting multiple components
How to Start
Launch Copilot → Switch to Root Cause channel. Note: when you start the investigation, the Probable Root Cause Agent automatically fetches any insights already generated for the inference alerts.
How to Ask Copilot
Start with the inference and narrow down systematically:
- “Analyze alert 121898533 and determine the root cause”
- “What are all the alerts included in this inference?”
- “Show me the timeline — which alert fired first and what followed?”
- “What resources are affected across these alerts?”
- “Is there a common component or dependency causing all these issues?”
- “Show me the topology view of all affected resources”
What Copilot Provides
- Inference Summary: All correlated alerts with their relationships
- Timeline Analysis: Chronological order of alert firing to identify origin
- Resource Mapping: All affected resources and their interdependencies
- Topology Visualization: Infrastructure view showing how resources relate
- Common Patterns: Shared metrics, components, or events across alerts
- Root Cause Analysis: The underlying issue triggering the cascade of alerts
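The timeline step above reduces to ordering the correlated alerts by firing time: the earliest alert is the most likely origin of the cascade. A minimal sketch, assuming a simplified alert record (the IDs, resource names, and `fired_at` field are hypothetical, not the product's data model):

```python
from datetime import datetime

# Hypothetical correlated alerts grouped under one inference
inference_alerts = [
    {"id": "A-3", "resource": "payment-api", "fired_at": "2024-05-01T10:04:12"},
    {"id": "A-1", "resource": "mysql-primary", "fired_at": "2024-05-01T10:01:03"},
    {"id": "A-2", "resource": "checkout-svc", "fired_at": "2024-05-01T10:02:47"},
]

def timeline(alerts):
    """Return alerts in firing order; the earliest is the likely origin."""
    return sorted(alerts, key=lambda a: datetime.fromisoformat(a["fired_at"]))

ordered = timeline(inference_alerts)
print(ordered[0]["resource"])  # earliest alert fired on -> mysql-primary
```

Firing order alone is a heuristic, which is why Copilot also cross-checks topology and shared components before naming the origin.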
What to Ask Next (If Needed)
- “Which alert in this inference represents the actual root cause?”
- “Are there metric patterns showing this started from a specific component?”
- “Have we seen this pattern of alerts together before?”
- “Could this be a cascading failure from a single point?”
Actions You Can Take
- Identify the primary alert representing the root cause
- Investigate the origin component more deeply
- Understand blast radius and affected services
- Plan remediation targeting the root cause
Outcome
You understand how multiple alerts relate to each other, identify the single underlying issue, and can focus remediation on the actual root cause rather than symptoms.
Use Case 3: Service-Level Root Cause with Trace Analysis
User Goal
A service is experiencing errors or latency issues. Use distributed traces, service maps, and dependencies to pinpoint the failing component, slow operation, or infrastructure bottleneck.
When to Use This
Use this workflow when:
- Application/service alerts fire (errors, latency, throughput drops)
- The issue is suspected to span a multi-service architecture
- Traces and eBPF data are available
- You need to identify application-level or network-level issues
How to Start
Launch Copilot → Switch to Root Cause channel
How to Ask Copilot
Provide service context and let the agent map dependencies:
- “My payment-api-service had high errors in the last hour — investigate the root cause”
- “Get the service overview and show me the full service map”
- “Which downstream services or databases is payment-api-service calling?”
- “Analyze traces for payment-api-service — which operations are failing or slow?”
- “Show me error stack traces and identify the bottleneck”
- “Are there network latency issues between services based on eBPF data?”
What Copilot Provides
- Service Overview: Health, throughput, error rate, latency percentiles
- Service Map: Full dependency graph showing upstream and downstream services
- Trace Analysis:
  - Slow spans and operations
  - Error hotspots with stack traces
  - Request paths showing latency breakdown
- Infrastructure Correlation: Host, network, or database issues affecting the service
- Root Cause Identification: Whether the issue lies in the application, infrastructure, network, or an external dependency
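The trace-analysis step above amounts to scanning a trace's spans for the latency bottleneck and any error hotspots. A minimal sketch, with a hypothetical span shape and durations (real span data carries far more fields):

```python
# Hypothetical spans from one payment-api-service trace (durations in ms)
spans = [
    {"operation": "HTTP POST /pay", "duration_ms": 820, "error": False},
    {"operation": "db.query orders", "duration_ms": 640, "error": False},
    {"operation": "call fraud-check", "duration_ms": 95, "error": True},
]

def slowest_span(trace_spans):
    """The span with the largest duration is the candidate latency bottleneck."""
    return max(trace_spans, key=lambda s: s["duration_ms"])

def error_spans(trace_spans):
    """Spans flagged as errored are candidate failure hotspots."""
    return [s for s in trace_spans if s["error"]]

print(slowest_span(spans)["operation"])               # -> HTTP POST /pay
print([s["operation"] for s in error_spans(spans)])   # -> ['call fraud-check']
```

Real trace analysis subtracts child-span time (self time) before ranking, since a parent span's duration includes its children; this sketch ranks raw durations for brevity.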
What to Ask Next (If Needed)
- “Is the database service this service calls also showing high latency?”
- “Show me the specific stack trace for the most common error”
- “Are there any failed network calls or timeouts in the traces?”
- “Compare current trace patterns to baseline — what changed?”
- “Is this affecting only certain operations or all traffic?”
Actions You Can Take
- Drill into specific failing operations or endpoints
- Investigate identified bottleneck services or infrastructure
- Check deployment or config changes around error spike time
- Correlate with infrastructure metrics (CPU, memory, network)
Outcome
You can clearly identify which service, operation, or infrastructure component is the root cause, with trace-level evidence showing exactly where failures or slowdowns occur.
Example Use Cases
The same investigation pattern applies to scenarios such as:
- Identifying the most impacted resource and correlating alerts
- Analyzing an alert to understand why it fired
- Analyzing and resolving a ticket
- Understanding alert trends over time
- Investigating network device issues
- Understanding policy violations