Use Case 1: Investigate a Critical Alert with Full Context Analysis
User Goal
Receive a critical alert and conduct a comprehensive root cause investigation by analyzing alert details, affected resources, topology relationships, metric patterns, and traces.
When to Use This
Use this workflow when:
- A critical alert fires and you need to understand the underlying cause
- You want a comprehensive investigation that considers multiple data sources
- You need to understand not just what broke, but why it broke
How to Start
Launch Copilot → Switch to Root Cause channel
How to Ask Copilot
You can start broad and let the agent gather context:
- “Investigate alert 121898533 and determine the probable root cause”
- “What are the affected resources and their current state?”
- “Show me the topology around this resource — are there upstream dependencies that might be causing this?”
- “What do the metrics look like before and during the alert window?”
- “Have there been similar alerts on this resource or related components recently?”
What Copilot Provides
- Alert Context: Full alert details including severity, subject, resource, metric threshold
- Resource Analysis: Current state and health of affected resources
- Topology Insights: Infrastructure relationships showing dependencies and potential impact chains
- Metric Correlation: Time-series data showing patterns before/during the alert
- Historical Context: Similar past alerts and their resolutions
- Root Cause Hypothesis: Probable root cause based on all gathered evidence with confidence level
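The before/during metric comparison in the list above can be pictured as a simple baseline check: compare values from the alert window against the pre-alert distribution. This is only an illustrative sketch (the sample values, function name, and z-score threshold are hypothetical, not the product's actual correlation algorithm):

```python
from statistics import mean, stdev

def metric_deviates(samples_before, samples_during, z_threshold=3.0):
    """Flag whether alert-window values deviate from the pre-alert baseline.

    samples_before / samples_during: lists of raw metric values.
    z_threshold: standard deviations that count as anomalous (illustrative).
    """
    baseline_mean = mean(samples_before)
    baseline_sd = stdev(samples_before) or 1e-9  # guard flat baselines
    # Score the alert window's average against the baseline distribution
    z = abs(mean(samples_during) - baseline_mean) / baseline_sd
    return z >= z_threshold

# Hypothetical CPU-utilization samples: steady baseline, spike during the alert
before = [41.0, 43.5, 42.2, 44.1, 42.8]
during = [88.9, 91.3, 90.2]
print(metric_deviates(before, during))  # clear spike -> True
```

A real investigation would weigh many metrics and time windows at once; the point here is only that "patterns before/during the alert window" means a statistical comparison, not a visual one.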
What to Ask Next (If Needed)
- “Can you check if the host this VM is running on also has issues?”
- “Show me other resources in this cluster — are they affected too?”
- “Are there any trace errors correlating with this metric spike?”
Actions You Can Take
- Review the evidence chain and validate hypothesis
- Drill into specific metrics or traces mentioned
- Request investigation of related resources
- Use findings to guide remediation
Outcome
You have a clear understanding of what triggered the alert, which components are involved, and a data-backed hypothesis of the probable root cause to guide remediation.
Use Case 2: Analyze Correlated Alerts (Inference Investigation)
User Goal
Multiple related alerts have been correlated into an inference. Investigate the inference to understand the single underlying issue causing all the alerts.
When to Use This
Use this workflow when:
- Multiple alerts on the same or related resources have been grouped as an inference
- You need to understand the relationship between correlated alerts
- You want to find the single root cause affecting multiple components
How to Start
Launch Copilot → Switch to Root Cause channel. Note: when you start the investigation, the Probable Root Cause Agent automatically fetches any insights already generated for the inference alerts.
How to Ask Copilot
Start with the inference and narrow down systematically:
- “Analyze alert 121898533 and determine the root cause”
- “What are all the alerts included in this inference?”
- “Show me the timeline — which alert fired first and what followed?”
- “What resources are affected across these alerts?”
- “Is there a common component or dependency causing all these issues?”
- “Show me the topology view of all affected resources”
What Copilot Provides
- Inference Summary: All correlated alerts with their relationships
- Timeline Analysis: Chronological order of alert firing to identify origin
- Resource Mapping: All affected resources and their interdependencies
- Topology Visualization: Infrastructure view showing how resources relate
- Common Patterns: Shared metrics, components, or events across alerts
- Root Cause Analysis: The underlying issue triggering the cascade of alerts
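The timeline step above reduces to ordering the correlated alerts by firing time: the earliest alert is the most likely origin of the cascade. A minimal sketch, assuming a simplified alert record (the IDs, resource names, and `fired_at` field are hypothetical, not the product's data model):

```python
from datetime import datetime

# Hypothetical correlated alerts grouped under one inference
inference_alerts = [
    {"id": "A-3", "resource": "payment-api", "fired_at": "2024-05-01T10:04:12"},
    {"id": "A-1", "resource": "mysql-primary", "fired_at": "2024-05-01T10:01:03"},
    {"id": "A-2", "resource": "checkout-svc", "fired_at": "2024-05-01T10:02:47"},
]

def timeline(alerts):
    """Return alerts in firing order; the earliest is the likely origin."""
    return sorted(alerts, key=lambda a: datetime.fromisoformat(a["fired_at"]))

ordered = timeline(inference_alerts)
print(ordered[0]["resource"])  # earliest alert fired on -> mysql-primary
```

Firing order alone is a heuristic, which is why Copilot also cross-checks topology and shared components before naming the origin.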
What to Ask Next (If Needed)
- “Which alert in this inference represents the actual root cause?”
- “Are there metric patterns showing this started from a specific component?”
- “Have we seen this pattern of alerts together before?”
- “Could this be a cascading failure from a single point?”
Actions You Can Take
- Identify the primary alert representing the root cause
- Investigate the origin component more deeply
- Understand blast radius and affected services
- Plan remediation targeting the root cause
Outcome
You understand how multiple alerts relate to each other, identify the single underlying issue, and can focus remediation on the actual root cause rather than symptoms.
Use Case 3: Service-Level Root Cause with Trace Analysis
User Goal
A service is experiencing errors or latency issues. Use distributed traces, service maps, and dependencies to pinpoint the failing component, slow operation, or infrastructure bottleneck.
When to Use This
Use this workflow when:
- Application/service alerts fire (errors, latency, throughput drops)
- The issue is suspected to span a multi-service architecture
- Traces and eBPF data are available
- You need to identify application-level or network-level issues
How to Start
Launch Copilot → Switch to Root Cause channel
How to Ask Copilot
Provide service context and let the agent map dependencies:
- “My payment-api-service had high errors in the last hour — investigate the root cause”
- “Get the service overview and show me the full service map”
- “Which downstream services or databases is payment-api-service calling?”
- “Analyze traces for payment-api-service — which operations are failing or slow?”
- “Show me error stack traces and identify the bottleneck”
- “Are there network latency issues between services based on eBPF data?”
What Copilot Provides
- Service Overview: Health, throughput, error rate, latency percentiles
- Service Map: Full dependency graph showing upstream and downstream services
- Trace Analysis:
  - Slow spans and operations
  - Error hotspots with stack traces
  - Request paths showing latency breakdown
- Infrastructure Correlation: Host, network, or database issues affecting the service
- Root Cause Identification: Whether the issue lies in the application, infrastructure, network, or an external dependency
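The trace-analysis step above amounts to scanning a trace's spans for the latency bottleneck and any error hotspots. A minimal sketch, with a hypothetical span shape and durations (real span data carries far more fields):

```python
# Hypothetical spans from one payment-api-service trace (durations in ms)
spans = [
    {"operation": "HTTP POST /pay", "duration_ms": 820, "error": False},
    {"operation": "db.query orders", "duration_ms": 640, "error": False},
    {"operation": "call fraud-check", "duration_ms": 95, "error": True},
]

def slowest_span(trace_spans):
    """The span with the largest duration is the candidate latency bottleneck."""
    return max(trace_spans, key=lambda s: s["duration_ms"])

def error_spans(trace_spans):
    """Spans flagged as errored are candidate failure hotspots."""
    return [s for s in trace_spans if s["error"]]

print(slowest_span(spans)["operation"])               # -> HTTP POST /pay
print([s["operation"] for s in error_spans(spans)])   # -> ['call fraud-check']
```

Real trace analysis subtracts child-span time (self time) before ranking, since a parent span's duration includes its children; this sketch ranks raw durations for brevity.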
What to Ask Next (If Needed)
- “Is the database service this service calls also showing high latency?”
- “Show me the specific stack trace for the most common error”
- “Are there any failed network calls or timeouts in the traces?”
- “Compare current trace patterns to baseline — what changed?”
- “Is this affecting only certain operations or all traffic?”
Actions You Can Take
- Drill into specific failing operations or endpoints
- Investigate identified bottleneck services or infrastructure
- Check deployment or config changes around error spike time
- Correlate with infrastructure metrics (CPU, memory, network)
Outcome
You can clearly identify which service, operation, or infrastructure component is the root cause, with trace-level evidence showing exactly where failures or slowdowns occur.
Example Use Cases
The same investigation pattern applies to scenarios such as:
- Identifying the most impacted resource and correlating alerts
- Analyzing an alert to understand why it fired
- Analyzing and resolving a ticket
- Understanding alert trends over time
- Investigating network device issues
- Understanding policy violations