Machine learning (ML) is used to find repeated alert sequence patterns that can be correlated. You can specify an alert correlation policy that controls how alerts are correlated.
Approach to alert correlation
Alert correlation best practice recommends a four-step approach.
Step 1: Enable the correlation policy in observed mode
Create one policy in Observed mode. The policy performs ML correlation based on the alert metric sequence. If service groups or device groups are configured in the system, also set the policy to require an identical service group or device group. Otherwise, leave those settings empty.
Step 2: Observe the correlation
Full alert transparency is provided so you can see all of the alerts that were involved in an alert correlation sequence. Observe the alert sequence patterns and correlation results to determine if alerting accurately reflects the anomalies reported in your environment and supports recovery from fault conditions. When you are satisfied that alerts are accurately reported, you can fully enable alert correlation or fine-tune the alert correlation policy as needed.
Step 3: Fine-tune the correlation policy
If the observed alert correlation policy produces alert notifications that accurately reflect system fault conditions, you might not need to fine-tune the policy. This is usually the case: out-of-the-box alert correlation successfully handles alert sequences. Some environments impose unique requirements on alert correlation, so you might need to fine-tune the correlation. Possible scenarios and solutions include:
Observe the alert sequence views. If existing data indicates that there are more sequences than ML discovered, add regex-based sequences to the training file.
If correlation should be limited to alerts from a single device, such as CPU, memory, and disk alert sequences, configure the policy to require an identical resource.
For generic alerts, more detailed information might need to be extracted from the alert subject or description to refine the sequence pattern.
An example is an SNMP trap alert, which has an SNMP Trap metric that does not provide specific information about the problem. More specific problem information is embedded in the alert subject. Use the Alert Enrichment policy to extract problem area information from the subject or description. ML then interprets the problem area sequence instead of the metric sequence.
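The enrichment idea for generic alerts can be illustrated with a minimal sketch. It assumes alerts are plain dictionaries with `metric`, `subject`, and `description` keys, and the rule list and problem area names are hypothetical, not the product's Alert Enrichment syntax.

```python
import re

# Hypothetical enrichment rules: (pattern, problem area). Illustration only,
# not the product's Alert Enrichment policy format.
ENRICHMENT_RULES = [
    (re.compile(r"link\s*down", re.IGNORECASE), "link_down"),
    (re.compile(r"power supply", re.IGNORECASE), "power_supply"),
    (re.compile(r"fan (fail|fault)", re.IGNORECASE), "fan_failure"),
]


def problem_area(alert: dict) -> str:
    """Return a problem area extracted from the subject or description.

    Falls back to the alert's metric name when no rule matches.
    """
    text = f"{alert.get('subject', '')} {alert.get('description', '')}"
    for pattern, area in ENRICHMENT_RULES:
        if pattern.search(text):
            return area
    return alert["metric"]


trap = {"metric": "SNMP Trap", "subject": "Interface Gi0/1 link down on core-sw-01"}
generic = {"metric": "SNMP Trap", "subject": "Trap received"}

print(problem_area(trap))
print(problem_area(generic))
```

With rules like these, two SNMP trap alerts that share the same metric name can contribute different problem areas to a sequence pattern.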
Step 4: Fully enable alert correlation
If there are multiple correlation policies, give the more direct policies higher precedence. For example, a policy that uses ML correlation with an identical resource, service group, or device group should rank higher than policies that use topology.
Switch the alert correlation policy from Observed to ON to fully enable alert correlation.
Alert correlation factors
Several factors affect alert correlation.
Co-occurrence clusters alerts based on the time they are received. The gap between adjacent alerts determines the sequence pattern start and end, with a default gap of five minutes.
When you create a resource with site information, alert correlation automatically checks that correlated alerts are on the same site.
The problem area can be extracted from the alert subject or description using Alert Enrichment. This overrides the default metric name setting in the alert. The updated problem area is subsequently used in ML sequence patterns.
The Alert Enrichment policy is configurable in the UI when the Alert Enrichment add-on is added for the partner or client. After creating or updating Alert Enrichment policies, the ML model needs to be retrained.
Enriching new data and inferring new patterns takes time, so the impact of enrichment is not immediately evident. Alert Enrichment enriches only new alerts, not existing alerts.
ML uses alert sequencing, which utilizes the alert problem area and component attributes of discovered sources. By default, the component is not taken into account for alert-level integrations.
A training file can be used to train the model with known sequences. The training file is needed only when alert sequences must be added to those already learned or when specific alert sequences must be omitted.
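The co-occurrence factor described above, where the gap between adjacent alerts marks the start and end of a sequence, can be sketched as follows. This is a minimal illustration, not the product's algorithm. The `received` and `metric` keys are assumed names, and treating a gap of exactly five minutes as part of the same sequence is also an assumption.

```python
from datetime import datetime, timedelta


def split_by_gap(alerts, max_gap=timedelta(minutes=5)):
    """Group alerts into sequences ordered by receive time.

    A new sequence starts whenever the gap to the previous alert
    exceeds max_gap (assumed default: five minutes).
    """
    sequences = []
    for alert in sorted(alerts, key=lambda a: a["received"]):
        if sequences and alert["received"] - sequences[-1][-1]["received"] <= max_gap:
            sequences[-1].append(alert)
        else:
            sequences.append([alert])
    return sequences


t0 = datetime(2024, 1, 1, 12, 0)
alerts = [
    {"metric": "ping.loss", "received": t0},
    {"metric": "cpu.util", "received": t0 + timedelta(minutes=3)},
    {"metric": "disk.util", "received": t0 + timedelta(minutes=7)},
    {"metric": "ping.loss", "received": t0 + timedelta(minutes=20)},
]

sequences = split_by_gap(alerts)
for seq in sequences:
    print([a["metric"] for a in seq])
```

The first three alerts are each within five minutes of the previous one, so they form one sequence even though the first and third are seven minutes apart. The last alert arrives after a 13-minute gap and starts a new sequence.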
Policy precedence order
When defining an alert correlation policy, consider the order of evaluation as specified by the precedence value. Any filter clauses must also be considered when specifying precedence.
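How precedence and filter clauses interact can be sketched under two stated assumptions that the text above does not confirm: a lower precedence value is evaluated first, and the first policy whose filter matches handles the alert. The `Policy` class and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Policy:
    name: str
    precedence: int                  # assumption: lower value is evaluated first
    matches: Callable[[dict], bool]  # stand-in for the policy's filter clauses


def first_matching_policy(alert: dict, policies: List[Policy]) -> Optional[Policy]:
    """Return the first policy, in precedence order, whose filter accepts the alert."""
    for policy in sorted(policies, key=lambda p: p.precedence):
        if policy.matches(alert):
            return policy
    return None


policies = [
    Policy("topology", 2, lambda a: True),
    Policy("identical-resource", 1, lambda a: a.get("resource_type") == "server"),
]

server_alert = {"resource_type": "server"}
switch_alert = {"resource_type": "switch"}

print(first_matching_policy(server_alert, policies).name)
print(first_matching_policy(switch_alert, policies).name)
```

In this sketch a narrow filter on a high-precedence policy lets unmatched alerts fall through to a broader policy, which is why filters and precedence need to be planned together.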
Summary of alert correlation mechanisms
Alert correlation incorporates several correlation mechanisms.
The following policy modes are supported:
|ON||The policy drives automated actions on alerts.|
|OFF||The policy is inactive and does not affect alerts. You can use this mode to review a newly defined policy before choosing one of the other modes.|
|Recommend||The policy creates a recommendation for actions that you should take on the alert. Recommendations are based on learned patterns in historical alerts. The recommendation includes a link to take the action.|
|Observed||This mode permits you to simulate a policy without affecting alerts. The policy creates an observed alert, which simulates the original alert. The observed alert shows the actions that would be taken on the original alert if the policy were in ON mode.|
Recommend and Observed modes apply to incident actions.
Filter criteria setting
This setting filters alerts that you do not want correlated with other alerts covered by the same policy.
Inference subject setting
By default, an inference uses the subject of the alert with the earliest created date. You can optionally specify a subject to override the default subject.
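The default subject rule can be sketched as below. The `created` and `subject` keys are assumed names, and the override is treated as a plain string. Token substitution in the Inference subject field is not modeled here.

```python
from datetime import datetime
from typing import List, Optional


def inference_subject(alerts: List[dict], override: Optional[str] = None) -> str:
    """Pick the inference subject.

    Uses the override when one is given; otherwise uses the subject of
    the alert with the earliest created date.
    """
    if override:
        return override
    earliest = min(alerts, key=lambda a: a["created"])
    return earliest["subject"]


correlated = [
    {"subject": "Server app-01 unreachable", "created": datetime(2024, 1, 1, 12, 2)},
    {"subject": "Switch sw-01 down", "created": datetime(2024, 1, 1, 12, 0)},
]

print(inference_subject(correlated))
print(inference_subject(correlated, override="Network outage in rack 4"))
```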
Below are the supported tokens in the Inference subject field:
The correlation algorithm correlates alerts that occur near the same time and learns common alert sequences using historical data.
The continuous learning option causes the learning models to be continuously updated using recent data.
Using the advanced option, you can train the alert correlation algorithm to correlate known alert sequences. A training file is used to provide training data.
Time-based sequences correlate alerts that occur in the same time interval. For example, you can use the within time window setting to correlate all alerts that occur within a five-minute interval.
Learning reinforcement applies additional criteria in making correlation decisions on learned, trained, and time-based sequences.
Learning reinforcement can use topological relationships. Alerts that occur close in time and come from connected resources are usually related to the same underlying cause. For example, a failed switch can cause a cascade of alerts on downstream servers and applications. In deciding whether to correlate a sequence of alerts into an inference, a higher weight is applied to sequences whose associated resources are topologically related.
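One way to picture topology-based weighting is the sketch below. The weighting formula, the `boost` parameter, and the representation of topology as a list of direct links are all assumptions made for illustration. The product's actual weighting is not described in this document.

```python
from itertools import combinations


def topology_weight(resources, links, boost=0.5):
    """Return a weight of 1.0 plus a boost scaled by topological relatedness.

    Relatedness here is the share of distinct resource pairs in the
    sequence that are directly linked (a hypothetical measure).
    """
    pairs = list(combinations(sorted(set(resources)), 2))
    if not pairs:
        return 1.0
    edges = {frozenset(link) for link in links}
    connected = sum(1 for pair in pairs if frozenset(pair) in edges)
    return 1.0 + boost * connected / len(pairs)


# Hypothetical topology: one switch with two downstream servers.
links = [("switch-1", "server-a"), ("switch-1", "server-b")]

cascade = topology_weight(["switch-1", "server-a", "server-b"], links)
unrelated = topology_weight(["server-a", "server-x"], links)

print(cascade > unrelated)
```

A sequence from the switch and its downstream servers receives a higher weight than a sequence from resources with no known link, which mirrors the failed-switch example.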
Attribute similarity criteria can also be used to correlate sequences. Alerts can be related to the same underlying cause if they:
- Occur at about the same time.
- Have identical or similar attributes.
For example, an application failure can generate multiple alerts that have a similar subject.
Use the alert similarity setting to specify alert similarity criteria.
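Attribute similarity on the alert subject can be sketched with Python's standard `difflib` module. The 0.8 threshold and the choice of `SequenceMatcher` are assumptions for illustration and do not represent the product's similarity measure.

```python
from difflib import SequenceMatcher


def similar_subjects(a: str, b: str, threshold: float = 0.8) -> bool:
    """Return True when two subjects are at least `threshold` similar.

    Similarity is difflib's ratio in [0, 1]; the threshold is hypothetical.
    """
    return SequenceMatcher(None, a, b).ratio() >= threshold


s1 = "Application checkout-service failed on node-1"
s2 = "Application checkout-service failed on node-2"
s3 = "Disk usage above 90% on db-01"

print(similar_subjects(s1, s2))
print(similar_subjects(s1, s3))
```

The two application failure subjects differ only in the node name, so they pass the threshold, while the disk alert does not.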