Anomaly Detection for DevOps: Adding Advanced Analytics to a DevOps Model

Ed. Note: This is part 3 of a three-part series on anomaly detection and the impact it has on a DevOps model. Part 1 examined anomaly detection in performance monitoring and the four possible outcomes of its implementation. Part 2 analyzed various tools DevOps teams can use to detect and respond to anomalies.

Ideally, anomaly detection is neither an isolated monitoring step nor the only factor in deciding whether to issue an alarm or take some action. For the most accurate results, advanced analytics should be applied within a more comprehensive monitoring workflow. Here is one such DevOps model that has worked well for us.

A DevOps Model: The Ideal Monitoring Workflow

Capture infrastructure and application metrics in real time
Apply multiple types of analytics to the observations
Discover deviations in the observed data
Apply structural knowledge such as relationships between components to refine raw analytic results
Assess the results within the context of environmental semantics and other human knowledge (at Metricly, we call this a “policy”)

Applying Analytics to Collected Data

In this DevOps model, raw data is collected via agents and other sources. This data is then accessible to a repertoire of analytics, some generally applicable and others focused on detecting certain types of anomalies. Analytic results, along with other collected data such as attributes, relationships, and configurations, can be enriched with human expertise, sometimes called priors or a priori information.
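As a rough illustration of one analytic in such a repertoire, the sketch below flags an upper deviation when the current value rises above a baseline band learned from recent history. The mean-plus-k-standard-deviations band is a stand-in chosen for brevity, not the analytics Metricly actually uses, and the function name and the parameter k are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def upper_deviation(history: Sequence[float], current: float, k: float = 3.0) -> float:
    """Return how far `current` exceeds the baseline band, or 0.0 if it is within it.

    The band's upper edge is mean(history) + k * stdev(history). This is an
    assumed, deliberately simple baseline model used only for illustration.
    """
    if len(history) < 2:
        return 0.0  # not enough observations to estimate spread
    upper_edge = mean(history) + k * stdev(history)
    return max(0.0, current - upper_edge)


# Hypothetical recent CPU utilization readings (percent)
recent = [50.0, 52.0, 48.0, 51.0, 49.0]
within_band = upper_deviation(recent, 53.0)  # inside the band
above_band = upper_deviation(recent, 60.0)   # above the band
```

A result greater than zero can then be passed along with attributes and relationships to the later policy stage, rather than triggering an alarm on its own.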

Conditional and Rule-Based Alerting Policies

Often, the integration of all this information looks like some sort of decision logic, perhaps a set of conditions to be tested. Conditions can be simple inequalities such as if percent.utilization > 95%, which is just a threshold test. A more interesting condition might be if upper.deviation exists, which tests whether the current set of conditions is abnormal based on some machine learning analytics. Conditions might also include tests of duration, which allow a human to specify that some action should be initiated only if the conditions have lasted for some period of time, say 15 minutes.
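The conditions described above can be sketched as a small policy object that combines a threshold test, an anomaly test, and a duration test. This is a minimal sketch under stated assumptions, not Metricly's policy engine: the Policy class, its field names, and the encoding of upper.deviation as a number greater than zero are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Sample = Dict[str, float]  # one observation: metric name -> value


@dataclass
class Policy:
    """Hypothetical policy: all conditions must hold for a minimum duration."""
    conditions: List[Callable[[Sample], bool]]
    min_duration_s: float                  # e.g. 15 minutes = 900 seconds
    _breach_start: Optional[float] = None  # time the conditions first became true

    def evaluate(self, timestamp: float, sample: Sample) -> bool:
        """Return True once every condition has held for min_duration_s."""
        if all(cond(sample) for cond in self.conditions):
            if self._breach_start is None:
                self._breach_start = timestamp
            return timestamp - self._breach_start >= self.min_duration_s
        self._breach_start = None  # any miss resets the clock
        return False


def high_utilization(s: Sample) -> bool:
    """Simple threshold test: percent.utilization > 95%."""
    return s.get("percent.utilization", 0.0) > 95.0


def upper_deviation_exists(s: Sample) -> bool:
    """Anomaly test: assumes an upstream analytic reports a value > 0."""
    return s.get("upper.deviation", 0.0) > 0.0


policy = Policy([high_utilization, upper_deviation_exists], min_duration_s=15 * 60)
busy = {"percent.utilization": 97.0, "upper.deviation": 4.2}
fired_at_start = policy.evaluate(0.0, busy)      # conditions just became true
fired_after_15m = policy.evaluate(900.0, busy)   # conditions held for 15 minutes
```

Requiring both the threshold and the anomaly test, and resetting the clock on any miss, is one way to keep brief spikes from triggering an action.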

Creating Actionable Alarms and/or Notifications

The final stage of the workflow is to trigger an action. Actions can include raising an event, sending an email, or making a scale-up request that ultimately adds nodes to a cluster.
By using analytics together within a workflow such as the one described above, DevOps staff can achieve highly accurate results, minimizing both false positives and false negatives.
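To make the final stage concrete, here is a hedged sketch of dispatching the actions a policy names once it fires. The handlers only return strings; a real system would call an event bus, a mail service, or a cluster API. Every name here is hypothetical.

```python
from typing import Callable, Dict, List

Action = Callable[[str], str]


def raise_event(message: str) -> str:
    return f"EVENT: {message}"      # placeholder for publishing an event


def send_email(message: str) -> str:
    return f"EMAIL: {message}"      # placeholder for sending an email


def request_scale_up(message: str) -> str:
    return f"SCALE-UP: {message}"   # placeholder for a scale-up request


ACTIONS: Dict[str, Action] = {
    "event": raise_event,
    "email": send_email,
    "scale_up": request_scale_up,
}


def trigger(action_names: List[str], message: str) -> List[str]:
    """Run each named action in order and collect what each one reports."""
    results = []
    for name in action_names:
        if name not in ACTIONS:
            raise KeyError(f"unknown action: {name}")
        results.append(ACTIONS[name](message))
    return results


outcomes = trigger(["event", "scale_up"], "cluster utilization abnormal for 15 minutes")
```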


About Metricly

Metricly coaches users throughout their cloud journey to organize, plan, analyze, and optimize their public cloud resources.


About the Author

Elizabeth Nichols

Elizabeth is an active Data Science Advisor @ Metricly.