Three basic ingredients are required for monitoring success:
- Collection of key performance indicators (KPIs) used in alarming rules
- Machine learning technology and proactive anomaly detection
- A consolidated monitoring view that includes performance and exception data
In this blog, we’ll review real user data where Metricly detected a performance issue on the disk a full hour before the disk started having volume errors.
Early Detection
Metricly recently detected a disk issue before the Windows system reported the problem in the form of exception events indicating “disk volume errors.” The one-hour advance notice let the team address the issue proactively, before an outage occurred.
Understanding KPIs for Proactive Monitoring Success
Another key to success is understanding the metrics. Disk queue length is a good leading indicator of a disk issue, which is why we include it in a default alerting policy. At Metricly, we research best practices for monitoring and have created a number of out-of-the-box policies that leverage these KPIs. These policies get our users up and running as soon as they set up a data source.
Proactively Detecting Performance Issues with Machine Learning
In this example, Metricly detected a performance issue on the disk a full hour before the operating system reported seeing disk volume errors.
In the screenshot, you can see Metricly’s machine learning in action. Metricly automatically learns the behaviors of this environment over time and creates bands of normalcy (the purple and green highlights). These bands act like dynamic thresholds: they indicate where the values for this metric are expected to fall at a given point in time. In Metricly, you can create policies that alert on these ranges instead of relying only on static thresholds.
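To make the idea of a dynamic threshold concrete, here is a minimal sketch in Python. It is not Metricly's algorithm, which this post does not describe; it assumes a much simpler stand-in in which the band is the mean of recent samples plus or minus `k` standard deviations. The function names and the sample disk queue length values are hypothetical.

```python
from statistics import mean, stdev


def normal_band(history, k=3.0):
    """Return (low, high): the expected range derived from recent samples.

    Assumption: the band is mean +/- k standard deviations. A production
    system would also account for time of day, day of week, and trend.
    """
    mu = mean(history)
    sigma = stdev(history)
    # Queue length cannot be negative, so clamp the lower edge at zero.
    return max(0.0, mu - k * sigma), mu + k * sigma


def out_of_band(history, value, k=3.0):
    """True when a new sample falls outside the learned band."""
    low, high = normal_band(history, k)
    return value < low or value > high


# Hypothetical disk queue length samples collected during normal operation.
baseline = [0.8, 1.1, 0.9, 1.0, 1.2, 0.9, 1.0, 1.1]

print(out_of_band(baseline, 1.05))  # a typical value stays inside the band
print(out_of_band(baseline, 6.0))   # a sudden spike falls outside it
```

Unlike a static threshold, the band here is recomputed from whatever history is passed in, so the same code produces a different range for a busy database volume than for a mostly idle one.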
In our example, alerts in Metricly fired as soon as the metric values left the expected range. The event generated in Windows Event Viewer, shown on the right side, arrived an hour after Metricly’s alert.
A few minutes later, we saw multiple issues on the disk reported by the Windows subsystem. Metricly can surface all the errors coming from the operating system as events. This is very helpful because you can now see the OS error messages alongside the performance alerts in a single view.
Note that the screenshot above also illustrates the adaptive, self-learning capability of the Metricly machine learning engine. It is designed to alarm when it detects an operationally relevant deviation, but only for a period of time, after which the algorithms learn the new behavior. This avoids excessive alarming, and it also avoids alarming every day or every week when the behavior is caused by, for example, a daily or weekly backup.
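The "alarm for a while, then learn the new behavior" pattern can be sketched as follows. This is only an illustration under stated assumptions, not Metricly's engine: it uses a fixed-size sliding window as the learned baseline, so a sustained shift raises alarms at first and then stops once the window has absorbed the new level. The class name, window size, `k`, and the minimum-spread floor are all hypothetical choices.

```python
from collections import deque
from statistics import mean, stdev


class AdaptiveBaseline:
    """Sliding-window baseline that gradually adapts to sustained changes."""

    def __init__(self, window=10, k=2.0, min_sigma=0.1):
        self.samples = deque(maxlen=window)  # oldest samples drop off
        self.k = k
        self.min_sigma = min_sigma  # floor so a flat history still has a band

    def observe(self, value):
        """Return True if value deviates from the baseline, then learn it."""
        alarm = False
        if len(self.samples) >= 3:  # need some history before judging
            mu = mean(self.samples)
            sigma = max(stdev(self.samples), self.min_sigma)
            alarm = abs(value - mu) > self.k * sigma
        # Learn every sample, including anomalous ones, so a lasting
        # change in behavior eventually becomes the new normal.
        self.samples.append(value)
        return alarm


# Hypothetical metric: steady at 1.0, then a lasting shift to 5.0.
detector = AdaptiveBaseline()
series = [1.0] * 10 + [5.0] * 10
alarms = [detector.observe(v) for v in series]
print(alarms)
```

In this sketch the alarms appear right after the shift and then go quiet as the window fills with the new values. Handling a recurring event such as a weekly backup would additionally require a seasonal baseline, which this example leaves out.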