3 reasons your monitoring and alert system isn't working

PagerDuty response requested

If these are the dreaded words you see before spending hours fixing your product, you’re not alone. Incidents happen to the best of companies and products, no matter their size or scale. If you don’t believe that, take a look at the public status pages of Cloudflare or AWS. These issues are caught by well-defined alerting and monitoring.

But there’s such a thing as too many alerts. While sensitive alerts can ensure you catch issues as quickly as possible, they can also cause alert fatigue, a phenomenon where on-call employees are so sick of getting false alerts for outages or issues that they disregard a legitimate alert when an incident happens.

This is a real, worrying scenario. Even Google acknowledges this as an issue that affects its on-call SREs (Site Reliability Engineers, whose job is solely dedicated to keeping services online and running well).

There are a number of possible causes for alert fatigue, and a number of solutions—but there is no “one size fits all” approach. Some causes are:

Misconfigured Alerts
High On-call Workload
Poorly Defined Procedures

1. Misconfigured Alerts

One cause of alert fatigue is misconfigured alerts. Alerting should be relevant to your business, appropriate for any SLAs you must meet (Service Level Agreements, which define the percentage of time you legally assure your customers that your product is online and working), and above all, not too sensitive.

If an alert is too sensitive, it can result in false positives, and the more times employees need to click the dismiss alert button, the greater the chance they’ve just dismissed an alert that is real and impacts actual systems and customers. However, an alert that is not sensitive enough is just as much of a problem. Alerting should be tuned to detect and report errors quickly. Nothing is worse than a customer calling to inform you that a service is down before your monitoring catches it.

One way to ensure alerts are tuned properly is to test them in a development environment. Simulate a failure of a service, errors being returned, or extremely slow response times and see if your monitoring fires. If not, you probably need to rework it.
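To make that kind of test concrete, here is a minimal sketch in Python. It is not tied to any particular monitoring product: the `ErrorRateAlert` class, its `window_size`, `threshold`, and `min_samples` parameters, and the `simulate` helper are all hypothetical names chosen for illustration. The idea is to feed synthetic traffic into an alert rule and check that it stays quiet for healthy traffic and small blips, but fires during a simulated outage.

```python
from collections import deque


class ErrorRateAlert:
    """Hypothetical rule: fires when the error rate over recent requests crosses a threshold."""

    def __init__(self, window_size=100, threshold=0.05, min_samples=20):
        self.window = deque(maxlen=window_size)  # True means the request failed
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, failed):
        self.window.append(bool(failed))

    def error_rate(self):
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def is_firing(self):
        # Require a minimum number of samples so a single early failure does not page anyone.
        if len(self.window) < self.min_samples:
            return False
        return self.error_rate() >= self.threshold


def simulate(alert, total_requests, failure_every):
    """Feed synthetic traffic where every `failure_every`-th request fails (0 means no failures)."""
    for i in range(1, total_requests + 1):
        failed = failure_every > 0 and i % failure_every == 0
        alert.record(failed)
    return alert.is_firing()


healthy = simulate(ErrorRateAlert(), total_requests=200, failure_every=0)
blip = simulate(ErrorRateAlert(), total_requests=200, failure_every=50)
outage = simulate(ErrorRateAlert(), total_requests=200, failure_every=2)
print(healthy, blip, outage)  # False False True
```

With these assumed settings, the 2% failure rate in the "blip" scenario stays under the 5% threshold, while the simulated outage (every second request failing) trips the alert. If your real rule behaves differently from what you expect in scenarios like these, that is a sign the thresholds need another look.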

Alerts can come from a variety of sources, but logs are often the foremost way to monitor and alert on misbehaving services. Make sure your log ingestion pipeline is well-tuned and ready to ingest and analyze large amounts of log data.
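As a rough illustration of log-based alerting, the sketch below counts recent error entries in a batch of log lines. The log format (an ISO 8601 timestamp, a level, then a message), the five-minute window, and the function names `parse_line` and `count_errors` are assumptions made for this example, not the format or API of any specific logging platform.

```python
import re
from datetime import datetime, timedelta

# Assumed line format: "2024-01-01T12:00:00 LEVEL message"
LOG_LINE = re.compile(r"^(\S+)\s+(DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+(.*)$")


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    try:
        timestamp = datetime.fromisoformat(match.group(1))
    except ValueError:
        return None
    return timestamp, match.group(2), match.group(3)


def count_errors(lines, now, window=timedelta(minutes=5)):
    """Count ERROR and CRITICAL entries logged within `window` before `now`."""
    count = 0
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip malformed lines instead of failing the whole batch
        timestamp, level, _ = parsed
        if level in ("ERROR", "CRITICAL") and now - window <= timestamp <= now:
            count += 1
    return count


sample_logs = [
    "2024-01-01T11:50:00 ERROR payment service timeout",  # outside the window
    "2024-01-01T11:57:00 INFO request served",
    "2024-01-01T11:58:30 ERROR payment service timeout",
    "2024-01-01T11:59:10 CRITICAL database unreachable",
    "not a log line",
]
now = datetime(2024, 1, 1, 12, 0, 0)
recent_errors = count_errors(sample_logs, now)
print(recent_errors)  # 2
```

An alert could then compare `recent_errors` against a threshold. Skipping malformed lines rather than raising is a deliberate choice here: a single garbled entry should not stop the pipeline from noticing a real outage.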

2. High On-call Workload

While some small/medium businesses may not have staff on-call to handle incidents, as soon as a company becomes large enough, on-call staff are a must. On-call staff are staff members that are available and near a method of communication—such as a phone or computer—so that when an incident occurs, they are ready and prepared to fix things.

On-call employees don’t need to be dedicated to monitoring full time, although that’s certainly a possibility. Instead, you can rotate through your regular employees, giving each a set period when they are on-call.

However, on-call employees shouldn’t be contacted for small things. If your on-call staff are notified every time a small issue needs to be fixed, then when a larger issue arrives, they may be distracted, tired, or not operating optimally. Ensure that during on-call periods, on-call staff are only contacted when something is seriously wrong, such as a major issue or roadblock that needs to be fixed right away.
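One way to enforce this is to route alerts by severity, so that only serious problems interrupt the on-call engineer and everything else waits in a queue. The sketch below is a simplified illustration: the `Severity` levels, the `PAGE_THRESHOLD` cut-off, and the `route` function are hypothetical, and where you draw the line depends on your own services and SLAs.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Assumed policy: only HIGH and CRITICAL alerts are worth waking someone up for.
PAGE_THRESHOLD = Severity.HIGH


def route(severity):
    """Return "page" to interrupt the on-call engineer, or "ticket" to queue for working hours."""
    return "page" if severity >= PAGE_THRESHOLD else "ticket"


alerts = [
    ("disk 70% full on a batch host", Severity.LOW),
    ("checkout API returning 500s", Severity.CRITICAL),
    ("nightly report ran 10 minutes late", Severity.MEDIUM),
]
routed = {name: route(severity) for name, severity in alerts}
for name, destination in routed.items():
    print(f"{destination:>6}: {name}")
```

In this example only the failing checkout API pages the on-call engineer. The other two alerts become tickets that can be handled during normal working hours.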

3. Poorly Defined Procedures

You’ve assigned a staff member to be on-call, your properly tuned alerts were triggered, and now you’re aware that an incident is occurring. But wait… what should you do about it?

An incident response plan is essential to ensuring that you respond quickly and effectively if something bad happens. An incident response plan is a set of steps taken each time an incident occurs. It can include notifying customers (such as by posting on a status page), triaging why an issue has occurred, and finally, solving the issue.

Once you have an incident response plan, make sure you practice it regularly. Involve the staff who are going to be on-call so that they know what to do when an issue comes in that they need to investigate and solve.

Conclusion

Nothing is more important during incident response than proper logging. By ensuring your services log any relevant data, you can analyze and filter through past logs to see what went wrong and why.

If you already have a logging solution set up, then we hope these alerting tips helped. But if not, give LogAI a shot—we handle log ingestion, analysis, querying, and storage for you, plus we have a generous free plan. For a limited time, also get a free ebook with logging tips and tricks usable across any logging platform.
