Published November 14, 2024 | 4 min read
PagerDuty response requested
If these are the dreaded words you see before spending hours fixing your product, you’re not alone. Incidents happen to the best of companies and products, no matter their size or scale. If you don’t believe that, take a look at the public status pages of Cloudflare or AWS. Well-defined alerting and monitoring is how these issues get caught.
But there’s such a thing as too many alerts. While sensitive alerts can ensure you catch issues as quickly as possible, they can also cause alert fatigue, a phenomenon where on-call employees are so sick of getting false alerts for outages or issues that they disregard a legitimate alert when an incident happens.
This is a real, worrying scenario. Even Google acknowledges that this happens with its on-call SREs (Site Reliability Engineers, a job solely dedicated to keeping services online and running well).
There are a number of possible causes for alert fatigue, and a number of solutions, but there is no “one size fits all” approach. The most common causes, and what to do about them, are covered below.
One cause of alert fatigue is misconfigured alerts. Alerting should be relevant to your business, appropriate for any SLAs you must meet (Service Level Agreements: a defined percentage of time you legally assure your customers that the product is online and working), and above all, not too sensitive.
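To make the SLA percentage concrete, it helps to convert it into the amount of downtime you can actually afford. The snippet below is a minimal sketch of that arithmetic; the function name and the 30-day period are assumptions for illustration, not part of any particular contract or tool.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime a period can absorb before the SLA is breached."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


# A 99.9% SLA over an assumed 30-day period leaves roughly 43 minutes.
budget = allowed_downtime_minutes(99.9)
print(f"{budget:.1f} minutes of downtime allowed")
```

A budget like this is a useful sanity check when choosing alert thresholds: an alert that takes longer to fire than the budget allows is too slow, no matter how quiet it is.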
If an alert is too sensitive, it can result in false positives, and the more times employees need to click the “dismiss alert” button, the greater the chance they’ve just dismissed an alert that is real and impacts actual systems and customers. However, the opposite is also true: alerting should be tuned to detect and alert on errors quickly. Nothing is worse than a customer calling to tell you that a service is down before your monitoring catches it.
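One common way to balance sensitivity against speed is to alert only when a threshold is breached several checks in a row, so a single noisy reading does not page anyone. Here is a minimal sketch of that idea; the class name, the 5% threshold, and the three-check window are all assumptions chosen for illustration.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fires only when the error rate stays above a threshold for several checks in a row."""

    def __init__(self, threshold: float, required_breaches: int):
        self.threshold = threshold
        # Keep only the most recent `required_breaches` results.
        self.recent = deque(maxlen=required_breaches)

    def observe(self, error_rate: float) -> bool:
        """Record one measurement; return True if the alert should fire."""
        self.recent.append(error_rate > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


alert = SustainedThresholdAlert(threshold=0.05, required_breaches=3)
readings = [0.01, 0.20, 0.02, 0.08, 0.09, 0.12]
fired = [alert.observe(r) for r in readings]
print(fired)  # the lone 0.20 spike is ignored; only the final sustained run fires
```

The tradeoff is detection delay: requiring three breaches means waiting three check intervals before anyone is notified, so the window should be short enough to still fit your response targets.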
One way to ensure alerts are tuned properly is to test them in a development environment. Simulate a failure of a service, errors being returned, or extremely slow response times and see if your monitoring fires. If not, you probably need to rework it.
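A test like this can be as simple as feeding simulated traffic into your alert condition and checking whether it fires. The sketch below assumes a hypothetical `monitor_fires` check with made-up thresholds (5% errors, 500 ms at the 95th percentile); a real setup would run the same scenarios against your actual monitoring tool.

```python
import math


def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def monitor_fires(responses, max_error_rate=0.05, max_p95_ms=500):
    """responses is a list of (status_code, latency_ms) tuples."""
    if not responses:
        return True  # assumption: no traffic at all is treated as an outage
    errors = sum(1 for status, _ in responses if status >= 500)
    error_rate = errors / len(responses)
    latencies = [latency for _, latency in responses]
    return error_rate > max_error_rate or p95(latencies) > max_p95_ms


# Simulated scenarios for a development environment.
healthy = [(200, 120)] * 100
erroring = [(200, 120)] * 80 + [(503, 120)] * 20
slow = [(200, 120)] * 50 + [(200, 3000)] * 50

assert not monitor_fires(healthy)   # no false positive on normal traffic
assert monitor_fires(erroring)      # errors being returned
assert monitor_fires(slow)          # extremely slow response times
assert monitor_fires([])            # total failure of the service
```

Keeping the healthy case in the test matters as much as the failure cases, since it is the one that catches an alert that is too sensitive.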
Alerts can come from a variety of sources, but logs are often the foremost way to monitor and alert on misbehaving services. Make sure your log ingestion pipeline is well-tuned and ready to ingest and analyze large amounts of log data.
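As a rough illustration of how logs turn into an alertable signal, the sketch below computes an error rate from raw log lines. The line format (timestamp, level, message separated by spaces) and the function names are assumptions; your own services and logging platform will likely use a different layout.

```python
from collections import Counter


def level_counts(lines):
    """Count log lines per level, assuming '<timestamp> <LEVEL> <message>'."""
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) >= 2:  # skip blank or malformed lines
            counts[parts[1]] += 1
    return counts


def error_rate(lines):
    """Fraction of parsed log lines at ERROR level."""
    counts = level_counts(lines)
    total = sum(counts.values())
    return counts["ERROR"] / total if total else 0.0


sample = [
    "2024-11-14T10:00:00 INFO request handled",
    "2024-11-14T10:00:01 INFO request handled",
    "2024-11-14T10:00:02 ERROR upstream timeout",
    "2024-11-14T10:00:03 WARN retrying request",
]
print(error_rate(sample))
```

A rate like this, computed over a sliding time window, is the kind of value an alert threshold would then be applied to.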
While some small/medium businesses may not have staff on-call to handle incidents, as soon as a company becomes large enough, on-call staff are a must. On-call staff are staff members that are available and near a method of communication—such as a phone or computer—so that when an incident occurs, they are ready and prepared to fix things.
On-call employees don’t need to be dedicated to monitoring, although that’s certainly a possibility. Instead, rotate through your regular employees, giving each a set time when they are on-call.
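A rotation can be as simple as cycling through a list of names, one shift at a time. The following is a minimal sketch assuming week-long shifts; the engineer names and start date are placeholders, and most paging tools offer their own scheduling that you would use in practice.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return (shift_start_date, engineer) pairs, one week per shift."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


# Hypothetical team and start date.
schedule = build_rotation(["Ana", "Ben", "Chloe"], date(2024, 11, 18), weeks=4)
for shift_start, engineer in schedule:
    print(shift_start.isoformat(), engineer)
```

With three people and week-long shifts, each person is on-call one week in three, which leaves time to recover between shifts.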
However, on-call employees shouldn’t be contacted for small things. If your on-call staff are notified every time a small issue needs to be fixed, then when a larger issue arrives, they may be distracted, tired, or not operating optimally. Ensure that during on-call periods, on-call staff are only contacted when something is seriously wrong, such as a major issue or roadblock that needs to be fixed right away.
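One way to enforce this is to route alerts by severity, so only the most serious ones page a person and everything else waits in a queue for working hours. The sketch below assumes three severity levels and two destinations (`"page"` and `"ticket"`); both are illustrative choices rather than a standard.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


def route(severity: Severity, page_threshold: Severity = Severity.CRITICAL) -> str:
    """Page the on-call engineer only at or above the threshold; otherwise file a ticket."""
    return "page" if severity >= page_threshold else "ticket"


print(route(Severity.WARNING))   # ticket
print(route(Severity.CRITICAL))  # page
```

Making the threshold a parameter lets a team lower it temporarily, for example during a risky launch, without rewriting the routing logic.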
You’ve assigned a staff member to be on-call, your properly-tuned alerts were triggered, and now you’re aware that an incident is occurring. But wait… what should you do about it?
An incident response plan is essential to ensuring that you respond quickly and effectively if something bad happens. An incident response plan is a set of steps taken each time an incident occurs, and can include notifying customers (such as posting on a status page), triaging why the issue has occurred, and finally, solving the issue.
Once you have an incident response plan, make sure you practice it regularly. Also involve staff that are going to be on-call so that they know what to do if an issue comes in that they need to investigate and solve.
Nothing is more important during incident response than proper logging. By ensuring your services log any relevant data, you can analyze and filter through past logs to see what went wrong and why.
If you already have a logging solution set up, then we hope these alerting tips helped. But if not, give LogAI a shot—we handle log ingestion, analysis, querying, and storage for you, plus we have a generous free plan. For a limited time, also get a free ebook with logging tips and tricks usable across any logging platform.
The company indicated on this website is fictional. This website is for a class assignment. All assets used are copyrights of their respective owners and used under permissive licenses.