Incident management is easily one of the most annoying things anyone ever has to deal with. Only a handful of people will ever want to walk into a burning building to put out the fire. It’s the same with most engineering teams: only a handful are willing to get in, find the root cause, and mitigate the incident.
What the heck is an incident anyway?
Formal definition: An incident is an event that is not part of normal operations and that disrupts operational processes.
- Website down? Yeah, that’s an incident
- Server running out of space? Yup, that one too
- An increasing number of transactions failing? Definitely an incident
Basically, anything that interrupts smooth operations and needs you to look into it qualifies as an incident. Put another way, if something is not behaving the way it ideally should, then it’s probably an incident.
Below are some examples of incidents:
|Category|Examples|
|---|---|
|Revenue impacting|Website / app crashes|
| |Server crash and burnouts|
| |Booking and transaction failures|
|Leaving customers furious|SLA breaches|
| |Delayed response times|
| |Dashboards not loading|
|Incidents needing more attention|DB backups failing|
| |Queue memory overloads|
| |DB queries are too slow|
|Good to know incidents|CI/CD alerts|
| |CPU, memory, I/O alerts|
| |Disk space alerts|
A good rule of thumb for working out which incidents you could face: imagine the most critical part of your application, then imagine it failing to do its job. If this ever happens, you want to make sure that you get instant alerts by phone call, SMS, email, Slack, etc. The last thing anyone wants is for these incidents to go unnoticed until work starts the next day.
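To make the idea of "instant alerts" a bit more concrete, here is a minimal sketch of routing an incident to notification channels based on how bad it is. Everything in it is an assumption made for illustration: the `Severity` levels simply mirror the four categories of examples above, and the `CHANNELS` mapping, the `Incident` class, and the `channels_for` function are hypothetical names, not part of any real alerting tool. Your own policy will likely look different.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    """Hypothetical severity levels, one per incident category above."""
    GOOD_TO_KNOW = 1
    NEEDS_ATTENTION = 2
    CUSTOMER_IMPACTING = 3
    REVENUE_IMPACTING = 4


# Assumed routing policy: the worse the incident, the more intrusive the channel.
CHANNELS = {
    Severity.GOOD_TO_KNOW: ["email"],
    Severity.NEEDS_ATTENTION: ["email", "slack"],
    Severity.CUSTOMER_IMPACTING: ["slack", "sms"],
    Severity.REVENUE_IMPACTING: ["slack", "sms", "phone_call"],
}


@dataclass
class Incident:
    title: str
    severity: Severity


def channels_for(incident: Incident) -> List[str]:
    """Return a copy of the channel list configured for this incident's severity."""
    return list(CHANNELS[incident.severity])


if __name__ == "__main__":
    outage = Incident("Website down", Severity.REVENUE_IMPACTING)
    print(outage.title, "->", channels_for(outage))
```

The point of the sketch is only that critical incidents should reach a human right away (a phone call or SMS), while low-priority alerts can wait in an inbox.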