Incident management is easily one of the most annoying things anyone ever has to deal with. Only a handful of people will ever want to walk into a burning building to put out the fire, and most engineering teams are the same: only a handful are willing to get in, find the root cause, and mitigate the incident.

What the heck is an incident anyway?

Formal definition: An incident is an event that is not part of normal operations and that disrupts operational processes.

  • Website down? Yeah, that’s an incident.
  • Server running out of space? Yup, that one too.
  • An increasing number of transactions failing? Definitely an incident.

Basically, anything that interrupts smooth operations and needs you to look into it qualifies as an incident. Or, if something is not behaving the way it ideally should, then it’s probably an incident.

Below are some examples of incidents:

Severity                            Incident
Revenue impacting                   Website / app crashes
                                    Security vulnerabilities
                                    Server crash and burnouts
                                    Booking and transaction failures
Leaving customers furious           SLA breaches
                                    Delayed response times
                                    Dashboards not loading
Incidents needing more attention    DB backups failing
                                    Queue memory overloads
                                    DB queries are too slow
                                    Application errors
Good to know incidents              CI/CD Alerts
                                    CPU, Memory, I/O alerts
                                    Disk space alerts

A good rule for working out which incidents you could get: imagine the most critical part of your application, and then imagine it failing to do its job. If this ever happens, you want to make sure that you get instant alerts by phone call, SMS, email, Slack, etc. The last thing anyone wants is to miss these incidents until work starts the next day.
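
One way to picture this is a small routing table that decides how loudly each kind of incident should alert you. The sketch below is only an illustration. The four severity tiers and the channel names come from this post, but which tier goes to which channel is an assumption, and `Severity`, `ROUTES`, `Incident`, `channels_for`, and `notify` are hypothetical names rather than part of any real alerting tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Severity tiers from the examples earlier, most urgent first."""
    REVENUE_IMPACTING = 1
    CUSTOMERS_FURIOUS = 2
    NEEDS_ATTENTION = 3
    GOOD_TO_KNOW = 4


# Hypothetical routing policy: the channels that fire for each tier.
# The tier-to-channel pairing is an assumption, not a recommendation.
ROUTES = {
    Severity.REVENUE_IMPACTING: ["phone_call", "sms", "slack"],
    Severity.CUSTOMERS_FURIOUS: ["sms", "slack"],
    Severity.NEEDS_ATTENTION: ["slack", "email"],
    Severity.GOOD_TO_KNOW: ["email"],
}


@dataclass
class Incident:
    title: str
    severity: Severity


def channels_for(incident: Incident) -> list[str]:
    """Return the channels that should be notified for this incident."""
    return ROUTES[incident.severity]


def notify(incident: Incident) -> list[str]:
    """Build one message per channel.

    A real system would call a paging, SMS, or chat API here;
    this sketch only formats the messages.
    """
    return [
        f"[{channel}] {incident.severity.name}: {incident.title}"
        for channel in channels_for(incident)
    ]


if __name__ == "__main__":
    for line in notify(Incident("Website down", Severity.REVENUE_IMPACTING)):
        print(line)
```

The point of a table like this is that the most critical failures reach you on the channels you cannot ignore, while the "good to know" alerts stay somewhere quieter, so they do not train you to tune out the urgent ones.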