Routing incidents the way their severity and priority demand

Should this incident get a phone call or a Slack message? Severity and priority answer that question. This guide shows how to build routing rules around both.

By Sreekar

Severity and priority are two labels that describe different things about an incident. Severity covers the blast radius: how much of your system or how many customers are affected. Priority covers the urgency: how quickly someone needs to act. Routing rules then use these labels to load the right escalation policy for each incident.

This guide covers how to define your severity and priority levels and map them to escalation policies, with practical alert routing rule examples throughout.



Why both severity and priority matter

When routing rules look only at severity and ignore priority (or vice versa), incidents get misrouted. A SEV-1 P1 incident needs an immediate phone call. A SEV-1 P3 incident needs attention too, but a Slack message with a few hours to respond is usually enough. Without both signals, these two incidents follow the same escalation path, so one of them gets either a needless phone call or a response that is too slow.

That mismatch is common because severity and priority don’t always align. A distributed cache going down in production affects every dependent service. That is both high severity and high priority. A memory leak in an internal admin dashboard is also high severity if the service eventually crashes. But the priority might be P3 because no customer is affected and a restart buys time until the next working day.

A good way to anchor what each severity or priority level means for your team is to walk through real incident types and ask two questions for each:

  • How wide is the blast radius? (severity)
  • How fast do we need to respond? (priority)

Spike supports three severity levels (SEV-1, SEV-2, SEV-3) and five priority levels (P1 through P5), which gives you enough range to make those distinctions clearly.

Once your team has these severity and priority labels in place, the next step is mapping them to escalation policies. That’s where the severity-priority matrix comes in.


The severity-priority matrix

Which incident would pull you out of bed at 3 AM? That is your critical tier. Everything below it gets a progressively softer response.

         P1        P2        P3        P4        P5
SEV-1    Critical  Critical  Moderate  Moderate  Low
SEV-2    Moderate  Moderate  Moderate  Low       Low
SEV-3    Moderate  Low       Low       Low       Low

Each tier maps to an escalation policy:

  • Critical: phone call to the on-call responder, 5-minute wait before escalation, senior responder in the chain
  • Moderate: push notification or Slack, 15-30 minute wait, standard on-call rotation
  • Low: Slack or email, next business day
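The matrix and tier list above can be expressed as a small lookup table. The Python sketch below is illustrative only: `MATRIX`, `POLICIES`, and `tier_for` are hypothetical names chosen for this example, not part of Spike's product or API.

```python
# Tier for each severity row, ordered P1..P5, copied from the matrix above.
MATRIX = {
    "SEV-1": ["Critical", "Critical", "Moderate", "Moderate", "Low"],
    "SEV-2": ["Moderate", "Moderate", "Moderate", "Low", "Low"],
    "SEV-3": ["Moderate", "Low", "Low", "Low", "Low"],
}

# Short summaries of the three tiers listed above (hypothetical structure).
POLICIES = {
    "Critical": "phone call, 5-minute wait, senior responder in the chain",
    "Moderate": "push notification or Slack, 15-30 minute wait",
    "Low": "Slack or email, next business day",
}


def tier_for(severity: str, priority: str) -> str:
    """Return the matrix tier for labels such as 'SEV-1' and 'P3'."""
    row = MATRIX.get(severity)
    if row is None:
        raise ValueError(f"unknown severity: {severity!r}")
    if not (priority.startswith("P") and priority[1:].isdigit()):
        raise ValueError(f"unknown priority: {priority!r}")
    column = int(priority[1:]) - 1  # 'P1' maps to index 0
    if not 0 <= column < len(row):
        raise ValueError(f"unknown priority: {priority!r}")
    return row[column]


if __name__ == "__main__":
    tier = tier_for("SEV-1", "P4")
    print(tier, "->", POLICIES[tier])
```

Keeping the matrix as data rather than nested conditions makes it easier to shift a boundary later: moving a cell from one tier to another is a one-word change.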

Reserving phone calls for the critical tier keeps that signal strong. When every incident triggers a phone call regardless of tier, responders gradually stop treating them as urgent.

This matrix is a starting framework. Your team can shift the boundaries as incident patterns get clearer.

To learn more about escalation policies, read this guide: A compass for designing your escalation policy


Setting up routing rules based on severity and priority

When your monitoring tool sends severity and priority in the payload, routing rules can read those values and load the right escalation policy directly.

IF severity is SEV-1
AND priority is P1
THEN load → critical escalation policy
(phone call, 5-minute wait time)

[Image: Alert Routing rule created on Spike]

IF severity is SEV-3
THEN load → low-priority escalation policy
(Slack to team channel, next business day)

[Image: Alert Routing rule created on Spike]

Not every monitoring tool sends clean severity and priority values though. In those cases, rules can infer them from other signals in the payload.

A disk usage incident might arrive without any severity label. But the payload carries the mount path and the usage percentage, which is enough to classify it.

IF title contains "disk-usage"
AND incident details [key: "mount"] = "/var/lib/postgresql"
AND incident details [key: "usage_pct"] > 90
THEN mark severity as SEV-1
AND mark priority as P1
AND load → database critical escalation policy

IF title contains "disk-usage"
AND incident details [key: "mount"] = "/tmp"
THEN mark severity as SEV-3
AND mark priority as P5
AND auto-acknowledge

A Postgres data directory at 92% and a temp directory filling up are two very different situations. These two rules give each one a different response automatically.
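To make the evaluation order concrete, here is a minimal Python sketch of the same two disk rules. It assumes the payload arrives as a title string plus a dictionary of incident details. The function name `classify_disk_incident` and the returned tuple shape are hypothetical, and this is not how Spike evaluates rules internally.

```python
from typing import Optional, Tuple


def classify_disk_incident(title: str, details: dict) -> Optional[Tuple[str, str, str]]:
    """Return (severity, priority, action), or None when neither rule matches."""
    if "disk-usage" not in title:
        return None

    mount = details.get("mount")

    if mount == "/var/lib/postgresql":
        # Payload values often arrive as strings, so coerce before comparing.
        try:
            usage = float(details.get("usage_pct"))
        except (TypeError, ValueError):
            return None  # missing or non-numeric usage: leave unclassified
        if usage > 90:
            return ("SEV-1", "P1", "database critical escalation policy")
        return None

    if mount == "/tmp":
        return ("SEV-3", "P5", "auto-acknowledge")

    return None


if __name__ == "__main__":
    print(classify_disk_incident(
        "disk-usage warning",
        {"mount": "/var/lib/postgresql", "usage_pct": "92"},
    ))
```

Returning `None` for anything the rules do not recognise leaves room for a default route instead of silently guessing a label.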

To learn more about payload-based routing, read this guide: How to route incidents based on what their payload says

Environment is another useful signal. The same error from the same service often deserves a different path depending on whether it triggered in production or staging.

IF title contains "order-service"
AND incident details [key: "error"] = "connection_pool_exhausted"
AND incident details [key: "env"] = "production"
THEN mark severity as SEV-1
AND mark priority as P1
AND load → commerce team critical escalation policy

IF title contains "order-service"
AND incident details [key: "error"] = "connection_pool_exhausted"
AND incident details [key: "env"] = "staging"
THEN mark severity as SEV-3
AND mark priority as P4
AND auto-acknowledge

Customer context adds another layer. Enterprise customers with SLA commitments usually warrant a faster response than the same failure on a free-tier account.

IF title contains "notification-service"
AND incident details [key: "http_status"] >= 500
AND incident details [key: "customer_tier"] = "enterprise"
AND incident details [key: "error_rate_pct"] > 5
THEN mark severity as SEV-1
AND mark priority as P1
AND load → enterprise critical escalation policy

[Image: Alert Routing rule created on Spike]

IF title contains "notification-service"
AND incident details [key: "http_status"] >= 500
AND incident details [key: "customer_tier"] = "free"
AND incident details [key: "error_rate_pct"] > 5
THEN mark severity as SEV-2
AND mark priority as P3
AND load → default moderate escalation policy

[Image: Alert Routing rule created on Spike]

Time-of-day conditions also work well alongside severity. A SEV-2 incident during business hours can follow a moderate path. The same incident at midnight often deserves a faster response because fewer people are around to catch things if they get worse.

IF severity is SEV-2
AND time is between 10 PM and 8 AM
THEN mark priority as P2
AND load → off-hours moderate escalation policy
(push notification, 10-minute wait time)

IF severity is SEV-2
AND time is between 8 AM and 10 PM
THEN mark priority as P3
AND load → business hours moderate escalation policy
(Slack + email, 30-minute wait time)
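One detail worth care if you ever script a check like this yourself: the off-hours window crosses midnight, so a simple "start <= now < end" comparison never matches. The Python sketch below shows one way to handle it. The names are hypothetical, and it assumes that exactly 8 AM counts as business hours and exactly 10 PM counts as off-hours, a boundary the rules above leave open.

```python
from datetime import time
from typing import Tuple

OFF_HOURS_START = time(22, 0)  # 10 PM
OFF_HOURS_END = time(8, 0)     # 8 AM


def is_off_hours(now: time) -> bool:
    """True between 10 PM and 8 AM. The window wraps past midnight,
    so the test is 'at or after start OR before end', not AND."""
    return now >= OFF_HOURS_START or now < OFF_HOURS_END


def route_sev2(now: time) -> Tuple[str, str]:
    """Return (priority, escalation policy) for a SEV-2 incident."""
    if is_off_hours(now):
        return ("P2", "off-hours moderate escalation policy")
    return ("P3", "business hours moderate escalation policy")


if __name__ == "__main__":
    print(route_sev2(time(23, 30)))
    print(route_sev2(time(14, 0)))
```

The sketch compares naive local times. A team spread across time zones would also need to decide whose clock defines "off-hours".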

Spike’s Alert Routing supports everything covered in this guide, from severity and priority matching to payload comparators and time-based conditions. If your team is looking to put these ideas into practice, it’s worth giving it a try.


Keeping severity and priority definitions accurate

What counted as SEV-1 P1 six months ago might not deserve that label today. A team that classified every database connection spike as critical early on might find that most of those incidents now resolve on their own after a connection pool autoscaler was added. The incident is the same but the system around it has changed. If the classification doesn’t change with it, your critical escalation policy fires for incidents that no longer need an immediate phone call.

One way to spot this drift is to watch how responders react after picking up an incident. If they routinely downgrade SEV-1 to SEV-2 or P1 to P3, the original labels are probably too aggressive. If they routinely upgrade, the labels are probably too conservative.
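A rough way to quantify this is to count relabels over a review period. The Python sketch below assumes you can export pairs of (original label, final label) from your incident history. The input shape and the name `relabel_summary` are assumptions for illustration, not a Spike export format.

```python
from collections import Counter
from typing import Iterable, Tuple

# Lower number means more urgent, for both severity and priority labels.
RANK = {"SEV-1": 1, "SEV-2": 2, "SEV-3": 3}
RANK.update({f"P{n}": n for n in range(1, 6)})


def relabel_summary(changes: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often responders downgraded, upgraded, or kept a label.

    Each item is (original_label, final_label), e.g. ("SEV-1", "SEV-2").
    """
    counts = Counter()
    for original, final in changes:
        before, after = RANK[original], RANK[final]
        if after > before:
            counts["downgraded"] += 1   # e.g. SEV-1 -> SEV-2, P1 -> P3
        elif after < before:
            counts["upgraded"] += 1     # e.g. P4 -> P2
        else:
            counts["unchanged"] += 1
    return counts


if __name__ == "__main__":
    history = [("SEV-1", "SEV-2"), ("P1", "P3"), ("SEV-2", "SEV-2"), ("P4", "P2")]
    print(relabel_summary(history))
```

A downgrade count that dominates the total suggests the labels are too aggressive; the reverse suggests they are too conservative. What share counts as "dominates" is a judgment call for each team.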

Reviewing a month of incident data once a quarter usually catches these patterns early. Revisiting your severity and priority definitions as your systems evolve keeps your routing rules aligned with how your team actually responds to incidents.


FAQs

How should routing rules handle a SEV-1 P4 where severity and priority seem to conflict?

A production database replica falling several hours behind is a good case. Severity is high because data loss is a real risk if the primary fails. But if a second healthy replica exists, priority might be P4 because there is no immediate customer impact. The matrix in this guide places SEV-1 P4 in the moderate tier, so it still gets same-day attention through a push notification or Slack rather than sitting in the low-priority queue.

Is it better to have monitoring tools set severity or let routing rules assign it?

Both work but they solve different problems. When the monitoring tool sets severity, you get consistency at the source. When routing rules assign severity, you get flexibility because the same incident can carry different severity depending on environment or customer tier. A good middle ground is to let the monitoring tool set a baseline and use routing rules to override when additional context in the payload justifies it.
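The middle ground described above can be sketched in a few lines of Python. The override conditions here reuse the environment and customer-tier examples from earlier in this guide, and the function name `resolve_severity` is hypothetical.

```python
def resolve_severity(baseline: str, details: dict) -> str:
    """Start from the monitoring tool's severity and override it only
    when payload context justifies a different label."""
    # Assumed override: nothing in staging warrants more than SEV-3.
    if details.get("env") == "staging":
        return "SEV-3"
    # Assumed override: enterprise customers with SLAs are treated as SEV-1.
    if details.get("customer_tier") == "enterprise":
        return "SEV-1"
    # No overriding context: trust the baseline from the monitoring tool.
    return baseline


if __name__ == "__main__":
    print(resolve_severity("SEV-2", {"env": "production", "customer_tier": "enterprise"}))
```

The order of the checks matters: in this sketch a staging incident stays SEV-3 even for an enterprise tenant, which may or may not match what your team wants.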

How should severity and priority work for scheduled tasks that fail outside of business hours?

A nightly ETL job or a scheduled backup that fails at 3 AM might look critical based on the service name alone. But if the job runs again at 6 AM and the failure is recoverable, waking someone up at 3 AM adds no value. A time-based condition paired with a frequency threshold can handle this well. If the job fails once overnight, a SEV-3 P5 classification with auto-acknowledge is probably enough. If it fails three times in a row, upgrading to SEV-2 P2 and paging the on-call responder is worth considering.
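The frequency threshold can be sketched as follows in Python. It assumes you keep a run history per job, oldest first, with `True` for success. The function names and the default threshold of three are illustrative, and treating two failures the same as one is an assumption, since the answer above only covers one failure and three in a row.

```python
from typing import Sequence, Tuple


def trailing_failures(history: Sequence[bool]) -> int:
    """Count consecutive failures at the end of a run history.

    history is ordered oldest first; True means the run succeeded.
    """
    count = 0
    for succeeded in reversed(history):
        if succeeded:
            break
        count += 1
    return count


def classify_job_failure(failures: int, threshold: int = 3) -> Tuple[str, str, str]:
    """Return (severity, priority, action) for a failed scheduled job."""
    if failures >= threshold:
        return ("SEV-2", "P2", "page the on-call responder")
    return ("SEV-3", "P5", "auto-acknowledge")


if __name__ == "__main__":
    runs = [True, False, False, False]
    print(classify_job_failure(trailing_failures(runs)))
```

Counting only trailing failures means a single successful run resets the streak, which matches the "three times in a row" wording.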
