{"id":351,"date":"2024-03-21T09:54:40","date_gmt":"2024-03-21T09:54:40","guid":{"rendered":"https:\/\/blog.spike.sh\/2024\/03\/21\/postmortem-incident-grouping\/"},"modified":"2025-06-05T20:18:13","modified_gmt":"2025-06-05T14:48:13","slug":"postmortem-incident-grouping","status":"publish","type":"post","link":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/","title":{"rendered":"Postmortem on Incorrect Incident Grouping"},"content":{"rendered":"\n<nav aria-label=\"Table of Contents\" class=\"wp-block-table-of-contents\"><ol><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#summary\">Summary<\/a><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#leadup\">Leadup<\/a><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#impact\">Impact<\/a><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#detection\">Detection<\/a><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#response\">Response<\/a><ol><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#putting-spike-to-good-use\">Putting Spike to good use<\/a><\/li><\/ol><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#recovery\">Recovery<\/a><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#lessons\">Lessons<\/a><\/li><li><a class=\"wp-block-table-of-contents__entry\" href=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#closing-note\">Closing note<\/a><\/li><\/ol><\/nav>\n\n\n\n<p class=\"wp-block-paragraph\">On March 14th, we encountered an incident involving incorrect 
grouping of different incidents. This postmortem shares the full details with all our users. At Spike, we are committed to alerting you when things go awry, so it\u2019s only fair that we are fully transparent about our own incidents.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"summary\">Summary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">From 12th March 2024, 11:00 PM UTC, incidents began to be grouped incorrectly, causing different incidents to be assigned the same public-facing IDs, known as <code>counterIds<\/code>. The issue surfaced when a user reported a discrepancy on our dashboard at 1 AM UTC on 13th March 2024. Within two hours, Damanpreet and Kaushik convened to tackle the issue and uncovered the strange grouping and the mix-up in <code>counterIds<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For instance, although an alert was sent for incident example-123, the dashboard displayed an earlier incident instead.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Our database stores every occurrence of an incident and distinguishes repeated occurrences with a boolean flag called <code>latest<\/code>. Normally, when an event recurs, the <code>latest<\/code> flag on the previous occurrence is set to false, and the new occurrence takes over the same public identifier. 
However, in the case our user reported, two occurrences were flagged as <code>latest<\/code> at the same time, which broke this assumption and caused the mix-up.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"602\" height=\"305\" data-attachment-id=\"623\" data-permalink=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/incident-latest-true-1\/\" data-orig-file=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Incident-latest-true-1-.png\" data-orig-size=\"602,305\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Incident-latest-true&amp;#8211;1-\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Incident-latest-true-1-.png\" src=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Incident-latest-true-1-.png\" alt=\"\" class=\"wp-image-623\" srcset=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Incident-latest-true-1-.png 602w, https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Incident-latest-true-1--300x152.png 300w, https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Incident-latest-true-1--600x305.png 600w\" sizes=\"auto, (max-width: 602px) 100vw, 602px\" \/><figcaption class=\"wp-element-caption\">Green indicates latest true, while mustard indicates latest false<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">After analyzing nearly 15 million records in our database, we found that 26 incidents across 7 customers were affected. 
Alerts for some of these incidents were clearly not dispatched, but a detailed investigation with our vendors is needed to establish the full scope. In the worst case, alerts for all 26 incidents may have failed to send.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This incident has been classified as SEV2 severity with P1 priority.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"leadup\">Leadup<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>latest<\/code> flag was introduced in the initial version to handle the grouping and reopening of incidents. In the past two weeks, we updated the grouping logic during regular maintenance. This introduced a bug that left the <code>latest<\/code> flag incorrectly set to true on previous occurrences.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This oversight had a domino effect, causing <code>counterIds<\/code> to be skipped or reused. Once one incident failed to update its flag, subsequent incidents were grouped under the wrong <code>counterIds<\/code>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"impact\">Impact<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Our post-incident analysis revealed that 26 incidents across 7 customers were affected. Alerts for these incidents <em>may<\/em> not have been sent.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"detection\">Detection<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">An issue was first brought to our attention by a user at 11 PM UTC on March 8th, 2024. We investigated promptly and applied a temporary fix that initially proved effective.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Realizing that a deeper analysis was needed, we turned to <strong>Warden<\/strong>, a service within Spike&#8217;s engines that detects anomalies in incident data. 
This helped us determine whether similar issues were affecting other users.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"response\">Response<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Damanpreet led the response and began the initial investigation. Within hours, the whole team joined to dig into the data and identify the root cause.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"putting-spike-to-good-use\">Putting Spike to good use<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">We set up <strong>Warden<\/strong> as a new service on Spike, with its alerts sent to Slack. This setup notifies us instantly whenever Warden spots similar patterns in newly triggered incidents.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Since March 13th, these alerts have been crucial. They flagged multiple affected incidents across a few users, so we could respond quickly. Thanks to Spike alerts and the detailed data from Warden, we quickly traced and exported the necessary logs.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"472\" data-attachment-id=\"624\" data-permalink=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/image\/\" data-orig-file=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/image.png\" data-orig-size=\"1614,744\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/image-1024x472.png\" src=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/image-1024x472.png\" alt=\"\" class=\"wp-image-624\" srcset=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/image-1024x472.png 1024w, https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/image-300x138.png 300w, https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/image-768x354.png 768w, https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/image-1536x708.png 1536w, https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/image-1200x553.png 1200w, https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/image.png 1614w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Spike triggered alerts on Slack when it detected affected incidents<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">By then, a pattern had begun to emerge, highlighting how effective the new setup was.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"recovery\">Recovery<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We began our recovery by removing the dependency on flags and cleaning up our incidents collection. We moved all past occurrences into a new <code>History<\/code> collection, leaving the main collection with only unique incidents.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>History<\/code> collection stores a complete snapshot of each occurrence as it happened, including details like priority, severity, and mute status. 
This makes our historical data easier to work with and learn from.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"602\" height=\"305\" data-attachment-id=\"625\" data-permalink=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/incident-clean-up-and-history-added-1\/\" data-orig-file=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Incident-clean-up-and-history-added-1-.png\" data-orig-size=\"602,305\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Incident-clean-up-and-history-added&amp;#8211;1-\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Incident-clean-up-and-history-added-1-.png\" src=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Incident-clean-up-and-history-added-1-.png\" alt=\"\" class=\"wp-image-625\" srcset=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Incident-clean-up-and-history-added-1-.png 602w, https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Incident-clean-up-and-history-added-1--300x152.png 300w, https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Incident-clean-up-and-history-added-1--600x305.png 600w\" sizes=\"auto, (max-width: 602px) 100vw, 602px\" \/><figcaption class=\"wp-element-caption\">Green shows unique incidents; beige shows the history of repeated occurrences<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Data cleanup ftw.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"lessons\">Lessons<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Some key lessons:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Keeping duplicate occurrences in one collection and telling them apart with flags is painful to maintain.<\/li>\n\n\n\n<li>Clean, organised data is underrated.<\/li>\n\n\n\n<li>Performance, data redundancy, and maintenance overhead need to be kept front of mind while scaling.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"closing-note\">Closing note<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Our past architectural decisions were well suited to our initial scale and supported our growth phase. We&#8217;ve been on a roll rebuilding and enhancing several services for greater scalability. Opting not to over-engineer early on was a good choice; it gave us flexibility and agility.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This evolution also reflects our commitment to continuous improvement and readiness for future challenges. Our apologies to all our users \ud83d\ude4f<\/p>\n","protected":false},"excerpt":{"rendered":"<p>On March 14th, we encountered an incident involving incorrect grouping of different incidents. Our postmortem has some extensive details for all our users. At Spike, we are committed to alerting you when things go awry so it\u2019s only fair we keep it absolutely transparent regarding any incidents. 
Summary From 12th March 2024, 11:00 PM UTC, [&hellip;]<\/p>\n","protected":false},"author":191914268,"featured_media":626,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_import_markdown_pro_load_document_selector":0,"_import_markdown_pro_submit_text_textarea":"","_lmt_disableupdate":"","_lmt_disable":"","_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"{title}\n\n{excerpt}\n\n{url}","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_wpas_customize_per_network":false,"jetpack_post_was_ever_published":false},"categories":[1444],"tags":[],"class_list":["post-351","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-postmortem"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Postmortem on Incorrect Incident Grouping<\/title>\n<meta name=\"description\" content=\"Postmortem on incorrect incident grouping at Spike: what went wrong, how it was resolved, and key lessons for better incident management.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Postmortem on Incorrect Incident Grouping\" \/>\n<meta property=\"og:description\" content=\"Postmortem on incorrect incident grouping at 
Spike: what went wrong, how it was resolved, and key lessons for better incident management.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/\" \/>\n<meta property=\"og:site_name\" content=\"Spike&#039;s blog\" \/>\n<meta property=\"article:published_time\" content=\"2024-03-21T09:54:40+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-06-05T14:48:13+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Postmortem-incident-groupring-v2-4-.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1288\" \/>\n\t<meta property=\"og:image:height\" content=\"728\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Kaushik\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Kaushik\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/\"},\"author\":{\"name\":\"Kaushik\",\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/#\\\/schema\\\/person\\\/b137e57ace218547f02b86fdcb2d0e64\"},\"headline\":\"Postmortem on Incorrect Incident 
Grouping\",\"datePublished\":\"2024-03-21T09:54:40+00:00\",\"dateModified\":\"2025-06-05T14:48:13+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/\"},\"wordCount\":740,\"commentCount\":0,\"image\":{\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/blog.spike.sh\\\/wp-content\\\/uploads\\\/2024\\\/03\\\/Postmortem-incident-groupring-v2-4-.png\",\"articleSection\":[\"Postmortem\"],\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/\",\"url\":\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/\",\"name\":\"Postmortem on Incorrect Incident Grouping\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/blog.spike.sh\\\/wp-content\\\/uploads\\\/2024\\\/03\\\/Postmortem-incident-groupring-v2-4-.png\",\"datePublished\":\"2024-03-21T09:54:40+00:00\",\"dateModified\":\"2025-06-05T14:48:13+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/#\\\/schema\\\/person\\\/b137e57ace218547f02b86fdcb2d0e64\"},\"description\":\"Postmortem on incorrect incident grouping at Spike: what went wrong, how it was resolved, and key lessons for better incident 
management.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/#breadcrumb\"},\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/#primaryimage\",\"url\":\"https:\\\/\\\/blog.spike.sh\\\/wp-content\\\/uploads\\\/2024\\\/03\\\/Postmortem-incident-groupring-v2-4-.png\",\"contentUrl\":\"https:\\\/\\\/blog.spike.sh\\\/wp-content\\\/uploads\\\/2024\\\/03\\\/Postmortem-incident-groupring-v2-4-.png\",\"width\":1288,\"height\":728},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/postmortem-incident-grouping\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/blog.spike.sh\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Postmortem on Incorrect Incident Grouping\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/#website\",\"url\":\"https:\\\/\\\/blog.spike.sh\\\/\",\"name\":\"Spike&#039;s blog\",\"description\":\"Learnings and opinions in a changing 
world\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/blog.spike.sh\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-GB\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/blog.spike.sh\\\/#\\\/schema\\\/person\\\/b137e57ace218547f02b86fdcb2d0e64\",\"name\":\"Kaushik\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/c7ec6b633161978fc09ed325cefde9061797a65a730e4b98c0eb26bc6925bc81?s=96&d=robohash&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/c7ec6b633161978fc09ed325cefde9061797a65a730e4b98c0eb26bc6925bc81?s=96&d=robohash&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/c7ec6b633161978fc09ed325cefde9061797a65a730e4b98c0eb26bc6925bc81?s=96&d=robohash&r=g\",\"caption\":\"Kaushik\"},\"description\":\"Founder of Spike. I like sharing how we are building Spike and the intricacies of building a startup by waking people up for critical incidents.\",\"url\":\"https:\\\/\\\/blog.spike.sh\\\/author\\\/spikehq\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Postmortem on Incorrect Incident Grouping","description":"Postmortem on incorrect incident grouping at Spike: what went wrong, how it was resolved, and key lessons for better incident management.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/","og_locale":"en_GB","og_type":"article","og_title":"Postmortem on Incorrect Incident Grouping","og_description":"Postmortem on incorrect incident grouping at Spike: what went wrong, how it was resolved, and key lessons for better incident management.","og_url":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/","og_site_name":"Spike&#039;s blog","article_published_time":"2024-03-21T09:54:40+00:00","article_modified_time":"2025-06-05T14:48:13+00:00","og_image":[{"width":1288,"height":728,"url":"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Postmortem-incident-groupring-v2-4-.png","type":"image\/png"}],"author":"Kaushik","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Kaushik","Estimated reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#article","isPartOf":{"@id":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/"},"author":{"name":"Kaushik","@id":"https:\/\/blog.spike.sh\/#\/schema\/person\/b137e57ace218547f02b86fdcb2d0e64"},"headline":"Postmortem on Incorrect Incident 
Grouping","datePublished":"2024-03-21T09:54:40+00:00","dateModified":"2025-06-05T14:48:13+00:00","mainEntityOfPage":{"@id":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/"},"wordCount":740,"commentCount":0,"image":{"@id":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#primaryimage"},"thumbnailUrl":"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Postmortem-incident-groupring-v2-4-.png","articleSection":["Postmortem"],"inLanguage":"en-GB","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/","url":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/","name":"Postmortem on Incorrect Incident Grouping","isPartOf":{"@id":"https:\/\/blog.spike.sh\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#primaryimage"},"image":{"@id":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#primaryimage"},"thumbnailUrl":"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Postmortem-incident-groupring-v2-4-.png","datePublished":"2024-03-21T09:54:40+00:00","dateModified":"2025-06-05T14:48:13+00:00","author":{"@id":"https:\/\/blog.spike.sh\/#\/schema\/person\/b137e57ace218547f02b86fdcb2d0e64"},"description":"Postmortem on incorrect incident grouping at Spike: what went wrong, how it was resolved, and key lessons for better incident 
management.","breadcrumb":{"@id":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.spike.sh\/postmortem-incident-grouping\/"]}]},{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#primaryimage","url":"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Postmortem-incident-groupring-v2-4-.png","contentUrl":"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Postmortem-incident-groupring-v2-4-.png","width":1288,"height":728},{"@type":"BreadcrumbList","@id":"https:\/\/blog.spike.sh\/postmortem-incident-grouping\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/blog.spike.sh\/"},{"@type":"ListItem","position":2,"name":"Postmortem on Incorrect Incident Grouping"}]},{"@type":"WebSite","@id":"https:\/\/blog.spike.sh\/#website","url":"https:\/\/blog.spike.sh\/","name":"Spike&#039;s blog","description":"Learnings and opinions in a changing world","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.spike.sh\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-GB"},{"@type":"Person","@id":"https:\/\/blog.spike.sh\/#\/schema\/person\/b137e57ace218547f02b86fdcb2d0e64","name":"Kaushik","image":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/secure.gravatar.com\/avatar\/c7ec6b633161978fc09ed325cefde9061797a65a730e4b98c0eb26bc6925bc81?s=96&d=robohash&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/c7ec6b633161978fc09ed325cefde9061797a65a730e4b98c0eb26bc6925bc81?s=96&d=robohash&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/c7ec6b633161978fc09ed325cefde9061797a65a730e4b98c0eb26bc6925bc81?s=96&d=robohash&r=g","caption":"Kaushik"},"description":"Founder of Spike. 
I like sharing how we are building Spike and the intricacies of building a startup by waking people up for critical incidents.","url":"https:\/\/blog.spike.sh\/author\/spikehq\/"}]}},"modified_by":"Sreekar","jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/blog.spike.sh\/wp-content\/uploads\/2024\/03\/Postmortem-incident-groupring-v2-4-.png","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pfMe4Q-5F","jetpack-related-posts":[{"id":4561,"url":"https:\/\/blog.spike.sh\/postmortem-on-datadog-incidents-not-autoresolving\/","url_meta":{"origin":351,"position":0},"title":"Postmortem on Datadog incidents not auto-resolving","author":"Damanpreet","date":"18th December, 2025","format":false,"excerpt":"On December 17th, 2025, we found that Datadog incidents weren't auto-resolving due to an issue in our Incident Grouping logic. We resolved the issue, and now Datadog incidents are auto-resolved as expected. This postmortem details the incident timeline, root cause analysis, and lessons learned.","rel":"","context":"In &quot;Postmortem&quot;","block_context":{"text":"Postmortem","link":"https:\/\/blog.spike.sh\/category\/postmortem\/"},"img":{"alt_text":"Blog cover titled \"Postmortem on Datadog incidents not auto-resolving\"","src":"https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/12\/background-47.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/12\/background-47.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/12\/background-47.png?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/12\/background-47.png?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":4457,"url":"https:\/\/blog.spike.sh\/incident-postmortem\/","url_meta":{"origin":351,"position":1},"title":"Incident Postmortem: How to Learn From Failures and Build Reliable Systems","author":"Samyati 
Mohanty","date":"27th November, 2025","format":false,"excerpt":"Incident postmortems help teams learn from outages without blame. This guide explains what they are, how to run them well, and how they strengthen reliability and continuous improvement.","rel":"","context":"In &quot;Uncategorized&quot;","block_context":{"text":"Uncategorized","link":"https:\/\/blog.spike.sh\/category\/uncategorised\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/Getting-started-with-Incident-Management-1.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/Getting-started-with-Incident-Management-1.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/Getting-started-with-Incident-Management-1.png?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/Getting-started-with-Incident-Management-1.png?resize=700%2C400&ssl=1 2x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/Getting-started-with-Incident-Management-1.png?resize=1050%2C600&ssl=1 3x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/Getting-started-with-Incident-Management-1.png?resize=1400%2C800&ssl=1 4x"},"classes":[]},{"id":4354,"url":"https:\/\/blog.spike.sh\/how-to-run-blameless-postmortem\/","url_meta":{"origin":351,"position":2},"title":"How to Conduct a Blameless Postmortem","author":"Randhir Kumar","date":"20th November, 2025","format":false,"excerpt":"Incidents happen. A blameless postmortem is how your team learns from them without finger-pointing. 
This blog explains how to run an effective postmortem and build a resilient engineering culture.","rel":"","context":"In &quot;Post Incident&quot;","block_context":{"text":"Post Incident","link":"https:\/\/blog.spike.sh\/category\/incident-management\/post-incident\/"},"img":{"alt_text":"Blog cover titled \"How to Conduct a Blameless Postmortem\"","src":"https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/The-Top-10-On-Call-Management-Tools-for-DevOps.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/The-Top-10-On-Call-Management-Tools-for-DevOps.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/The-Top-10-On-Call-Management-Tools-for-DevOps.png?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/The-Top-10-On-Call-Management-Tools-for-DevOps.png?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":4515,"url":"https:\/\/blog.spike.sh\/postmortem-on-call-system-discrepancy\/","url_meta":{"origin":351,"position":3},"title":"Postmortem of On-Call System Discrepancy","author":"Damanpreet","date":"4th December, 2025","format":false,"excerpt":"On December 4th, 2025, we identified a critical discrepancy between displayed on-call schedules and actual alert routing for weekly rotations. 
This postmortem details how a recent bug fix exposed months of underlying system misalignment, our investigation process, and key lessons.","rel":"","context":"In &quot;Postmortem&quot;","block_context":{"text":"Postmortem","link":"https:\/\/blog.spike.sh\/category\/postmortem\/"},"img":{"alt_text":"blog cover postmortem of on-call system discrepancy","src":"https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/12\/Postmoretem.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/12\/Postmoretem.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/12\/Postmoretem.png?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/12\/Postmoretem.png?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":329,"url":"https:\/\/blog.spike.sh\/postmortem-of-our-dashboards-outage\/","url_meta":{"origin":351,"position":4},"title":"Postmortem of Our Dashboard&#8217;s Outage","author":"Kaushik","date":"31st May, 2023","format":false,"excerpt":"Postmortem of Dashboard's outage on 30th May 2023. 
Incidents, Alerts, Escalations, API, and Status Page were NOT impacted.","rel":"","context":"In &quot;Postmortem&quot;","block_context":{"text":"Postmortem","link":"https:\/\/blog.spike.sh\/category\/postmortem\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2023\/05\/postmorem-31st-may.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2023\/05\/postmorem-31st-may.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2023\/05\/postmorem-31st-may.png?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2023\/05\/postmorem-31st-may.png?resize=700%2C400&ssl=1 2x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2023\/05\/postmorem-31st-may.png?resize=1050%2C600&ssl=1 3x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2023\/05\/postmorem-31st-may.png?resize=1400%2C800&ssl=1 4x"},"classes":[]},{"id":4153,"url":"https:\/\/blog.spike.sh\/jsm-alternatives-for-incident-response\/","url_meta":{"origin":351,"position":5},"title":"Jira Service Management (JSM) Alternatives for Incident Response (2026)","author":"Sreekar","date":"12th November, 2025","format":false,"excerpt":"Don't just default to JSM after OpsGenie. 
This post offers a detailed review of 5 leading Jira Service Management (JSM) Alternatives for incident response, complete with a feature checklist to guide your decision.","rel":"","context":"In &quot;JSM&quot;","block_context":{"text":"JSM","link":"https:\/\/blog.spike.sh\/category\/comparison\/jsm\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/background-44-2.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/background-44-2.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/background-44-2.png?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.spike.sh\/wp-content\/uploads\/2025\/11\/background-44-2.png?resize=700%2C400&ssl=1 2x"},"classes":[]}],"_links":{"self":[{"href":"https:\/\/blog.spike.sh\/wp-json\/wp\/v2\/posts\/351","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.spike.sh\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.spike.sh\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.spike.sh\/wp-json\/wp\/v2\/users\/191914268"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.spike.sh\/wp-json\/wp\/v2\/comments?post=351"}],"version-history":[{"count":3,"href":"https:\/\/blog.spike.sh\/wp-json\/wp\/v2\/posts\/351\/revisions"}],"predecessor-version":[{"id":1746,"href":"https:\/\/blog.spike.sh\/wp-json\/wp\/v2\/posts\/351\/revisions\/1746"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.spike.sh\/wp-json\/wp\/v2\/media\/626"}],"wp:attachment":[{"href":"https:\/\/blog.spike.sh\/wp-json\/wp\/v2\/media?parent=351"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.spike.sh\/wp-json\/wp\/v2\/categories?post=351"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.spike.sh\/wp-json\/wp\/v2\/tags?post=351"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}