What is the difference between an incident and a problem?

An incident is an unexpected service disruption requiring immediate action to limit impact. A problem is the underlying cause that creates one or more incidents. Incident management focuses on restoring the service as quickly as possible. Problem management focuses on identifying and eliminating the root cause to prevent recurrence.

How quickly should I respond to a P1 incident?

For a P1 incident (complete outage), the Incident Commander should be activated within five minutes and first diagnosis should start within fifteen minutes. Initial stakeholder communication follows within thirty minutes. These timelines must be predefined and practised so there is no debate during the incident.

What is a blameless post-mortem?

A blameless post-mortem is an incident analysis that focuses on systems, processes and circumstances rather than individual blame. The premise is that people act based on the information available to them at the time. By removing blame you encourage honest communication and discover the actual system weaknesses that need improvement.

How do I define incident severity levels?

Base severity on two factors: the scope of impact (how many users are affected) and the severity of impact (what can they no longer do). P1 is complete outage for all users, P2 is significant functionality broken for a large group, P3 is limited impact with workaround available and P4 is cosmetic or informational.
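The two factors above can be turned into a small classification helper. The following is a minimal sketch, not a definitive rule set: the `Impact` categories, the `classify` function, the 25% "large group" threshold and the rule that a serious issue without a workaround escalates to P2 are all assumptions made for illustration. Replace them with the criteria your team agrees on.

```python
from enum import Enum


class Impact(Enum):
    """Severity of impact: what users can no longer do (hypothetical categories)."""
    OUTAGE = "outage"      # the service is unusable
    MAJOR = "major"        # significant functionality is broken
    MINOR = "minor"        # limited functional impact
    COSMETIC = "cosmetic"  # cosmetic or informational only


def classify(affected_fraction: float, impact: Impact,
             workaround_available: bool,
             large_group_threshold: float = 0.25) -> str:
    """Map scope and impact to a P1-P4 label.

    affected_fraction is the share of users affected (0.0 to 1.0).
    large_group_threshold is an assumed cut-off for "a large group".
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    if impact is Impact.COSMETIC:
        return "P4"
    if impact is Impact.OUTAGE and affected_fraction >= 1.0:
        return "P1"  # complete outage for all users
    serious = impact in (Impact.OUTAGE, Impact.MAJOR)
    # Assumption: serious impact without a workaround is escalated to P2.
    if serious and (affected_fraction >= large_group_threshold
                    or not workaround_available):
        return "P2"
    return "P3"  # limited impact, typically with a workaround


print(classify(1.0, Impact.OUTAGE, False))  # P1
print(classify(0.4, Impact.MAJOR, True))    # P2
print(classify(0.05, Impact.MAJOR, True))   # P3
```

Encoding the criteria this way forces the team to settle borderline cases in advance, which is the point of defining severity levels before an incident happens.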

Should every incident get a post-mortem?

A full post-mortem is mandatory for P1 and P2 incidents. For P3 incidents a brief note with root cause and action items suffices. P4 incidents are treated as regular bugs. The goal is to learn from significant incidents without overwhelming the team with paperwork for every minor issue.

How do I practise incident response with my team?

Organise monthly or quarterly game days: simulate a realistic incident (database crash, deployment failure, DDoS attack) and have the team walk through the incident response procedure. Use chaos engineering tools like Gremlin or LitmusChaos to introduce controlled failures. After each exercise, evaluate what went well and what did not, then update the process accordingly.

Which metrics should I track for incident management?

The four core metrics are: MTTD (Mean Time to Detect, how quickly you detect an incident), MTTR (Mean Time to Recover, how quickly you restore service), incident frequency per category (identifies problematic areas) and the percentage of incidents with a completed post-mortem. Track these metrics monthly and set improvement targets.
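As an illustration, the four metrics can be computed from a simple list of incident records. This is a sketch under stated assumptions: the `Incident` fields and the `summarise` function are hypothetical, and MTTR is measured here from detection to recovery (some teams measure it from the start of the fault instead, so pick one definition and keep it consistent).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    category: str
    started_at: datetime    # when the fault actually began
    detected_at: datetime   # when monitoring or a person noticed it
    recovered_at: datetime  # when service was restored
    postmortem_done: bool


def summarise(incidents: list[Incident]) -> dict:
    """Return MTTD, MTTR (both in minutes), frequency per category
    and the percentage of incidents with a completed post-mortem."""
    if not incidents:
        raise ValueError("need at least one incident")
    detect = [(i.detected_at - i.started_at).total_seconds() for i in incidents]
    # Assumption: recovery time is counted from detection, not from fault start.
    recover = [(i.recovered_at - i.detected_at).total_seconds() for i in incidents]
    done = sum(1 for i in incidents if i.postmortem_done)
    return {
        "mttd_minutes": mean(detect) / 60,
        "mttr_minutes": mean(recover) / 60,
        "frequency": Counter(i.category for i in incidents),
        "postmortem_pct": 100 * done / len(incidents),
    }


incidents = [
    Incident("database", datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 10),
             datetime(2024, 1, 5, 11, 10), True),
    Incident("deployment", datetime(2024, 1, 9, 12, 0), datetime(2024, 1, 9, 12, 20),
             datetime(2024, 1, 9, 12, 50), False),
]
print(summarise(incidents))
```

Running this monthly over the incident archive gives the trend lines needed to set improvement targets.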

How quickly should I respond to a production incident?

That depends on severity. For critical incidents that directly affect users, aim for an initial response within fifteen minutes. For serious incidents with limited impact, thirty minutes is acceptable. For low-priority incidents, one hour is a reasonable threshold. Define these response times per severity level in advance and ensure the on-call team is always reachable within the agreed timeframes.
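The response-time thresholds above can be recorded as data so that they are checked rather than remembered. A minimal sketch follows; the severity labels `critical`, `serious` and `low`, the `RESPONSE_TARGETS` table and the `response_within_target` function are illustrative names, not part of any particular on-call tool.

```python
from datetime import datetime, timedelta

# Targets taken from the text above; the keys are hypothetical labels.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=15),
    "serious": timedelta(minutes=30),
    "low": timedelta(hours=1),
}


def response_within_target(severity: str, alerted_at: datetime,
                           acknowledged_at: datetime) -> bool:
    """Return True if the first response met the target for this severity."""
    if severity not in RESPONSE_TARGETS:
        raise ValueError(f"unknown severity: {severity}")
    return acknowledged_at - alerted_at <= RESPONSE_TARGETS[severity]


alert = datetime(2024, 3, 1, 9, 0)
print(response_within_target("critical", alert, datetime(2024, 3, 1, 9, 12)))  # True
print(response_within_target("critical", alert, datetime(2024, 3, 1, 9, 20)))  # False
```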

Incident Response Template - Free Download & Example

Respond to production incidents with structure. Incident response template with escalation matrix, communication protocol, root cause analysis and post-mortem framework.

When a production incident strikes, every minute counts. Teams without a predefined incident response plan waste precious time figuring out who does what, how to communicate and which steps to take in which order.

This template provides a complete structure for managing incidents from detection to post-mortem: incident classification based on severity and impact (P1 through P4); an escalation matrix with roles (Incident Commander, Technical Lead, Communications Lead), responsibilities and contact details; a communication protocol for internal and external stakeholders, including status update templates and frequency; a step-by-step triage and diagnosis process; a mitigation and recovery plan with rollback options and feature flag procedures; and a post-mortem framework with blameless analysis, timeline reconstruction, root cause identification and preventive action items.

The template includes a section for documenting lessons learned and tracking improvement metrics such as MTTD (Mean Time to Detect), MTTR (Mean Time to Recover) and incident frequency per category. There is space for recording runbooks per critical service: step-by-step diagnosis instructions that enable any on-call engineer to assess an incident, even when they did not build the system. Over time the resulting incident history reveals patterns and enables proactive improvements before problems recur. Finally, the template provides guidelines for communicating with end users during an incident, including status page update templates that come across as professional and transparent without causing unnecessary panic.

Variations

SaaS Incident Response Plan

Incident response plan specifically for SaaS platforms covering multi-tenant impact, status page communication, SLA impact calculation, customer notification workflow and data breach specific procedures in compliance with GDPR requirements.

Best for: SaaS companies with contractual uptime obligations that must communicate quickly and transparently to hundreds or thousands of customers during incidents.

DevOps On-Call Runbook

Operational handbook for on-call engineers with step-by-step diagnosis procedures per alert type, dashboards and log sources per service, known issues with workarounds and escalation criteria for when help is needed.

Best for: Teams with an on-call rotation that need to act quickly on alerts, especially when the on-call engineer did not build the affected service.

Security Incident Response

Specialised incident response plan for security incidents: data breaches, unauthorised access, malware and DDoS attacks. Includes forensic preservation procedures, notification checklists (GDPR, NIS2) and communication with authorities.

Best for: Organisations processing personal data that must comply with GDPR notification requirements, and any team managing security incidents where legal and compliance aspects play a role.

Blameless Post-Mortem Template

Structured post-mortem format focused on learning rather than blame: incident timeline, impact description, root cause analysis with Five Whys or Ishikawa diagram, what went well, what can be improved and concrete action items with owner and deadline.

Best for: Any organisation that wants to use incidents as learning moments and build a culture where transparency and improvement matter more than assigning blame.

Incident Communication Playbook

Template specifically for the communication aspects of incidents: internal Slack messages, customer emails, status page updates, social media responses and press statements. Includes a communication timeline and example texts for each incident severity level.

Best for: Teams where communication during incidents is just as important as the technical fix, especially for customer-facing products where trust is at stake.

How to use

Step 1: Download the incident response template and adapt it to your organisation, technical infrastructure and team composition. Define incident severity levels (P1 through P4) with concrete criteria so everyone speaks the same language when classifying incidents. P1 is complete outage for all users, P4 is a cosmetic issue without functional impact.

Step 2: Establish the escalation matrix with three core roles: the Incident Commander (coordinates the incident, makes decisions), the Technical Lead (performs diagnostics and recovery) and the Communications Lead (informs stakeholders and customers). Fill in the primary and secondary contact person per role, including phone number and availability.

Step 3: Define the communication protocol per severity level. For P1 incidents send an immediate internal notification to the entire team and start the status page update. For P2 inform the team and direct stakeholders. For P3 and P4 a ticket in the backlog suffices. Determine update frequency: every 15 minutes for P1, every hour for P2.

Step 4: Document the triage procedure: how do you identify the affected systems, which dashboards and logs do you consult first, how do you confirm the impact scope and how do you isolate the problem? Create a specific diagnosis path per critical service with the most common failure modes.

Step 5: Describe mitigation options per scenario: rollback to previous version, disable feature flag, redirect traffic to fallback, database recovery, cache invalidation. Define in advance when each option is appropriate so you do not lose time on decision-making during an incident.

Step 6: After resolving any significant incident, schedule a blameless post-mortem within 48 hours. Reconstruct the timeline from detection to recovery, identify the root cause via Five Whys analysis, document the actions taken, what went well and what can improve, and define concrete action items with owner and deadline that should prevent recurrence.

Step 7: Archive the post-mortem report and share it with the entire team. Review incident metrics monthly (MTTD, MTTR, incident frequency) to identify trends and implement structural improvements.

Step 8: Prepare a communication template for external status updates that you can publish on your status page or via email. Draft three variants: an initial message at incident start, an interim update with progress and a closing message upon recovery.
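The update frequencies and the three status message variants described in the steps above could be prepared ahead of time along these lines. This is a sketch only: the message wording, the `UPDATE_INTERVAL` and `TEMPLATES` tables and the `render_update` function are hypothetical examples, and the service name is a placeholder.

```python
from datetime import datetime, timedelta
from string import Template

# Update frequency from the steps above: every 15 minutes for P1, hourly for P2.
UPDATE_INTERVAL = {"P1": timedelta(minutes=15), "P2": timedelta(hours=1)}

# Example wording only; adapt the tone and details to your own status page.
TEMPLATES = {
    "initial": Template(
        "We are investigating an issue affecting $service. "
        "Next update by $next_update UTC."),
    "interim": Template(
        "We have identified the cause of the issue affecting $service and are "
        "working on a fix. Next update by $next_update UTC."),
    "closing": Template(
        "The issue affecting $service has been resolved. "
        "We will share a summary once our review is complete."),
}


def render_update(stage: str, severity: str, service: str, now: datetime) -> str:
    """Render a status update and promise the next one per the agreed interval."""
    interval = UPDATE_INTERVAL.get(severity)
    if interval is None:
        # P3 and P4 are handled through a backlog ticket, not status updates.
        raise ValueError(f"no status updates defined for {severity}")
    next_update = (now + interval).strftime("%H:%M")
    return TEMPLATES[stage].substitute(service=service, next_update=next_update)


print(render_update("initial", "P1", "checkout", datetime(2024, 1, 1, 10, 0)))
```

Keeping the wording in templates means the Communications Lead only fills in the service name during an incident instead of drafting text under pressure.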

How MG Software can help

MG Software helps teams set up a complete incident management process: from configuring monitoring and alerting (Datadog, Sentry, PagerDuty) to training teams in incident response procedures. We facilitate post-mortem sessions, help identify structural improvement points and advise on SRE best practices that measurably improve your platform reliability. Our experience with production incidents across diverse clients enables us to organise realistic scenario exercises.

We help set up status page communication so your end users stay informed about recovery progress during an incident. We advise on structuring on-call rotations that are sustainable for your team, including compensation arrangements and escalation paths that prevent any single engineer from becoming overloaded. Our team has experience building runbooks per critical service, so that engineers who did not build a system can still diagnose and resolve its incidents effectively. We also offer quarterly reviews in which we jointly analyse incident metrics and prioritise concrete improvement initiatives.

Frequently asked questions

Want this template implemented now?

We set it up for you, production-ready and tailored to your brand and workflow.

Request a quote

Deployment Checklist Template - Free Download & Example

Never miss a step during production releases. Deployment checklist with pre-flight checks, rollback plan, monitoring setup, canary procedures and post-deployment verification.

Stakeholder Report Template - Free Download & Example

Keep stakeholders effectively informed about project progress. Stakeholder report template with progress overview, risk matrix, budget status and timeline.

Security Audit Template - Free Download & Example

Identify vulnerabilities before attackers do. Security audit template with OWASP Top 10 checklist, penetration test scope and remediation planning.

Security Scanners That Catch Vulnerabilities Before Production

Dependency vulnerabilities are the fastest path to a breach. We evaluated 6 security scanning tools on detection speed, false positives, and CI integration.

From our blog

NIS2 and the Dutch Cybersecurity Act: What It Changes for Software Suppliers

Sidney de Geus · 13 min read

OpenClaw: The Open-Source AI Assistant That Took Over GitHub in Weeks

Sidney · 8 min read

OpenAI Codex Security: AI-Powered Vulnerability Scanning That Found 11,000 Critical Bugs in Beta

Sidney · 7 min read
