DevOps · 7 min read · March 3, 2026

Incident Response for Small Teams: Runbooks, Alerts, and Post-Mortems

Build an incident response process that works for small engineering teams — on-call rotations, runbooks, communication templates, and post-mortems that prevent recurrence.

James Ross Jr.


Strategic Systems Architect & Enterprise Software Developer


The first production incident most small teams experience goes roughly the same way. Something breaks. Someone notices from a user complaint or a red alert. Everyone piles into a Slack channel. Three people are making changes simultaneously with no coordination. One person is trying to diagnose while another person is already reverting a deployment. Communication with affected users is inconsistent or nonexistent. The incident resolves through a combination of correct action and luck, and nobody is sure which. Forty-five minutes of chaos produces a five-minute fix.

This does not have to be how it goes. Even a three-person team can have a functional incident response process. Here is how I think about it.

Define What an Incident Is

Before you can respond to incidents consistently, you need a shared definition of what counts as an incident. Not every alert is an incident. Not every user complaint requires incident declaration.

A simple severity taxonomy:

SEV-1 (Critical) — complete service outage, data loss or corruption, security breach, payment processing failure. All hands. Immediate response. Customer communication within 15 minutes.

SEV-2 (Major) — significant functionality degraded, key user flows broken but workarounds exist, performance degraded enough to affect user experience measurably. On-call engineer responds within 30 minutes.

SEV-3 (Minor) — minor feature broken, cosmetic issues, single-user issues. Normal ticket queue, addressed in next sprint if not blocking.

Having this defined in advance removes the debate during an active incident about whether something is "serious enough." Someone declares a SEV-1 or SEV-2 and the response process kicks in automatically.
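If you want declaration to be mechanical rather than debated, the taxonomy can live as data your tooling reads. A minimal sketch in Python — the names and numbers mirror the definitions above, but the structure itself is illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Severity:
    description: str
    response_minutes: Optional[int]  # time to first response; None = normal ticket queue
    pages_oncall: bool               # does this severity page outside business hours?

# Mirrors the taxonomy above; tune the numbers to your team.
SEVERITIES = {
    "SEV-1": Severity("Complete outage, data loss, security breach", 15, True),
    "SEV-2": Severity("Key flows degraded, workarounds exist", 30, True),
    "SEV-3": Severity("Minor or cosmetic issue", None, False),
}

def should_page(sev: str) -> bool:
    """SEV-1 and SEV-2 page the on-call engineer; SEV-3 goes to the queue."""
    return SEVERITIES[sev].pages_oncall
```

With the table in code, your alert router can look up response expectations instead of anyone arguing them mid-incident.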

The On-Call Rotation

For a team of any size, someone needs to be designated as the person who responds to alerts outside business hours. That responsibility needs to rotate so it does not fall permanently on one person.

PagerDuty, Opsgenie, and Grafana OnCall all handle on-call scheduling and alert routing. Opsgenie's free tier covers small teams. Grafana OnCall is open source and integrates with the Grafana monitoring stack.

Configure your alerting tool to page the on-call person for SEV-1 and SEV-2 alerts. A SEV-1 should page immediately and escalate to a second person after 5 minutes of no acknowledgment. The escalation ensures alerts do not go unacknowledged because the on-call person is unavailable.
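The escalation rule reduces to a small policy: page the primary, and add the secondary if the alert goes unacknowledged past the timeout. A sketch of that logic in Python — the function name and arguments are mine, not any vendor's API, and in practice your paging tool implements this for you:

```python
def who_gets_paged(ack_delay_s, primary, secondary, escalate_after_s=300):
    """Who ends up paged for a SEV-1, given how many seconds the primary
    took to acknowledge (None means they never acknowledged).
    After 5 minutes (300 s) without an ack, the secondary is paged too."""
    paged = [primary]
    if ack_delay_s is None or ack_delay_s > escalate_after_s:
        paged.append(secondary)
    return paged
```

`who_gets_paged(60, "ana", "ben")` pages only the primary; `who_gets_paged(None, "ana", "ben")` pages both.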

Set expectations explicitly: on-call means you respond to SEV-1 alerts within 15 minutes, 24/7, during your rotation. Off-call means you are not responsible for after-hours alerts. Rotate weekly, and compensate the on-call burden explicitly (comp time, pay, whatever fits your team). Unacknowledged on-call burden causes burnout.
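A weekly rotation does not need a tool to be predictable. As a sketch, the ISO week number modulo the team size gives a deterministic schedule anyone can compute for any date (the function name and team list are illustrative):

```python
import datetime

def oncall_for(date: datetime.date, engineers: list) -> str:
    """Weekly rotation: the ISO week number picks the engineer, so the
    assignment flips every Monday and repeats every len(engineers) weeks."""
    week = date.isocalendar()[1]
    return engineers[week % len(engineers)]

team = ["ana", "ben", "caro"]
```

Because the schedule is a pure function of the date, "who is on call next month" never requires checking a dashboard.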

Runbooks: The Playbooks for Common Incidents

A runbook is a documented procedure for a specific operational scenario. When an on-call engineer gets paged at 2am, a runbook means they can follow a tested procedure rather than improvising from first principles while half-asleep.

Runbooks do not need to be elaborate. A runbook for "API high error rate" might be:


Runbook: API High Error Rate (SEV-2)

Trigger: Error rate > 2% for 5+ minutes

Diagnosis:

  1. Check recent deployments: https://github.com/myorg/api/deployments
  2. Check error distribution in logs: Axiom query level:error | count() by statusCode (last 15 minutes)
  3. Check database connection pool: SELECT count(*) FROM pg_stat_activity WHERE state = 'active'
  4. Check external API status pages: Stripe (status.stripe.com), SendGrid (status.sendgrid.com)

Common causes and resolution:

  • Recent deployment broke something → roll back: kubectl rollout undo deployment/api -n production
  • Database connection pool exhausted → restart API pods: kubectl rollout restart deployment/api -n production
  • Third-party API down → check if error is isolated to those endpoints, communicate to users if so
  • Increased traffic causing overload → scale up replicas: kubectl scale deployment api --replicas=5 -n production

Escalation: If not resolved in 30 minutes, escalate to Engineering Lead.

Communication: Post status update in #status Slack channel every 15 minutes.


This runbook is minimal but complete. An engineer who has never seen this specific failure before can follow it and resolve the most common causes. Build runbooks for your most common alert scenarios. Update them when incidents reveal gaps.

Communication During an Incident

Poor communication during incidents erodes user trust faster than the outage itself. Users who know what is happening and when to expect resolution are more forgiving than users who get no information.

Designate a communicator role in SEV-1 incidents — one person whose job during the incident is customer communication, not technical diagnosis. They post to your status page, respond to support tickets, and update the #status channel. Technical engineers focus on resolution without context-switching to communication.

Your status page should be on a separate hosting provider from your application. If your application is down, your status page still needs to be up. Statuspage.io, BetterUptime, and Instatus all provide externally hosted status pages with automatic incident posting.

Communication cadence: acknowledge within 15 minutes of declaration, update every 30 minutes until resolved, and post a resolution update when the incident ends.
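Keeping the cadence is easier when posting an update is one function call. A hedged sketch that sends updates to a Slack channel via an incoming webhook — the webhook URL and message wording are assumptions, though Slack's incoming webhooks do accept a JSON body with a `text` field:

```python
import json
import urllib.request

def build_status_text(severity, title, status, next_update_min=30):
    """Compose the update message; kept as a pure function so it is easy to test."""
    return (f"[{severity}] {title}: {status}. "
            f"Next update within {next_update_min} minutes.")

def post_status_update(webhook_url, severity, title, status):
    """Send the update to Slack via an incoming webhook (hypothetical URL)."""
    payload = json.dumps({"text": build_status_text(severity, title, status)})
    req = urllib.request.Request(
        webhook_url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

The same builder can feed your status page provider's API, so the wording stays consistent across channels.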

Template for acknowledgment:

We are aware of an issue affecting [specific feature/service]. Our engineering team is investigating. We will provide an update within 30 minutes.

Template for resolution:

The issue affecting [specific feature/service] has been resolved as of [time]. Affected users experienced [brief description]. We will follow up with a full post-mortem within 48 hours.

The Post-Mortem

The post-mortem is the most important part of incident response, and the most skipped. If you fix the immediate problem and never understand why it happened, you will fix the same problem again.

Post-mortems are blameless. The goal is to understand systemic failures, not to assign fault to individuals. An engineer who made a mistake that contributed to an incident is someone who was working under conditions that allowed that mistake to reach production. The system failed, not the person.

Post-mortem structure:

Timeline — a chronological record of what happened, when, and who took what action. Build this during or immediately after the incident while memory is fresh.

Root cause — what actually caused the incident? Keep asking "why" until you reach a root cause. "The API was returning errors" is a symptom. "The database connection pool was exhausted" is closer. "We deployed without updating connection pool limits to match the new traffic pattern" is a root cause.

Contributing factors — what conditions made this incident worse or harder to detect? Missing monitoring, unclear runbooks, confusing deployment process.

Action items — specific, assignable tasks that reduce the likelihood or impact of this incident recurring. Each action item has an owner and a deadline. Without this, the post-mortem is documentation, not prevention.
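Put together, a post-mortem fits on one page. A minimal template following the structure above, in the same plain-text style as the runbook (the timeline entries are illustrative placeholders):

```
Post-Mortem: <incident title>  (SEV-?, <date>)

Timeline:
  14:02  Error-rate alert fired, paged on-call
  14:05  On-call acknowledged, began runbook
  <continue chronologically through resolution>

Root cause:
  <the underlying cause, not the symptom>

Contributing factors:
  - <missing monitoring, unclear runbook, confusing deploy process>

Action items:
  - [ ] <task>  (owner: <name>, due: <date>)
```

Fill it in within 48 hours, while the timeline is still fresh in everyone's memory.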

Building the Process Incrementally

You do not need all of this on day one. Build the process incrementally as you encounter incidents.

First incident: write down what happened. That is your first post-mortem. Second incident: designate who is on-call this week. Third incident: write your first runbook for the most common alert. By the time you have had five incidents, you have the skeleton of a real incident response process.

The teams with the best incident response processes got there by taking each incident as an opportunity to improve the process, not as a crisis to survive and forget.


Need help building an incident response process for your team or want a second opinion on your current runbooks and alerting setup? Book a session at https://calendly.com/jamesrossjr.

