Incident Response
Practical incident playbook for detection, triage, containment, communication, and recovery.
Use this playbook for high-impact delivery failures or broad degradation.
Purpose
This guide provides a repeatable incident flow:
- Detect and declare.
- Scope and triage.
- Contain impact.
- Communicate updates.
- Recover and review.
Prerequisites and permissions
- Access to Activity, endpoint controls, and target controls.
- On-call ownership for affected endpoints.
Step-by-step workflow
1. Detect and declare
Trigger incident mode for patterns such as:
- Sustained DELIVERY_FAILED increase.
- Widespread auth/validation rejections.
- Business-critical flow interruption.
Capture incident start time and affected endpoint IDs.
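One way to make "sustained" concrete is to require several consecutive time windows above a multiple of the baseline failure rate, so a single spike does not trigger incident mode. The sketch below is illustrative only: the Window shape, the function names, and the threshold values are assumptions, not part of the product, and the counts would come from whatever export of Activity data is available to you.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Outcome counts for one time window (hypothetical shape)."""
    total: int   # all deliveries in the window
    failed: int  # deliveries whose outcome was DELIVERY_FAILED


def failure_rate(window: Window) -> float:
    """Fraction of deliveries that failed; 0.0 for an empty window."""
    return window.failed / window.total if window.total else 0.0


def is_sustained_increase(windows, baseline_rate, factor=3.0,
                          min_rate=0.01, min_windows=3):
    """Return True if the last `min_windows` windows all exceed the threshold.

    The threshold is `factor` times the baseline, with a floor of `min_rate`
    so that a near-zero baseline does not trigger on a single failure.
    All default values here are assumptions to tune per environment.
    """
    if len(windows) < min_windows:
        return False
    threshold = max(baseline_rate * factor, min_rate)
    return all(failure_rate(w) > threshold for w in windows[-min_windows:])
```

A check like this is only a trigger for declaring; the start time and affected endpoint IDs still need to be recorded by hand or by the surrounding tooling.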
2. Scope impact
- Filter Activity to affected endpoints.
- Classify blast radius:
  - Single endpoint
  - Single provider/destination
  - Multi-endpoint/systemic
- Estimate volume and customer impact.
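The blast-radius classes above can be derived mechanically from the set of failing deliveries. The following is a minimal sketch under the assumption that each failure can be reduced to an (endpoint ID, destination) pair; the enum, the function name, and the input shape are hypothetical.

```python
from enum import Enum


class BlastRadius(Enum):
    SINGLE_ENDPOINT = "single endpoint"
    SINGLE_DESTINATION = "single provider/destination"
    SYSTEMIC = "multi-endpoint/systemic"


def classify_blast_radius(failures):
    """Classify failures given as (endpoint_id, destination) pairs.

    Returns None when there are no failures to classify.
    """
    pairs = list(failures)  # allow any iterable, including generators
    if not pairs:
        return None
    endpoints = {endpoint for endpoint, _ in pairs}
    destinations = {destination for _, destination in pairs}
    if len(endpoints) == 1:
        return BlastRadius.SINGLE_ENDPOINT
    if len(destinations) == 1:
        # Several endpoints fail, but all towards the same destination.
        return BlastRadius.SINGLE_DESTINATION
    return BlastRadius.SYSTEMIC
```

Checking the endpoint count first matches the advice to isolate by endpoint before widening the scope.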
3. Contain quickly
Prefer the smallest safe change first:
- Disable problematic endpoint.
- Detach/fix failing target.
- Roll back recent config changes.
UI cue: the endpoint enable toggle and the target edit actions are the fastest containment tools.
4. Communicate updates
Share at minimum:
- What is failing (outcomes/endpoints)
- Mitigation in progress
- Next update time
For external help, open Support with timestamps, endpoint IDs, and error details.
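A small template helps keep every update complete. The sketch below formats the three minimum fields and refuses to produce an update with one missing; the function name, the 30-minute default cadence, and the output layout are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def format_update(failing, mitigation, now, next_update_minutes=30):
    """Build a status update containing the three minimum fields.

    `now` should be a timezone-aware UTC datetime. Raises ValueError if
    either text field is empty, so incomplete updates are not sent.
    """
    if not failing.strip() or not mitigation.strip():
        raise ValueError("both 'failing' and 'mitigation' are required")
    next_update = now + timedelta(minutes=next_update_minutes)
    return "\n".join([
        f"What is failing: {failing}",
        f"Mitigation in progress: {mitigation}",
        f"Next update: {next_update:%Y-%m-%d %H:%M} UTC",
    ])


if __name__ == "__main__":
    # Example values are made up for demonstration.
    print(format_update(
        failing="DELIVERY_FAILED outcomes on two endpoints",
        mitigation="rolling back recent config change",
        now=datetime.now(timezone.utc),
    ))
```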
5. Recover and verify
- Validate fix with controlled traffic.
- Confirm success outcomes recover.
- Keep elevated monitoring until stable.
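"Stable" can be defined the same way as detection, in reverse: the failure rate must stay at or near baseline for a run of consecutive windows before monitoring is relaxed. This is a sketch with assumed names and default values, operating on a plain list of per-window failure rates.

```python
def is_stable(rates, baseline_rate, tolerance=1.2, required_windows=6):
    """Return True if the most recent windows are all near baseline.

    `rates` is a chronological list of per-window failure rates.
    A window counts as healthy when its rate is at most
    `baseline_rate * tolerance`. Both defaults are assumptions.
    """
    if len(rates) < required_windows:
        return False  # not enough evidence yet
    limit = baseline_rate * tolerance
    return all(rate <= limit for rate in rates[-required_windows:])
```

Requiring a full run of healthy windows guards against closing the incident on one good reading.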
6. Post-incident review
Document:
- Root cause
- Detection gap
- Recovery timeline
- Preventive actions and owners
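The review items above can be captured in a small record that reports what is still missing, including preventive actions without an owner. The class and field names below are hypothetical; this is one possible shape, not a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class PreventiveAction:
    description: str
    owner: str = ""  # empty string means no owner assigned yet


@dataclass
class IncidentReview:
    root_cause: str
    detection_gap: str
    recovery_timeline: list  # chronological (timestamp, event) entries
    preventive_actions: list = field(default_factory=list)

    def missing_items(self):
        """Return the names of review items that are still incomplete."""
        missing = []
        if not self.root_cause.strip():
            missing.append("root cause")
        if not self.detection_gap.strip():
            missing.append("detection gap")
        if not self.recovery_timeline:
            missing.append("recovery timeline")
        if not self.preventive_actions:
            missing.append("preventive actions")
        for action in self.preventive_actions:
            if not action.owner.strip():
                missing.append(f"owner for: {action.description}")
        return missing
```

An empty result from missing_items() is a reasonable gate before the review is marked complete.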
Expected result and verification checks
- Impact is contained quickly.
- Delivery health returns to baseline.
- Follow-up actions are assigned.
Common issues and fixes
- Over-scoping: isolate by endpoint first.
- Early closure: require sustained healthy outcomes before closing the incident.
- Weak evidence: capture outcome/error snapshots before edits.