
Incident Response

Practical incident playbook for detection, triage, containment, communication, and recovery.


Use this playbook for high-impact delivery failures or broad degradation.

Purpose

This guide provides a repeatable incident flow:

Detect and declare.
Scope and triage.
Contain impact.
Communicate updates.
Recover and review.

Prerequisites and permissions

Access to Activity, endpoint controls, and target controls.
On-call ownership for affected endpoints.

Step-by-step workflow

1. Detect and declare

Trigger incident mode for patterns such as:

Sustained DELIVERY_FAILED increase.
Widespread auth/validation rejections.
Business-critical flow interruption.

Record the incident start time and the affected endpoint IDs.
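As a rough illustration of what "sustained" could mean in practice, the sketch below flags an incident only when several consecutive time windows exceed a failure-rate threshold. The Window shape, the baseline rate, and the factor, floor, and window-count defaults are all assumptions made for this example; they are not values defined by the product.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Outcome counts for one fixed-length time window (hypothetical shape)."""
    total: int   # all delivery attempts in the window
    failed: int  # attempts whose outcome was DELIVERY_FAILED


def failure_rate(window: Window) -> float:
    """Fraction of attempts that failed; 0.0 for an empty window."""
    return window.failed / window.total if window.total else 0.0


def sustained_failure_increase(windows, baseline_rate, factor=3.0,
                               min_rate=0.05, min_windows=3):
    """Return True if the last `min_windows` windows all exceed the threshold.

    The threshold is the larger of baseline_rate * factor and min_rate, so a
    near-zero baseline does not turn a single failure into an incident.
    All defaults are illustrative assumptions.
    """
    if len(windows) < min_windows:
        return False
    threshold = max(baseline_rate * factor, min_rate)
    recent = windows[-min_windows:]
    return all(failure_rate(w) > threshold for w in recent)
```

Requiring every recent window to exceed the threshold keeps a single short spike from triggering incident mode, at the cost of slightly slower detection.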

2. Scope impact

Filter Activity to affected endpoints.
Classify blast radius:

Single endpoint
Single provider/destination
Multi-endpoint/systemic

Estimate volume and customer impact.
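The blast-radius classes above can be derived mechanically from a set of failed events. The sketch below is one possible rule set, assuming each event is a dict with hypothetical "endpoint_id" and "destination" keys; an actual Activity export may use different field names.

```python
def classify_blast_radius(failures):
    """Classify failed events into the blast-radius classes from this step.

    `failures` is a list of dicts with hypothetical keys "endpoint_id" and
    "destination". The rules are an illustrative assumption:
      - one endpoint involved            -> single endpoint
      - several endpoints, one target    -> single provider/destination
      - anything broader                 -> multi-endpoint/systemic
    """
    if not failures:
        return "none"
    endpoints = {f["endpoint_id"] for f in failures}
    destinations = {f["destination"] for f in failures}
    if len(endpoints) == 1:
        return "single endpoint"
    if len(destinations) == 1:
        return "single provider/destination"
    return "multi-endpoint/systemic"
```

Checking the endpoint count first matches the advice to isolate by endpoint before widening the scope.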

3. Contain quickly

Prefer the smallest safe change first:

Disable the problematic endpoint.
Detach or fix the failing target.
Roll back recent config changes.

UI cue: the endpoint enable toggle and the target edit actions are the fastest containment tools.

4. Communicate updates

Share at minimum:

What is failing (outcomes/endpoints)
Mitigation in progress
Next update time

For external help, open Support with timestamps, endpoint IDs, and error details.
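A small helper can keep every update consistent with the three minimum fields listed above. This is only a sketch: the function name, the 30-minute default cadence, and the output wording are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def format_update(failing, mitigation, now, next_update_in_minutes=30):
    """Build a status update with the three minimum fields from this step.

    `now` must be a timezone-aware datetime; the next update time is
    reported in UTC. The default cadence is an illustrative assumption.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    next_update = (now + timedelta(minutes=next_update_in_minutes)).astimezone(timezone.utc)
    return "\n".join([
        f"What is failing: {failing}",
        f"Mitigation in progress: {mitigation}",
        f"Next update: {next_update.strftime('%Y-%m-%d %H:%M')} UTC",
    ])


if __name__ == "__main__":
    print(format_update(
        failing="DELIVERY_FAILED on the affected endpoint",  # placeholder text
        mitigation="endpoint disabled while the target is fixed",
        now=datetime.now(timezone.utc),
    ))
```

Committing to an explicit next update time in each message avoids open-ended silence while mitigation is still in progress.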

5. Recover and verify

Validate the fix with controlled traffic.
Confirm that success outcomes recover.
Keep elevated monitoring in place until delivery is stable.
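One way to make "until stable" concrete is to require a run of consecutive healthy windows before standing down. The sketch below assumes you already have per-window success rates and a known baseline; the tolerance and window count are illustrative defaults, not product settings.

```python
def is_stable(success_rates, baseline, tolerance=0.01, required_windows=6):
    """Return True if the most recent windows are all back near baseline.

    `success_rates` is a chronological list of per-window success fractions.
    A window counts as healthy when its rate is at least baseline - tolerance.
    Both defaults are assumptions chosen for illustration.
    """
    if len(success_rates) < required_windows:
        return False  # not enough post-fix data to judge
    recent = success_rates[-required_windows:]
    return all(rate >= baseline - tolerance for rate in recent)
```

Looking only at the trailing windows means earlier failures from the incident itself do not block closure, while a single unhealthy recent window resets the wait.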

6. Post-incident review

Document:

Root cause
Detection gap
Recovery timeline
Preventive actions and owners

Expected result and verification checks

Impact is contained quickly.
Delivery health returns to baseline.
Follow-up actions are assigned.

Common issues and fixes

Over-scoping: isolate by endpoint first, then widen the scope only if needed.
Early closure: require sustained healthy outcomes before closing the incident.
Weak evidence: capture outcome and error snapshots before making any changes.
