Written by: Nimesh Chakravarthi, Co-founder & CTO, Struct
Key Takeaways
- Poorly designed on-call rotations create alert fatigue and burnout. Teams need to revisit schedules as headcount and alert volume grow.
- This seven-step playbook walks managers through baselining alert volume, mapping time zones, choosing rotation models, defining primary and secondary roles, configuring PagerDuty layers, building escalation policies, and planning overrides.
- Effective schedules rest on documented baselines, rotation models that match team structure, clear escalation paths, and monthly metric reviews tied to thresholds like pages per shift and MTTA.
- Solo on-call, undersized rotations, noisy alerts, and missing recovery time all increase burnout and turnover. Fixing these issues prevents extended incidents and attrition.
- Even a strong schedule still leaves engineers doing manual triage. Use Struct to automate your on-call runbook and cut investigation time by 80%.
Design Your First Effective Rotation (7 Steps)
- Baseline alert volume and team capacity. Establish a factual starting point before touching PagerDuty. Involve the engineering manager and at least one senior IC. Required input: 90 days of PagerDuty incident history. Output: pages-per-week average and a list of alerts that have not required human action in the last 90 days. This work takes time but prevents building a rotation around noise. Pruning non-actionable alerts can cut pager load by 30–40%, which can change the rotation model you need.
- Map time-zone distribution. Decide whether a co-located or follow-the-sun model fits your team. Involve all engineers who will carry a pager. Required input: team roster with home time zones. Output: a coverage map showing gaps. Follow-the-sun coverage reduces overnight work but requires reliable handoff procedures and enough engineers in each region.
- Choose a rotation model. Select the shift structure that matches team size and coverage requirements. A practical framework recommends weekly rotations for teams under five engineers in a single timezone, weekly or daily rotations for five-to-ten engineers, and follow-the-sun or daily rotations for ten or more engineers across multiple time zones. Output: a documented rotation model decision. Weekly shifts are predictable but concentrate risk. Daily shifts distribute load but increase handoff overhead.
- Define primary and secondary roles. Remove single points of failure. Every production rotation needs both a primary and a secondary on-call engineer per shift, with pages escalating automatically after a defined timeout. Output: two named responders per shift block in PagerDuty. This approach requires a rotation large enough to staff two roles without overloading individuals. Google recommends an on-call rotation of at least 8 engineers.
- Configure schedule layers and restrictions. Encode the rotation model in PagerDuty using layers and time restrictions. Involve the team lead responsible for PagerDuty administration. Required input: rotation model decision, team roster, and time-zone map. Output: a configured PagerDuty schedule with at least a primary layer and a secondary layer. Additional layers add coverage precision but increase administrative complexity.
- Build escalation policies. Ensure every unacknowledged page reaches a human within a defined window. Standard escalation timing routes from primary to secondary after five minutes, then to the engineering manager after ten more minutes, then to the VP of Engineering after fifteen more minutes, with faster thresholds for P1 incidents. Output: a documented escalation policy linked to every service in PagerDuty.
- Plan overrides and handoff documentation. Handle holidays, vacations, and unexpected absences without coverage gaps. Engineers need several weeks of advance notice so they can plan personal commitments around on-call duties. Output: a written handoff checklist covering active incidents, recent deployments, system health dashboards, scheduled maintenance, and known issues. Thorough documentation takes time to maintain but directly reduces incident reopening rates.
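The escalation timing in step 6 is stated as a delay per step, so the cumulative clock is easy to misread. The following is a minimal Python sketch of that arithmetic. The names `ESCALATION_STEPS` and `escalation_timeline` are hypothetical and do not come from PagerDuty or its API.

```python
from itertools import accumulate

# Delay in minutes before the page moves on to each responder, taken from
# the standard timing described in step 6. Names are illustrative only.
ESCALATION_STEPS = [
    ("primary", 0),
    ("secondary", 5),
    ("engineering manager", 10),
    ("VP of Engineering", 15),
]


def escalation_timeline(steps):
    """Return (responder, minutes since the page fired) for each level."""
    names = [name for name, _ in steps]
    elapsed = accumulate(delay for _, delay in steps)
    return list(zip(names, elapsed))


for responder, minute in escalation_timeline(ESCALATION_STEPS):
    print(f"t+{minute:>2} min: page {responder}")
```

Under these timings an unacknowledged page reaches the VP of Engineering 30 minutes after it fires. A P1 policy would use the same structure with shorter delays.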
Implementing the Seven Steps
Steps 1–2: Baseline and Map Your Team
Step 1 called for baselining alert volume and team capacity. Step 2 focused on mapping time zones and coverage. This section shows how to execute that baseline in practice.
Document the baseline before configuring a single PagerDuty schedule layer. Collect 90 days of alert volume, map every engineer’s time zone, and record current mean time to acknowledge (MTTA) and mean time to resolve (MTTR). This data guides the rotation model and clarifies whether the team is co-located or globally distributed.
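As a rough illustration of the baseline, the sketch below computes pages per week, MTTA, MTTR, and a list of pruning candidates from a handful of incident records. The record shape (`alert`, `created`, `acked`, `resolved`, `action_taken`) is an assumption made for this example. A real PagerDuty export uses different field names and would need mapping first.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records. A real PagerDuty export uses different
# field names, so treat these keys as placeholders.
incidents = [
    {"alert": "api-5xx", "created": datetime(2024, 1, 1, 2, 0),
     "acked": datetime(2024, 1, 1, 2, 4),
     "resolved": datetime(2024, 1, 1, 2, 40), "action_taken": True},
    {"alert": "disk-usage", "created": datetime(2024, 1, 8, 14, 0),
     "acked": datetime(2024, 1, 8, 14, 6),
     "resolved": datetime(2024, 1, 8, 14, 20), "action_taken": False},
    {"alert": "api-5xx", "created": datetime(2024, 1, 15, 9, 0),
     "acked": datetime(2024, 1, 15, 9, 8),
     "resolved": datetime(2024, 1, 15, 10, 0), "action_taken": True},
]


def minutes_between(start, end):
    """Elapsed time between two datetimes, in minutes."""
    return (end - start).total_seconds() / 60


def baseline(incidents, window_days=90):
    """Summarize alert volume and response times over the review window."""
    all_alerts = {i["alert"] for i in incidents}
    acted_on = {i["alert"] for i in incidents if i["action_taken"]}
    return {
        "pages_per_week": len(incidents) / (window_days / 7),
        "mtta_minutes": mean(
            minutes_between(i["created"], i["acked"]) for i in incidents),
        "mttr_minutes": mean(
            minutes_between(i["created"], i["resolved"]) for i in incidents),
        # Alerts that never needed human action are pruning candidates.
        "prune_candidates": sorted(all_alerts - acted_on),
    }


# The sample data spans three weeks, so the window is shortened to match.
print(baseline(incidents, window_days=21))
```

For a real baseline, pass the full 90 days of history and keep the default window.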
The choice between co-located and follow-the-sun models hinges on four structural trade-offs that directly affect handoff complexity and minimum team size:
| Dimension | Co-located Team | Global / Follow-the-Sun Team |
|---|---|---|
| Time zones covered | 1–2 overlapping zones | 3+ non-overlapping zones (e.g., Americas, EMEA, APAC) |
| Overnight coverage method | Rotating pager duty; engineers may be woken | Each region owns its daylight hours; no engineer works overnight |
| Handoff complexity | Low, same-timezone verbal or Slack handoff | High, requires explicit documentation of regional traffic patterns, infrastructure differences, and escalation contacts per region |
| Minimum viable team size | At least 8 engineers per rotation | 3 engineers per shift (9 total across three regions) |
Once you have this baseline and a clear decision on co-located versus follow-the-sun coverage, you can move into PagerDuty configuration with confidence.
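The rotation framework from step 3 can be written down as a small lookup so the decision is recorded alongside its inputs. This is only a sketch. The function name `suggest_rotation` is hypothetical, and because the framework's ranges overlap at ten engineers and do not cover every combination, the boundary handling noted in the comments is an assumption.

```python
def suggest_rotation(team_size: int, time_zones: int) -> str:
    """Map team size and time-zone count to the step 3 framework."""
    # Assumption: exactly ten engineers across several time zones is
    # treated as the "ten or more" case.
    if team_size >= 10 and time_zones > 1:
        return "follow-the-sun or daily"
    # Assumption: large single-zone teams fall back to the mid-size advice.
    if team_size >= 5:
        return "weekly or daily"
    # Assumption: small teams get weekly shifts regardless of time zones,
    # although the framework only addresses the single-zone case.
    return "weekly"


for size, zones in [(4, 1), (7, 1), (12, 3)]:
    print(f"{size} engineers, {zones} zone(s): {suggest_rotation(size, zones)}")
```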
Steps 5–7: Configure PagerDuty and Plan Handoffs
Steps 5 through 7 translate your rotation model into concrete PagerDuty schedules, escalation policies, and handoff practices. This section explains how schedule layers, restrictions, and overrides work together.
PagerDuty schedule layers let teams stack multiple rotations on a single schedule. Use a primary layer for your main rotation and a secondary layer for backup coverage. Add a third layer only when the first two are consistently saturated, not preemptively.
When to add a layer versus shorten shifts: start by checking whether engineers miss acknowledgments because they are asleep during their shift window. If that happens regularly, add a layer or switch to follow-the-sun coverage. Before making that change, verify that alert volume is reasonable. If actionable incidents per shift consistently exceed two to three, audit and prune alerts first, because extra layers will not fix a noisy alerting stack. If alert volume looks acceptable but a single bad week still ruins one engineer’s month, shorten shifts from weekly to daily or use split shifts. When handoff overhead is already high and context loss is frequent, lengthen shifts instead of adding layers. If the team has fewer than six engineers, none of these adjustments will work well, so limit coverage to business hours or partner with another team before attempting 24/7 rotations.
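The checks in this paragraph can be read as an ordered decision flow. The sketch below is one possible encoding, not a definitive rule set. The `RotationSignals` fields are hypothetical, the threshold of three uses the upper end of the "two to three" range, and the order of the last three checks is an assumption because the text does not rank them.

```python
from dataclasses import dataclass


@dataclass
class RotationSignals:
    # Hypothetical inputs gathered from the baseline and team feedback.
    team_size: int
    actionable_incidents_per_shift: float
    handoff_context_loss: bool
    missed_acks_while_asleep: bool
    bad_week_ruins_month: bool


def next_adjustment(s: RotationSignals) -> str:
    """Return the first adjustment that applies, checked in priority order."""
    if s.team_size < 6:
        return "limit coverage to business hours or partner with another team"
    if s.actionable_incidents_per_shift > 3:  # assumed upper bound
        return "audit and prune alerts first"
    if s.handoff_context_loss:
        return "lengthen shifts"
    if s.missed_acks_while_asleep:
        return "add a layer or switch to follow-the-sun"
    if s.bad_week_ruins_month:
        return "shorten shifts to daily or use split shifts"
    return "keep the current design"


print(next_adjustment(RotationSignals(8, 1.5, False, True, False)))
```

Checking team size and alert volume first reflects the paragraph's point that schedule changes cannot compensate for a rotation that is too small or too noisy.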
Schedule restrictions in PagerDuty control when a layer is active. For example, you can restrict a layer to weekday business hours and use a separate layer for nights and weekends. This feature supports split-shift coverage and follow-the-sun on-call rotation patterns.
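To make the split concrete, here is a small Python model of two restricted layers. It does not call PagerDuty. The layer names and the 09:00 to 17:00 business-hours window are assumptions chosen for the example.

```python
from datetime import datetime, time

# Assumed business-hours window in the team's local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def active_layer(moment: datetime) -> str:
    """Return which hypothetical layer covers the given local time."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = BUSINESS_START <= moment.time() < BUSINESS_END
    if is_weekday and in_hours:
        return "business-hours"
    return "nights-and-weekends"


print(active_layer(datetime(2024, 1, 3, 10, 0)))  # a Wednesday morning
print(active_layer(datetime(2024, 1, 6, 10, 0)))  # a Saturday morning
```

Because every moment maps to exactly one layer, the two restrictions together leave no coverage gap, which is the property to verify after configuring the real schedule.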
Holiday and vacation overrides should be created several weeks in advance. During handoffs, note expected holiday traffic changes as upcoming events so the incoming engineer is not surprised by predictable load swings. For global teams, write handoff documentation in simple, direct language and include screenshots or diagrams to reduce ambiguity across regions.
Some teams still rely on spreadsheets for on-call schedules. An on-call rotation schedule template in Excel works for small teams, but spreadsheets often break down as teams grow: updates are missed, and identifying the current on-call engineer during an incident takes too long. Dedicated tooling carries its own trade-off. For teams of around 20 engineers, PagerDuty’s per-user pricing often makes lower-cost alternatives preferable for on-call management.
How PagerDuty Scheduling Fits Into Engineering Operations
A PagerDuty schedule only works as well as the systems feeding it. The schedule defines who receives a page. Escalation policies define what happens if they do not respond. Post-incident reviews reveal whether the rotation design itself slowed the response.
Map your schedule to four vendor-neutral integration categories so you can see the full incident picture.
- Observability platforms (Datadog, Grafana, Prometheus, CloudWatch). These tools generate the alerts that trigger PagerDuty. Noisy observability configurations create noisy rotations.
- Ticketing systems (Jira, Linear, Asana). Link PagerDuty incidents to tickets so post-incident reviews have a complete audit trail.
- Source control (GitHub). Correlate deployments with incident spikes to identify whether a release caused a surge in pages.
- Chat (Slack). Route PagerDuty notifications to the correct Slack channel so the on-call engineer has immediate context and the team has visibility.
Struct connects across all four categories. When an alert fires in PagerDuty or Slack, Struct automatically investigates, correlating logs from Datadog, CloudWatch, or GCP, mapping exceptions from Sentry, and cross-referencing GitHub commits. The on-call engineer opens a completed root-cause report instead of a blank terminal. See how Struct connects to your observability stack and eliminate the manual first-pass entirely.
Measurement and Continuous Improvement
Schedule design needs regular review because team size, coverage needs, and alert volume change over time. Treat the rotation as a living system rather than a one-time configuration.
Track these metrics monthly using PagerDuty analytics and tie each one to specific design decisions.
- Pages per shift: target 0–2 for a healthy rotation. Higher values suggest noisy alerts or an overloaded rotation model.
- After-hours pages: keep these as low as possible. Frequent night pages signal a need for follow-the-sun coverage or better alert tuning.
- Mean time to acknowledge (MTTA): aim for a low MTTA. Rising MTTA can indicate unclear ownership, alert fatigue, or unrealistic expectations for overnight response.
- Escalation rate: keep this low. High escalation rates may indicate coverage gaps, unclear responsibilities, or engineers ignoring pages due to overload.
- Override frequency: monitor this closely. High override frequency often means the rotation design does not match real availability or personal constraints.
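A monthly review can start from a few per-shift tallies. The sketch below assumes a hypothetical record shape (`engineer`, `pages`, `after_hours_pages`, `escalated`) and not PagerDuty's analytics columns. Only the pages-per-shift target comes from the list above, so the other figures are reported without thresholds.

```python
# Hypothetical per-shift tallies for one month. Field names are placeholders.
shifts = [
    {"engineer": "A", "pages": 1, "after_hours_pages": 0, "escalated": 0},
    {"engineer": "B", "pages": 5, "after_hours_pages": 3, "escalated": 2},
    {"engineer": "C", "pages": 2, "after_hours_pages": 1, "escalated": 0},
    {"engineer": "D", "pages": 0, "after_hours_pages": 0, "escalated": 0},
]

PAGES_PER_SHIFT_TARGET = 2  # upper end of the pages-per-shift target


def monthly_review(shifts):
    """Aggregate the month's shifts into the review metrics."""
    total_pages = sum(s["pages"] for s in shifts)
    escalated = sum(s["escalated"] for s in shifts)
    return {
        "pages_per_shift": total_pages / len(shifts),
        "after_hours_pages": sum(s["after_hours_pages"] for s in shifts),
        # Share of pages that escalated past the primary responder.
        "escalation_rate": escalated / total_pages if total_pages else 0.0,
        "shifts_over_target": [
            s["engineer"] for s in shifts
            if s["pages"] > PAGES_PER_SHIFT_TARGET
        ],
    }


print(monthly_review(shifts))
```

In this sample the average sits at the target, but one shift carried most of the load, which is why the per-shift breakdown is worth keeping next to the average.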
Recovery-day math provides a concrete rule for sustainable load. If an engineer is paged more than twice after midnight during a weekly shift, that shift should include a compensatory recovery day. A team running weekly rotations with an average of four after-hours pages per shift is effectively asking engineers to work six days out of seven. If that load persists, the metric reviews will show rising MTTR and increasing override frequency. Those signals mean you should shorten shifts or add a layer before the next rotation cycle.
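The recovery-day rule is simple enough to check automatically at the end of each shift. The sketch below is a minimal version. The 00:00 to 06:00 window used for "after midnight" is an assumption, as are the names `needs_recovery_day` and `NIGHT_END`.

```python
from datetime import datetime, time

NIGHT_START = time(0, 0)
NIGHT_END = time(6, 0)  # assumed end of the "after midnight" window


def needs_recovery_day(page_times, threshold=2):
    """True when more than `threshold` pages landed in the night window."""
    night_pages = [
        t for t in page_times if NIGHT_START <= t.time() < NIGHT_END
    ]
    return len(night_pages) > threshold


# Page timestamps for one weekly shift, in the engineer's local time.
week = [
    datetime(2024, 1, 2, 1, 15),
    datetime(2024, 1, 3, 14, 0),
    datetime(2024, 1, 4, 2, 30),
    datetime(2024, 1, 6, 3, 5),
]
print(needs_recovery_day(week))
```

Timestamps need to be in the engineer's local time for the window to mean anything, which matters most for distributed teams.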
Common Pitfalls and Practical Fixes
- Solo on-call with no backup. Teams without clear escalation paths risk extended incidents if the on-call engineer is unavailable. Fix: always configure a secondary layer and a documented escalation policy.
- Undersized rotation. Google recommends an on-call rotation of at least 8 engineers. Teams below this threshold should limit coverage to business hours instead of burning out a small group with 24/7 duty.
- Publishing schedules with insufficient lead time. Without several weeks of advance notice, engineers cannot plan personal commitments around on-call duties, which leads to last-minute override requests and coverage gaps. Fix: generate and publish the next quarter’s schedule at the start of each quarter.
- Building complex rotations around noisy alerts. Teams consistently seeing eight to ten incidents per shift should audit their alerting stack instead of adding schedule layers. Fix: prune alerts that have not required human action in 90 days before restructuring the rotation.
- No compensation or recovery time. Some teams have lost engineers to on-call burden when compensation or recovery time was absent. Fix: establish explicit compensation policies plus recovery time for overnight pages.
Even a well-designed schedule still leaves engineers doing manual triage the moment a page fires. Eliminate manual triage with Struct so that by the time the on-call engineer acknowledges the page, the root cause, blast radius, and suggested fix are already waiting in their Slack thread.
Conclusion
Effective PagerDuty on-call schedule design requires a documented baseline, a rotation model matched to team size and time-zone distribution, explicit primary and secondary roles, configured escalation policies, and a regular review cadence tied to measurable thresholds. These seven steps address structural causes of alert fatigue and burnout. The next layer of improvement is automation, which removes the manual investigation work that consumes 30–45 minutes per incident even when the schedule is well designed. Struct integrates with PagerDuty, Slack, and your observability stack in under ten minutes and delivers a completed root-cause report before the engineer finishes reading the alert. Give your team their nights back with Struct and reduce after-hours pages that require manual investigation.