Written by: Nimesh Chakravarthi, Co-founder & CTO, Struct
Key Takeaways
- Datadog OnCall centralizes on-call scheduling, escalation policies, and alert routing inside the Datadog platform, so teams avoid separate paging tools.
- Manual triage after an alert fires, such as hunting logs and correlating traces, still consumes significant time that Datadog OnCall alone does not remove.
- Struct acts as an AI agent that automatically root-causes alerts by analyzing metrics, logs, traces, monitors, and code, then delivers investigation reports before engineers open their laptops.
- Integrating Struct with Datadog OnCall can reduce triage time by up to 80% and gives junior engineers senior-level context from the first alert.
- Teams ready to remove manual investigation steps can automate their on-call runbook with a Struct demo.
How Datadog OnCall Fits Into Your Incident Workflow
Datadog OnCall centralizes on-call scheduling, escalation policies, and alert routing inside the Datadog observability platform, so engineering teams manage rotations, acknowledge incidents, and trigger postmortems from a single interface. Datadog maintains more than 1,000 integrations, which makes it a natural hub for teams already running monitors, APM, and log management in the same toolchain.
The gap Datadog OnCall does not close is the manual triage work that happens after a page fires, such as hunting logs, correlating traces, and assembling context across tools. That is where Struct operates. Struct is an AI agent that automatically root-causes engineering alerts by pulling and analyzing metrics, logs, traces, monitors, and code, then delivers a full investigation report before an engineer opens a laptop.
The following five steps walk through the complete Datadog OnCall workflow, from initial schedule configuration through postmortem generation. Each step includes a Struct automation overlay that shows how AI investigation removes manual triage work at that stage.
1. Configure On-Call Schedules and Escalation Policies
Start by creating predictable rotations so every alert reaches a responsible owner quickly. In Datadog, navigate to On-Call → Schedules → New Schedule. Define rotation layers such as weekly, follow-the-sun, or custom, assign team members, and set override windows for vacations and swaps. Schedules should be published at least 4–6 weeks in advance and should always include a secondary or backup layer to avoid single points of failure.
Next, open On-Call → Escalation Policies → New Policy. A standard policy notifies the primary on-call within 5 minutes, then escalates to the secondary within 10 minutes if no one responds, followed by the team lead at 15 minutes and the engineering manager at 30 minutes. These intervals support target metrics such as mean time to acknowledge under 5 minutes and an escalation rate below 10%, so most alerts resolve at the primary layer.
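The escalation ladder described above can be sketched as plain data to sanity-check the timings before entering them in the UI. This is a minimal illustration, not a Datadog API call: the `EscalationStep` class, the `STANDARD_POLICY` list, and both helper functions are hypothetical names, and the sketch assumes each minute mark is the latest point at which that layer is notified while the page remains unacknowledged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    notify_at_min: int  # minutes the page has gone unacknowledged
    target: str         # who is notified at that point


# Hypothetical policy mirroring the 5/10/15/30-minute intervals above.
STANDARD_POLICY = [
    EscalationStep(5, "primary on-call"),
    EscalationStep(10, "secondary on-call"),
    EscalationStep(15, "team lead"),
    EscalationStep(30, "engineering manager"),
]


def notified_targets(policy, minutes_unacknowledged):
    """Return every target that should have been notified by this point."""
    return [
        step.target
        for step in policy
        if step.notify_at_min <= minutes_unacknowledged
    ]


def escalation_rate(total_pages, escalated_pages):
    """Share of pages that escalated past the primary layer (0.0 to 1.0)."""
    if total_pages == 0:
        return 0.0
    return escalated_pages / total_pages


print(notified_targets(STANDARD_POLICY, 12))
print(escalation_rate(200, 15) < 0.10)  # is the rate under the 10% target?
```

Keeping the ladder as data makes it easy to compare a month of paging history against the under-10% escalation target before adjusting intervals.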
Struct automation overlay: Once schedules are live, point Struct at the same Slack alerting channel Datadog pages into. Struct begins auto-investigating every alert the moment it fires, correlating logs, traces, and code context, so the on-call engineer receives a root-cause summary in Slack before the 5-minute acknowledgment window expires. Junior engineers gain the same starting-point context a senior engineer would assemble manually, which makes it safe to expand the rotation without tribal-knowledge risk.
2. Connect Monitors to Teams and Set Notification Channels
Route each alert to the right people through the right channels. In Monitors → Edit Monitor → Notify your team, add the escalation policy handle, such as @oncall-policy-production, to the alert message. This handle defines who receives the page. Next, set separate notification targets for alert, warning, and recovery states so teams can tune noise levels by severity.
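As a rough sketch of the per-state routing described above, the helper below assembles a monitor message using Datadog's conditional template blocks (`{{#is_alert}}`, `{{#is_warning}}`, `{{#is_recovery}}`). The function name, the summary text, and the two Slack handles are hypothetical placeholders; only `@oncall-policy-production` comes from the example above.

```python
def build_monitor_message(summary, page_handle, warning_handle, recovery_handle):
    """Compose a monitor message that notifies a different target per state."""
    sections = [
        summary,
        # Only rendered when the monitor enters the alert state.
        "{{#is_alert}}Paging " + page_handle + "{{/is_alert}}",
        # Only rendered when the monitor enters the warning state.
        "{{#is_warning}}Heads-up for " + warning_handle + "{{/is_warning}}",
        # Only rendered when the monitor recovers.
        "{{#is_recovery}}Recovered, notifying " + recovery_handle + "{{/is_recovery}}",
    ]
    return "\n".join(sections)


message = build_monitor_message(
    summary="Checkout error rate is above threshold.",
    page_handle="@oncall-policy-production",  # handle from the text above
    warning_handle="@slack-prod-warnings",    # hypothetical Slack handle
    recovery_handle="@slack-prod-alerts",     # hypothetical Slack handle
)
print(message)
```

The resulting string is what you would paste into the monitor's message field, so the page goes out only on the alert state while warnings and recoveries stay in Slack.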
Finally, configure notification channels such as Slack, PagerDuty, email, and SMS inside On-Call → Teams → Notification Rules. Severity-based rules work well here: SEV1 uses phone, SMS, Slack, and email with repeated alerts every 5 minutes, SEV2 uses SMS, Slack, and email every 10 minutes, and SEV3 uses Slack and email during business hours only. This structure keeps critical incidents loud while keeping lower-severity noise manageable.
Struct automation overlay: Struct listens to the same Slack channels that receive Datadog notifications, so no extra routing setup is needed. When a monitor fires, Struct immediately queries Datadog metrics, AWS CloudWatch or GCP logs, Sentry exceptions, and GitHub code history in parallel. It then posts an impact summary that covers blast radius, affected users, and probable root cause directly in the alert thread before the first human acknowledgment.
3. Acknowledge Alerts and Perform Mobile Triage
On-call engineers need a fast way to claim ownership and start triage from anywhere. Engineers acknowledge pages through the Datadog mobile app using On-Call → My Pages → Acknowledge or via the Slack notification action button. Acknowledgment stops escalation and signals that someone is actively working the incident.
Initial triage usually involves opening the linked monitor, reviewing the metric graph, and pivoting to Logs or APM to identify the failure surface. Teams without automation often spend the first several minutes gathering this context instead of testing fixes.
Struct automation overlay: By the time an engineer taps Acknowledge, Struct has already completed the context-gathering phase. The Slack thread contains a dynamically generated dashboard link with supporting charts pulled from Datadog, a unified timeline that merges events across the stack, and suggested fixes. Engineers can tag Struct in the thread to pull logs from a specific time window, test an alternative hypothesis, or verify whether a specific user is impacted, all without leaving Slack.
4. Declare Incidents and Automate Communication
Clear incident declaration keeps stakeholders aligned during high-impact events. For P1 and P2 incidents, open Monitors → Declare Incident or use the /datadog incident Slack command. Set severity, assign an incident commander, and add a public status-page component if needed. Datadog then creates a dedicated Slack channel, a timeline, and a Zoom bridge. Update the incident status as triage progresses using Incident → Status → Update.
Struct automation overlay: Struct posts its root-cause report directly into the auto-created incident Slack channel, which gives the incident commander immediate context for customer communications. Struct performs regression analysis, correlates anomalies, and generates impact summaries and incident reports, replacing the manual investigation phase that often consumes the first half-hour of a P1 bridge call.
5. Analyze Telemetry and Generate Postmortems
Postmortems turn each incident into a durable improvement. After resolution, open Incident → Postmortem → Generate. Datadog pre-populates the document with the incident timeline, monitor states, and linked dashboards. Teams then add contributing factors, action items, and owners before publishing.
MTTR, MTTA, MTBF, and MTTF are complementary metrics that together provide a complete view of incident-management performance. Teams should review these four core incident metrics weekly and focus on trends rather than isolated spikes, so they can see whether process changes actually improve reliability.
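For teams that want to track two of these metrics outside the Datadog UI, the sketch below computes MTTA and MTTR from incident timestamps. It is an illustration under stated assumptions: the `Incident` record and function names are hypothetical, and both metrics are measured from the moment the alert fired, which is one common convention among several.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Incident:
    fired_at: datetime         # when the alert fired
    acknowledged_at: datetime  # when an engineer acknowledged the page
    resolved_at: datetime      # when the incident was resolved


def _minutes(start, end):
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


def mtta_minutes(incidents):
    """Mean time to acknowledge; raises StatisticsError on an empty list."""
    return mean(_minutes(i.fired_at, i.acknowledged_at) for i in incidents)


def mttr_minutes(incidents):
    """Mean time to resolve, measured from alert fire to resolution."""
    return mean(_minutes(i.fired_at, i.resolved_at) for i in incidents)


# Made-up sample data for illustration only.
week = [
    Incident(datetime(2025, 1, 6, 10, 0), datetime(2025, 1, 6, 10, 4), datetime(2025, 1, 6, 10, 50)),
    Incident(datetime(2025, 1, 8, 12, 0), datetime(2025, 1, 8, 12, 2), datetime(2025, 1, 8, 12, 30)),
]
print(mtta_minutes(week), mttr_minutes(week))
```

Running this over each week's incidents gives a trend line, which matters more than any single week's value.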
Struct automation overlay: Struct’s investigation report, which includes correlated log queries, anomaly timelines, and root-cause evidence, flows directly into the postmortem draft. This connection reduces the time spent reconstructing what happened. Once the root cause is confirmed, Struct can hand off context to a coding agent or generate a pull request, which closes the loop from alert detection to code resolution.
Comparison: MTTD and MTTR With vs. Without Automation
The table below summarizes how teams improve performance when they pair Struct’s automated investigation with Datadog OnCall’s routing. Focus on the drop in triage time and the resulting impact on MTTD and MTTR targets.
| Metric | Without Automation | With Struct + Datadog OnCall | Source |
|---|---|---|---|
| Investigation / triage time | 30–45 min per incident | Significantly reduced per incident | Struct |
| Triage time reduction | Baseline | Up to 80% reduction | Struct customer data |
| MTTD target (critical systems) | Varies by system | Under 5 min (automated detection + correlation) | TaskCall |
| MTTR target (SEV-1) | Often exceeds 1 hour without tooling | Can be reduced for high-performing teams | TaskCall |
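To make the triage rows concrete, the short sketch below applies the reported reduction to the 30–45 minute baseline. This is illustrative arithmetic on the figures in the table, not measured data, and the function name is hypothetical.

```python
def remaining_triage_minutes(baseline_min, reduction):
    """Triage time left after a fractional reduction (0.8 means 80%)."""
    if not 0 <= reduction <= 1:
        raise ValueError("reduction must be between 0 and 1")
    # Round to one decimal to hide floating-point noise.
    return round(baseline_min * (1 - reduction), 1)


low = remaining_triage_minutes(30, 0.8)
high = remaining_triage_minutes(45, 0.8)
print(f"{low}-{high} minutes per incident at the full 80% reduction")
```

At the full 80% reduction, the 30–45 minute baseline works out to roughly 6–9 minutes per incident; smaller reductions leave proportionally more manual time.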
Severity Matrix: P1–P4 Incident Handling
The severity definitions below set the notification and response expectations used throughout the five-step workflow. They show how escalation speed and MTTR targets scale with incident impact.
| Severity | Definition | Notification Rules | Response Expectation |
|---|---|---|---|
| P1 (SEV1) | Full outage or critical customer-facing failure | Phone + SMS + Slack + email, repeat every 5 min until acknowledged | Acknowledge within 5 min, incident declared immediately, MTTR target <1 hr |
| P2 (SEV2) | Significant degradation, partial customer impact | SMS + Slack + email, repeat every 10 min | Acknowledge within 10 min, MTTR target <4 hrs |
| P3 (SEV3) | Minor degradation, limited user impact | Slack + email during business hours only | Acknowledge within 30 min during business hours, MTTR target <24 hrs |
| P4 (SEV4) | Cosmetic or low-impact issue, no SLA risk | Slack notification, no paging | Triaged next business day, tracked in backlog |
Notification rules for P1–P3 are drawn from OneUptime’s 2026 on-call guide. MTTR targets are drawn from TaskCall’s 2026 incident management metrics guide.
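The severity matrix above can also be kept as a small lookup table, which is handy for reviewing whether past pages met their response expectations. The sketch below is an assumption-laden illustration rather than Datadog configuration: the `SEVERITY_MATRIX` structure and function names are hypothetical, and "business hours" is assumed to mean Monday to Friday, 09:00 to 17:00 local time.

```python
from datetime import datetime

# Values copied from the matrix above; None means "not specified".
SEVERITY_MATRIX = {
    "P1": {"channels": ("phone", "sms", "slack", "email"), "repeat_min": 5,
           "ack_min": 5, "business_hours_only": False},
    "P2": {"channels": ("sms", "slack", "email"), "repeat_min": 10,
           "ack_min": 10, "business_hours_only": False},
    "P3": {"channels": ("slack", "email"), "repeat_min": None,
           "ack_min": 30, "business_hours_only": True},
    "P4": {"channels": ("slack",), "repeat_min": None,
           "ack_min": None, "business_hours_only": False},
}


def is_business_hours(now):
    """Assumed definition: Monday to Friday, 09:00 to 17:00 local time."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def channels_to_notify(severity, now):
    """Channels to use right now for a given severity (may be empty)."""
    rule = SEVERITY_MATRIX[severity]
    if rule["business_hours_only"] and not is_business_hours(now):
        return ()
    return rule["channels"]


def ack_within_target(severity, minutes_to_ack):
    """True when the acknowledgment met the matrix's expectation."""
    target = SEVERITY_MATRIX[severity]["ack_min"]
    return target is None or minutes_to_ack <= target


saturday_3am = datetime(2025, 1, 4, 3, 0)
print(channels_to_notify("P1", saturday_3am))
print(channels_to_notify("P3", saturday_3am))
```

A P1 still reaches every channel at 3 a.m. on a Saturday, while a P3 waits for business hours, which is the behavior the matrix is meant to enforce.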
How Struct Integrates with Datadog OnCall
Struct connects to Datadog OnCall in under 10 minutes by authenticating three integration points: the Slack alerting channel, the Datadog API for metrics and monitors, and the code repository such as GitHub. No agent installation or infrastructure change is required.
After setup, Struct operates as a Slack-native conversational AI that sits on top of your existing observability stack. Engineers interact with it directly in the alert thread by asking follow-up questions, requesting additional log windows, or testing hypotheses without switching tools. Struct generates incident-specific dashboards, pulling charts from Datadog, traces from APM, and exceptions from Sentry into a single evidence view.
Teams can encode their existing on-call runbooks directly into Struct, so the AI follows company-specific investigation procedures for every alert type. When root cause is confirmed, Struct hands off the full context to a coding agent or generates a pull request, which completes the loop from page to patch.
Companies like FERMAT and Arcana use Struct to investigate thousands of alerts monthly, with large-scale customers achieving the triage improvements outlined earlier. The 2025 SRE Report found that engineers spend a median of 30% of their week on operational work, and Struct targets that block directly by automating the first-pass investigation that consumes much of that time.
Struct is SOC 2 and HIPAA compliant, so it fits fintech, healthtech, and other regulated Series A–C companies already operating inside Datadog’s ecosystem.
Frequently Asked Questions
What is the minimum tooling maturity required to use Struct with Datadog OnCall?
Struct works best when a team already has basic observability in place. That setup includes Datadog monitors firing into a Slack channel, some form of application logging such as Datadog Log Management, AWS CloudWatch, or GCP Logs, and a GitHub repository. If those three elements exist, Struct can begin producing automated investigations immediately after a 10-minute setup.
Teams without structured logging or trace IDs still receive value, but root-cause output becomes less precise because Struct’s analysis depends on the telemetry data available in the connected sources.
Can Struct be used if our logs cannot leave our VPC?
Struct currently requires API-level access to log and observability sources such as Datadog, AWS, and GCP to perform automated investigations. Organizations whose policies block any log data from leaving an internal network and that require fully on-premise deployment are not a fit for Struct’s standard architecture and should contact Struct about Enterprise deployment options. For most Seed-to-Series-C companies, SOC 2 and HIPAA compliance cover their requirements, and logs are accessed and processed ephemerally.
How does Struct help junior engineers handle on-call shifts safely?
Struct acts as an automated senior engineer for the first pass of every incident. When an alert fires, Struct delivers a fully contextualized investigation report that includes blast radius, correlated log evidence, root cause, and suggested fixes before the on-call engineer takes any manual action.
Junior engineers receive the same starting-point context that a senior engineer would spend 30–45 minutes assembling manually. Teams can also encode their internal runbooks directly into Struct, so the AI follows company-specific procedures for every alert type and reduces the tribal-knowledge gap that usually prevents new hires from owning on-call rotations.
How does Struct handle compliance requirements for regulated industries?
Struct is fully SOC 2 and HIPAA compliant. Log data accessed during an investigation is processed ephemerally and is not stored beyond the investigation session. This posture covers the requirements of most Series A–C companies in regulated verticals such as fintech and healthtech.
Enterprise teams with additional compliance needs, such as FedRAMP or on-premise deployment, should contact Struct directly to discuss sidecar or on-prem support options available under the Enterprise plan.
How long does it take to see value after connecting Struct to Datadog OnCall?
Setup takes under 10 minutes. Teams authenticate Slack, connect Datadog and a log source, and optionally link GitHub. Struct then begins auto-investigating the next alert that fires in the configured channel.
Most teams see their first automated root-cause report within the same day. The triage-time reduction of up to 80% mentioned earlier can appear from the first investigation, because Struct replaces the manual log-hunting phase entirely rather than adding a small assist.
Conclusion
Datadog OnCall provides a solid foundation for on-call scheduling, escalation policies, and incident declaration. The five-step workflow above, which covers schedules, monitor routing, acknowledgment, incident declaration, and postmortem, spans the full lifecycle of a production incident inside Datadog’s UI. The persistent gap is the manual triage work between alert and root cause, which the earlier sections showed takes 30–45 minutes per incident and scales poorly as teams grow.
Struct closes that gap by automating the entire first-pass investigation the moment a Datadog monitor fires. At the 80% triage reduction mentioned at the outset, a 45-minute investigation becomes a review of roughly 9 minutes. The result is lower MTTR, protected SLAs, reduced on-call burnout, and a rotation that junior engineers can own confidently from day one.