How to Improve AI SRE On-Call Investigations

Written by: Nimesh Chakravarthi, Co-founder & CTO, Struct

Key Takeaways

  • AI SRE on-call investigations often miss silent failures such as model drift. Traditional tools overlook them, and the resulting manual log hunts waste 45+ minutes per incident.
  • Use a 7-step playbook with AI observability, drift alerts, and autonomous agents to cut triage time by about 80% to under 5 minutes.
  • Deploy Slack-native AI agents that auto-correlate logs from Datadog, AWS, and Sentry, then generate root cause hypotheses before engineers wake up.
  • Encode custom runbooks and enable conversational queries to remove context-switching and speed up MTTR while reducing engineer burnout.
  • Automate your on-call runbook with Struct: 10-minute setup, SOC2 compliance, and investigations rated helpful 85-90% of the time at scale.

Pinpoint Your Current AI SRE On-Call Gaps

Start by auditing your current on-call reality. Track weekly triage hours, false positive rates, and escalation frequency. Most teams discover they spend 60-80% of incident time on manual investigation, which is exactly the bottleneck AI agents remove.

AI systems introduce failure modes that traditional SRE practices rarely catch. A 2025 McKinsey Global AI survey found that 51% of organizations using AI experienced at least one negative consequence from AI risks, and nearly one-third reported issues from AI inaccuracy. Your current Datadog and GCP monitoring likely catches infrastructure issues but misses semantic drift, hallucinations, and embedding anomalies.

To find these gaps in your own environment, build a baseline checklist: alert volume per week, average MTTR for P1 incidents, tools that require manual correlation (such as Datadog, Sentry, and AWS), and gaps in AI-specific monitoring such as concept drift detection. These metrics reveal whether your team operates reactively or proactively. Most teams find they wait for user complaints instead of catching degradation early.
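A baseline like this can be computed from a simple export of last week's incidents. The sketch below is illustrative only: the Incident fields and the baseline function are hypothetical names, and the sample data is invented for the example, not taken from any real team.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    severity: str                 # e.g. "P1" (hypothetical field names)
    opened: datetime
    resolved: datetime
    triage_minutes: float         # minutes spent on manual investigation
    false_positive: bool = False


def duration_minutes(incident: Incident) -> float:
    return (incident.resolved - incident.opened).total_seconds() / 60


def baseline(incidents: list[Incident]) -> dict:
    """Summarize a week of incidents into baseline on-call metrics."""
    real = [i for i in incidents if not i.false_positive]
    p1 = [duration_minutes(i) for i in real if i.severity == "P1"]
    total = sum(duration_minutes(i) for i in real)
    triage = sum(i.triage_minutes for i in real)
    return {
        "alert_volume": len(incidents),
        "false_positive_rate": (len(incidents) - len(real)) / len(incidents) if incidents else 0.0,
        "p1_mttr_minutes": mean(p1) if p1 else None,
        # Share of incident time spent on manual investigation.
        "triage_share": triage / total if total else 0.0,
    }


# Invented sample week, for illustration only.
t0 = datetime(2025, 1, 6, 3, 0)
week = [
    Incident("P1", t0, t0 + timedelta(minutes=60), triage_minutes=45),
    Incident("P1", t0, t0 + timedelta(minutes=40), triage_minutes=30),
    Incident("P2", t0, t0 + timedelta(minutes=10), triage_minutes=0, false_positive=True),
]
report = baseline(week)
print(report)
```

If triage_share lands in the 60-80% range described above, manual investigation is the part of the incident worth automating first.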

Connect Struct to Your Slack for Instant Auto-Triage

7 Steps to Automate AI SRE On-Call Investigations

Step 1: Deploy an AI-Focused Observability Stack

Layer AI-specific monitoring on top of your existing Datadog and AWS infrastructure. Track stability (successful model responses versus failures), latency, load, model drift, data drift, and cost metrics such as token usage. This creates a reliable data foundation for automated investigations.
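As a rough illustration of the stability, latency, and cost signals listed here, the following sketch aggregates per-call records in memory. ModelCallStats and its fields are hypothetical; in practice you would likely emit these values to your existing metrics backend instead of keeping them in a Python object, and the per-token price is an assumed input.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ModelCallStats:
    latencies_ms: list[float] = field(default_factory=list)
    failures: int = 0
    tokens: int = 0

    def record(self, latency_ms: float, ok: bool, tokens_used: int) -> None:
        """Record one model call: its latency, outcome, and token usage."""
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.failures += 1
        self.tokens += tokens_used

    def summary(self, usd_per_1k_tokens: float) -> dict:
        calls = len(self.latencies_ms)
        if calls == 0:
            return {"calls": 0}
        ranked = sorted(self.latencies_ms)
        # Nearest-rank 95th percentile.
        p95 = ranked[math.ceil(0.95 * calls) - 1]
        return {
            "calls": calls,
            "success_rate": (calls - self.failures) / calls,
            "p95_latency_ms": p95,
            "token_cost_usd": self.tokens / 1000 * usd_per_1k_tokens,
        }


stats = ModelCallStats()
stats.record(100, ok=True, tokens_used=500)
stats.record(200, ok=True, tokens_used=500)
stats.record(300, ok=False, tokens_used=0)
stats.record(400, ok=True, tokens_used=1000)
print(stats.summary(usd_per_1k_tokens=0.002))
```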

Step 2: Configure Targeted Drift Detection Alerts

Set up Prometheus-style alerts for AI-specific failures. Use Population Stability Index (PSI) thresholds: a value below 0.1 indicates negligible drift, 0.1 to 0.25 signals moderate drift that needs investigation, and a value above 0.25 indicates significant distributional change that requires immediate action. These thresholds give your team clear triggers instead of vague warnings.
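To make the thresholds concrete, here is a minimal PSI sketch using equal-width bins derived from the baseline sample. The bin count, the epsilon used to avoid log(0), and the function names are assumptions made for this example; production drift monitors often use quantile bins and larger samples.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` against the `expected` baseline.

    Both samples must be non-empty.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # fall back to 1.0 if the baseline is constant

    def fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def classify(value: float) -> str:
    """Map a PSI value onto the three bands described above."""
    if value < 0.1:
        return "negligible"
    if value <= 0.25:
        return "moderate"
    return "significant"


baseline_scores = [i / 100 for i in range(100)]
shifted_scores = [0.5 + i / 200 for i in range(100)]   # mass moved to the upper half
print(classify(psi(baseline_scores, baseline_scores)))
print(classify(psi(baseline_scores, shifted_scores)))
```

An alert rule would then fire on the "moderate" and "significant" bands, with only the latter paging someone.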

Step 3: Deploy an AI SRE Agent in Slack and PagerDuty

Connect an AI agent directly to your Slack and PagerDuty channels. When alerts fire, the agent pulls logs, correlates traces, and generates root cause hypotheses in under 5 minutes. This work happens before you even open your laptop.
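One small piece of that investigation, ranking which services produced the most errors just before the alert, might look like the sketch below. This is not how any particular agent works internally; the log record shape (ts, service, level) and the rank_suspects helper are hypothetical, and a real agent would pull the logs from your observability tools.

```python
from collections import Counter
from datetime import datetime, timedelta


def rank_suspects(logs: list[dict], alert_time: datetime,
                  window: timedelta = timedelta(minutes=5)) -> list[tuple[str, int]]:
    """Count ERROR logs per service in the window leading up to the alert."""
    start = alert_time - window
    errors = Counter(
        log["service"]
        for log in logs
        if log["level"] == "ERROR" and start <= log["ts"] <= alert_time
    )
    return errors.most_common()   # most error-prone service first


alert_time = datetime(2025, 1, 6, 3, 0)
logs = [
    {"ts": alert_time - timedelta(minutes=1), "service": "payments", "level": "ERROR"},
    {"ts": alert_time - timedelta(minutes=2), "service": "payments", "level": "ERROR"},
    {"ts": alert_time - timedelta(minutes=3), "service": "gateway", "level": "ERROR"},
    {"ts": alert_time - timedelta(minutes=4), "service": "gateway", "level": "INFO"},
    {"ts": alert_time - timedelta(minutes=30), "service": "search", "level": "ERROR"},
]
print(rank_suspects(logs, alert_time))
```

The ranked list is only a starting hypothesis: a service with many errors may be a victim of an upstream failure and not its cause.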

Step 4: Encode Your Team’s Custom Runbooks

Turn your team’s tribal knowledge into executable runbooks. Capture correlation IDs, specific log patterns, and clear escalation procedures. Auto-triggered diagnostic runbooks save 15-30 minutes of investigation time by collecting service health checks and log analysis before engineers acknowledge alerts. This structure makes every incident feel guided instead of ad hoc.
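A runbook can be encoded as plain data plus a list of diagnostic steps. The sketch below is one possible shape, not Struct's format: the Runbook class, the step functions, and the alert fields (title, service, correlation_id) are all hypothetical, and the steps only return descriptions instead of calling real tools.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Runbook:
    name: str
    alert_pattern: str                    # regex matched against the alert title
    steps: list[Callable[[dict], str]]    # each step returns a one-line finding
    escalate_to: str

    def matches(self, alert: dict) -> bool:
        return re.search(self.alert_pattern, alert["title"]) is not None

    def run(self, alert: dict) -> list[str]:
        return [step(alert) for step in self.steps]


# Placeholder steps; real ones would query health endpoints and log stores.
def check_health(alert: dict) -> str:
    return f"health check requested for {alert['service']}"


def search_logs(alert: dict) -> str:
    return f"log search for correlation_id={alert['correlation_id']}"


def triage(alert: dict, runbooks: list[Runbook]) -> dict:
    """Run the first matching runbook before anyone acknowledges the alert."""
    for runbook in runbooks:
        if runbook.matches(alert):
            return {"runbook": runbook.name,
                    "findings": runbook.run(alert),
                    "escalate_to": runbook.escalate_to}
    return {"runbook": None, "findings": [], "escalate_to": "primary on-call"}


runbooks = [Runbook("checkout-errors", r"HighErrorRate", [check_health, search_logs], "payments-team")]
alert = {"title": "HighErrorRate on checkout", "service": "checkout", "correlation_id": "abc123"}
print(triage(alert, runbooks))
```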

Step 5: Generate Dynamic Incident Dashboards

Once your runbooks execute automatically, you need clear visibility into what they uncover. Create incident-specific dashboards that merge Datadog metrics, AWS traces, and Sentry exceptions into unified timelines. This approach removes context-switching between several tools during outages and keeps everyone aligned on a single view.
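The core of a unified timeline is merging events from each tool into one list ordered by timestamp. A minimal sketch, assuming each source has already been fetched into a list of dictionaries with a ts field (the event shape and the unified_timeline helper are hypothetical, and no vendor API is called here):

```python
from datetime import datetime


def unified_timeline(**sources: list[dict]) -> list[dict]:
    """Tag each event with its source and sort everything by timestamp."""
    events = [
        {**event, "source": name}
        for name, source_events in sources.items()
        for event in source_events
    ]
    return sorted(events, key=lambda event: event["ts"])


timeline = unified_timeline(
    datadog=[{"ts": datetime(2025, 1, 6, 2, 58), "msg": "p95 latency above threshold"}],
    aws=[{"ts": datetime(2025, 1, 6, 2, 55), "msg": "deployment completed"}],
    sentry=[{"ts": datetime(2025, 1, 6, 2, 59), "msg": "TimeoutError in checkout"}],
)
for event in timeline:
    print(event["ts"].isoformat(), event["source"], event["msg"])
```

Seeing the deployment, the latency spike, and the exception in one ordered view is what removes the tab-switching during an outage.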

Step 6: Enable Conversational AI for Live Triage

Deploy Slack-native AI that responds to natural language queries such as “pull logs from 5 minutes prior” or “check if this impacts user segment X.” This capability replaces manual log hunting during active incidents and keeps engineers focused on decisions instead of data gathering.
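A production agent would interpret such queries with a language model. As a deterministic stand-in that shows the idea, the sketch below turns a phrase like "5 minutes prior" into a concrete log time window relative to the alert. The regular expression and the parse_log_window helper are assumptions for this example and cover only that one phrasing.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Matches phrases such as "5 minutes prior", "30 seconds before", "2 hours ago".
_WINDOW = re.compile(r"(\d+)\s*(second|minute|hour)s?\s+(?:prior|before|ago)", re.IGNORECASE)


def parse_log_window(query: str, alert_time: datetime) -> Optional[tuple[datetime, datetime]]:
    """Return (start, end) of the requested log window, or None if no window is found."""
    match = _WINDOW.search(query)
    if match is None:
        return None
    amount = int(match.group(1))
    unit = match.group(2).lower() + "s"          # "minute" -> "minutes"
    return alert_time - timedelta(**{unit: amount}), alert_time


alert_time = datetime(2025, 1, 6, 3, 0)
print(parse_log_window("pull logs from 5 minutes prior", alert_time))
print(parse_log_window("check if this impacts user segment X", alert_time))
```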

Step 7: Automate Handoffs to Code and Postmortems

After the team confirms root cause, automatically generate GitHub pull requests with suggested fixes or hand off context to coding agents for implementation. This automation shortens the gap between diagnosis and remediation and keeps follow-through consistent.
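The handoff can be as simple as assembling a pull request payload from the confirmed investigation. The sketch below only builds the dictionary and sends nothing. The field names (title, head, base, body, draft) mirror the GitHub REST API's "create a pull request" request body; the function name, incident ID format, and branch name are hypothetical.

```python
def build_pr_payload(incident_id: str, root_cause: str, evidence: list[str],
                     fix_branch: str, base: str = "main") -> dict:
    """Assemble a draft pull request body from a confirmed root cause."""
    body_lines = [
        f"Incident: {incident_id}",
        f"Confirmed root cause: {root_cause}",
        "",
        "Evidence:",
        *[f"- {item}" for item in evidence],
    ]
    return {
        "title": f"Fix for incident {incident_id}",
        "head": fix_branch,      # branch containing the suggested fix
        "base": base,            # branch to merge into
        "body": "\n".join(body_lines),
        "draft": True,           # keep a human in the loop before merge
    }


payload = build_pr_payload(
    incident_id="INC-42",                      # invented example values
    root_cause="connection pool exhausted after deploy",
    evidence=["TimeoutError spike at 02:59", "pool size lowered in last deploy"],
    fix_branch="fix/inc-42-pool-size",
)
print(payload["body"])
```

Opening the pull request as a draft keeps the suggested fix reviewable and avoids merging anything automatically.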

Approach | Setup Time | Accuracy | Triage Reduction
Open-Source Agents | Days | Varies | Varies
Struct | 10 minutes | Varies | Varies
Manual Investigation | 42 hrs | 91.6% | N/A

See How Struct Executes These 7 Steps Automatically

Wire AI Agents into Your Existing Incident Workflow

Effective AI SRE ties directly into the tools your team already uses. PagerDuty alerts trigger Slack notifications, which activate AI agents that query Datadog and Sentry, then post findings back to Slack with GitHub pull request links for fixes. This flow keeps the entire incident lifecycle inside your communication hub.

Struct’s edge comes from being Slack-native, so you avoid tool switching during 3 AM incidents. Datadog’s Bits AI SRE supports direct triage actions such as sending Slack messages, creating incidents, and generating Jira tickets with prefilled context to reduce context switching. Struct extends this pattern across your full stack and runbooks.

Design unified handoff workflows where AI investigations automatically populate post-mortem templates and create follow-up tickets. This approach closes the loop from detection to resolution while keeping manual overhead low.
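Populating a post-mortem from an investigation is mostly template filling. A minimal sketch using Python's standard string.Template follows; the template sections and the keys of the investigation dictionary are assumptions for this example, so adapt them to your own post-mortem format.

```python
from string import Template

POSTMORTEM = Template(
    "Incident: $title\n"
    "Root cause (hypothesis): $root_cause\n"
    "Timeline:\n$timeline\n"
    "Follow-up tickets:\n$follow_ups\n"
)


def bullets(items: list[str]) -> str:
    return "\n".join(f"  - {item}" for item in items)


def render_postmortem(investigation: dict) -> str:
    """Fill the post-mortem template from an investigation summary."""
    return POSTMORTEM.substitute(
        title=investigation["title"],
        root_cause=investigation["root_cause"],
        timeline=bullets(investigation["timeline"]),
        follow_ups=bullets(investigation["follow_ups"]),
    )


doc = render_postmortem({
    "title": "Checkout latency spike",          # invented example values
    "root_cause": "connection pool exhausted after deploy",
    "timeline": ["02:55 deploy completed", "02:58 latency alert fired"],
    "follow_ups": ["Add pool saturation alert", "Review deploy checklist"],
})
print(doc)
```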

Measure AI SRE Impact and Improve Over Time

Track four key metrics: triage time reduction with a target of about 80%, MTTR improvement, investigation accuracy rate, and engineer satisfaction scores. Struct’s fintech customers achieved the 80% reduction mentioned earlier, with real-world validation across multiple incident types.
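The four metrics can be rolled into a small weekly scorecard. In this sketch the function names, the input shape, and the sample numbers are all invented for illustration; "helpful" stands for investigations your engineers marked as useful.

```python
from statistics import mean


def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) * 100 / before


def weekly_scorecard(triage_before_min: float, triage_after_min: float,
                     mttr_before_min: float, mttr_after_min: float,
                     helpful: int, total: int,
                     satisfaction_scores: list[int]) -> dict:
    return {
        "triage_reduction_pct": reduction_pct(triage_before_min, triage_after_min),
        "mttr_improvement_pct": reduction_pct(mttr_before_min, mttr_after_min),
        "accuracy_rate": helpful / total if total else None,
        "avg_satisfaction": mean(satisfaction_scores) if satisfaction_scores else None,
    }


# Invented sample week: triage drops from 25 to 5 minutes, i.e. an 80% reduction.
card = weekly_scorecard(25, 5, 120, 60, helpful=17, total=20, satisfaction_scores=[4, 5, 3])
print(card)
```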

Set weekly review cadences and run chaos engineering simulations to test your automated workflows. Mature implementations achieve 75-82% MTTR reductions over 12-20 months through iterative improvements. These reviews keep your AI agents aligned with evolving systems and failure modes.

Common Pitfalls, Proven Practices, and AI SRE Tool Comparison

Avoid reactive AI usage such as manually prompting ChatGPT during incidents. That pattern hits context limits, depends on copy-paste workflows, and needs constant human guidance. Proactive agents instead start investigations automatically when alerts fire and arrive with context already assembled.

Follow a few best practices to keep your rollout safe and effective. Ensure SOC2 compliance for sensitive data, start junior engineers with AI-assisted investigations for faster onboarding, and maintain a composable architecture that adapts to your specific tech stack.

Tool | Setup Time | Accuracy | Triage Reduction
Struct | 10 minutes | Varies | Varies
Cleric.ai | Weeks | Varies | Varies
Claude/ChatGPT | Manual | Varies | Varies
Open-source | Days | Varies | Varies

Accelerate AI SRE On-Call with Struct

These 7 steps turn reactive fire-fighting into proactive AI-powered investigations. AWS DevOps Agent customers report faster investigations and lower MTTR. Autonomous agents now handle the heavy lifting while your team focuses on high-value decisions.

Stop 3AM Log Hunts and Set Up Struct in 10 Minutes

FAQ

What is an AI SRE agent and how does it differ from open-source options?

An AI SRE agent automatically investigates alerts by pulling logs, correlating traces, and generating root cause hypotheses without human intervention. Unlike open-source tools that require days of setup and manual configuration, Struct provides a proactive platform that deploys in 10 minutes and learns your specific architecture patterns for increasingly accurate investigations over time.

How long does it take to set up automated AI SRE investigations?

Struct connects to your existing Slack, GitHub, and observability tools in under 10 minutes. You authenticate your integrations, configure alert channels, and then start receiving automated investigation reports immediately. You avoid lengthy enterprise deployments and complex indexing.

Is this compliant with HIPAA and SOC2 requirements?

Yes, Struct is fully SOC2 Type II and HIPAA compliant. Your logs are processed ephemerally without persistent storage, which meets strict enterprise security requirements while still enabling automated investigations.

What if our logging and telemetry infrastructure needs improvement?

AI agents need basic observability foundations such as Datadog metrics, structured logs with correlation IDs, and alert triggers. If your system lacks fundamental logging or trace IDs, focus on improving data quality first. Struct works best with teams that already use modern observability platforms.

Can junior engineers safely handle on-call with AI assistance?

Yes. AI agents provide strong starting points for every alert, including context, timeline, and suggested next steps. This support allows newer engineers to take on-call shifts confidently while they learn your system architecture through guided investigations.

How does this compare to Datadog’s built-in AI features?

Datadog’s Bits AI provides investigation capabilities inside their platform. Struct adds seamless Slack integration, custom runbook encoding, and automated handoffs to GitHub for end-to-end incident resolution without leaving your communication hub. Together, these features create a more complete incident workflow.