How to Set Up Application Performance Monitoring (APM)

Written by: Nimesh Chakravarthi, Co-founder & CTO, Struct

Key Takeaways

  • APM setup in 2026 starts with auditing existing metrics, traces, and logs to find gaps before you add new agents.

  • Standardize on OpenTelemetry and W3C Trace Context so every request carries correlation IDs across services, queues, and logs.

  • Replace noisy static thresholds with anomaly-based alerts that track p95 latency, error rate, and saturation.

  • Connect GitHub and your observability tools so Struct can correlate deploys, traces, logs, and exceptions into a single Slack investigation.

  • Struct turns your on-call runbook into an automated workflow so each alert triggers a root-cause investigation in Slack within minutes.

Step 1: Map Your Current Observability Stack

Start by auditing the telemetry your systems already emit. Every effective APM setup rests on three pillars: metrics (p50/p95/p99 latency, error rate, requests per second, CPU and memory saturation), traces (end-to-end request paths across microservices), and logs (structured event records with correlation IDs). A practical APM setup in 2026 begins by defining SLIs and SLOs, for example p95 latency under 300 ms and error rate below 1%.
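As a rough illustration of checking a window of requests against those example SLOs, the sketch below computes a nearest-rank p95 and an error rate. The request shape (`latencyMs`, `ok`) and the function names are assumptions made for this example; in practice your metrics backend computes these values for you.

```javascript
// Nearest-rank percentile over an array of latency samples (milliseconds).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Compare one window of requests against the example SLOs from the text.
// The { latencyMs, ok } request shape is hypothetical.
function evaluateSlo(requests, slo = { p95LatencyMs: 300, maxErrorRate: 0.01 }) {
  const p95 = percentile(requests.map((r) => r.latencyMs), 95);
  const errorRate = requests.filter((r) => !r.ok).length / requests.length;
  return {
    p95,
    errorRate,
    latencyOk: p95 < slo.p95LatencyMs, // "p95 latency under 300 ms"
    errorsOk: errorRate < slo.maxErrorRate, // "error rate below 1%"
  };
}

// Usage: 100 requests, 4 of them slow, 1 of them failed.
const sample = Array.from({ length: 100 }, (_, i) => ({
  latencyMs: i < 96 ? 120 : 450,
  ok: i !== 0,
}));
console.log(evaluateSlo(sample));
```

Note that an error rate of exactly 1% fails a "below 1%" objective here, so decide up front whether your SLO boundaries are inclusive.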

List every service, its language runtime, and its current data source. Note gaps, because services emitting no traces are blind spots. Once you see those gaps, you can connect observability tools that correlate data across them. Struct connects to Datadog, Sentry, AWS CloudWatch, GCP Logs, Azure Traces, Grafana, Prometheus/Loki, and GitHub in under 10 minutes, which covers the integration surface you just mapped.

    # Inventory template: fill one row per service
    service: checkout-api
    language: Node.js
    metrics_source: Datadog
    trace_source: none            # gap: instrument in Step 2
    log_source: AWS CloudWatch
    correlation_id_field: x-request-id
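Once the inventory has more than a handful of rows, a small script can list the gaps for you. This is a minimal sketch that assumes the inventory has been loaded into an array of objects with the template's field names. The `auth-service` row and the `findGaps` helper are hypothetical.

```javascript
// Inventory rows mirror the template fields; auth-service is a made-up second row.
const inventory = [
  {
    service: 'checkout-api',
    language: 'Node.js',
    metrics_source: 'Datadog',
    trace_source: 'none',
    log_source: 'AWS CloudWatch',
    correlation_id_field: 'x-request-id',
  },
  {
    service: 'auth-service',
    language: 'Go',
    metrics_source: 'Prometheus',
    trace_source: 'Datadog',
    log_source: 'none',
    correlation_id_field: '',
  },
];

const SIGNAL_FIELDS = ['metrics_source', 'trace_source', 'log_source', 'correlation_id_field'];

// Return one entry per missing signal, treating empty values and "none" as gaps.
function findGaps(rows) {
  return rows.flatMap((row) =>
    SIGNAL_FIELDS
      .filter((field) => !row[field] || row[field] === 'none')
      .map((field) => ({ service: row.service, missing: field }))
  );
}

console.log(findGaps(inventory));
```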

Step 2: Install APM Agents with OpenTelemetry

OpenTelemetry has been adopted by nearly half of all organizations as the vendor-neutral instrumentation standard, with another 26% currently evaluating it. Use OTel SDKs where possible, because they keep your backend choice reversible without re-instrumenting code later.

For a Node.js service exporting to Datadog or any OTLP-compatible backend:

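    npm install @opentelemetry/api @opentelemetry/sdk-node \
      @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http

    # Environment variables (set in your deployment manifest)
    OTEL_SERVICE_NAME=checkout-api
    OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
    # Assumption: a Datadog Agent or OpenTelemetry Collector with OTLP/HTTP ingest
    # enabled is reachable on port 4318; adjust the host for your deployment
    OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
    # Read by the Datadog Agent, not by the OpenTelemetry SDK
    DD_API_KEY=<your_key>

    # Load the auto-instrumentation before your application code
    node --require @opentelemetry/auto-instrumentations-node/register app.js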

For teams already on Datadog, the dd-trace auto-instrumentation library requires only two environment variables. For AWS-native stacks, the AWS Distro for OpenTelemetry (ADOT) collector runs as a sidecar and forwards traces to X-Ray or any OTLP endpoint. Whichever approach you choose, budget for some overhead. APM agents typically introduce 5-10% CPU overhead and up to 10% memory overhead in production, and the exact amount depends on your sampling rate. Most services absorb that cost comfortably, but measure it on latency-sensitive paths before rolling out widely.

Step 3: Enable Distributed Tracing and Correlation IDs

Distributed tracing and service dependency maps let teams follow a request end-to-end across microservices. That visibility is essential when a single user request may pass through API gateways, authentication services, backend microservices, message queues, databases, and external APIs. Struct uses these trace links to auto-correlate events across your stack without manual log-grepping.

Standardize trace propagation using W3C Trace Context headers across all services and queues. Inject the traceparent header at every service boundary:

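    // Express.js middleware: propagate W3C Trace Context
    const crypto = require('crypto');

    // traceparent format: version-traceid-parentid-flags
    const generateTraceParent = () =>
      `00-${crypto.randomBytes(16).toString('hex')}-${crypto.randomBytes(8).toString('hex')}-01`;
    const parseTraceId = (traceParent) => traceParent.split('-')[1];

    app.use((req, res, next) => {
      const traceParent = req.headers['traceparent'] || generateTraceParent();
      res.setHeader('traceparent', traceParent);
      req.correlationId = parseTraceId(traceParent);
      // Log every request with the correlation ID
      logger.info({ correlationId: req.correlationId, path: req.path });
      next();
    });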
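Propagation also has to cover outgoing calls: the downstream request keeps the same trace ID but carries a fresh parent (span) ID. The sketch below shows one way to validate an incoming header and derive the outgoing one using only Node's standard library. The helper names are hypothetical, and it accepts only version `00` headers for simplicity. If you use the OpenTelemetry SDK, its built-in propagator handles this for you.

```javascript
const crypto = require('node:crypto');

// traceparent: version "00", 32-hex trace-id, 16-hex parent-id, 2-hex flags
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse and validate a traceparent header; returns null when it is unusable.
function parseTraceParent(header) {
  const match = TRACEPARENT_RE.exec(header || '');
  if (!match) return null;
  const [, traceId, parentId, flags] = match;
  // All-zero trace or parent IDs are invalid under W3C Trace Context
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, flags };
}

// Build the header for an outgoing call: same trace-id, new span id.
// Starts a new trace when the incoming header is missing or malformed.
function childTraceParent(incomingHeader) {
  const parent = parseTraceParent(incomingHeader);
  const traceId = parent ? parent.traceId : crypto.randomBytes(16).toString('hex');
  const flags = parent ? parent.flags : '01';
  const spanId = crypto.randomBytes(8).toString('hex');
  return `00-${traceId}-${spanId}-${flags}`;
}

// Usage: attach the result to the headers of the downstream request.
const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
console.log(childTraceParent(incoming));
```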

Once correlation IDs flow through logs and traces, Struct can automatically link a Sentry exception to the Datadog trace and the CloudWatch log line that caused it, which removes the manual pivot that consumes 20 minutes of every 3 a.m. investigation. Put that correlation to work immediately by connecting Struct to your observability stack.
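To make the idea of correlation concrete, the sketch below groups events from different tools by a shared trace ID and orders each group by time. It illustrates the join that correlation IDs make possible and is not Struct's implementation. The event shape (`source`, `traceId`, `timestamp`) is hypothetical.

```javascript
// Group events from several tools by trace ID, oldest first within each group.
function correlateByTraceId(events) {
  const byTrace = new Map();
  for (const event of events) {
    if (!event.traceId) continue; // uncorrelated events indicate a telemetry gap
    if (!byTrace.has(event.traceId)) byTrace.set(event.traceId, []);
    byTrace.get(event.traceId).push(event);
  }
  for (const group of byTrace.values()) {
    group.sort((a, b) => a.timestamp - b.timestamp);
  }
  return byTrace;
}

// Usage with made-up events from three sources.
const events = [
  { source: 'sentry', traceId: 'abc123', timestamp: 1003, message: 'TypeError' },
  { source: 'cloudwatch', traceId: 'abc123', timestamp: 1001, message: 'payment call failed' },
  { source: 'datadog', traceId: 'def456', timestamp: 1002, message: 'slow span' },
  { source: 'cloudwatch', timestamp: 1000, message: 'no trace id on this line' },
];
console.log(correlateByTraceId(events).get('abc123').map((e) => e.source));
```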

Step 4: Configure High-Signal Alerts for On-Call

Symptom-based alerting in 2026 focuses on p95 latency, error surges, and saturation metrics rather than broad threshold noise. Static thresholds generate the alert fatigue that makes on-call rotations unsustainable. Modern APM tools use AI-assisted root cause analysis that detects anomalies, reduces alert noise, and correlates logs, traces, and metrics.

Here is a high-signal Datadog monitor query for p95 latency:

    # Datadog monitor: alert on p95 latency anomaly, not fixed threshold
    avg(last_5m):anomalies(
      p95:trace.express.request{env:production,service:checkout-api},
      "agile", 3
    ) >= 1

    # "agile" is the anomaly algorithm; 3 sets the width of the expected band
    # (bounds), not a count of consecutive breaches
    # Monitor options that limit noise
    notify_no_data: false
    evaluation_delay: 60
    renotify_interval: 0
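The difference between a static threshold and an anomaly band is easier to see in code. The following is a deliberately simplified sketch, not Datadog's "agile" algorithm: it builds a band from the mean and standard deviation of recent history and pages only when several consecutive points fall above it. The function names and default parameters are assumptions for illustration.

```javascript
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function stddev(values) {
  const m = mean(values);
  return Math.sqrt(mean(values.map((v) => (v - m) ** 2)));
}

// Page only when the last `consecutive` points all exceed
// baseline mean + `deviations` * baseline standard deviation.
// The baseline is every point before the most recent `consecutive` ones.
function shouldPage(series, { deviations = 3, consecutive = 3 } = {}) {
  if (series.length <= consecutive) return false;
  const baseline = series.slice(0, -consecutive);
  const recent = series.slice(-consecutive);
  const upperBound = mean(baseline) + deviations * stddev(baseline);
  return recent.every((value) => value > upperBound);
}

// Usage: p95 latency samples in milliseconds, one per evaluation interval.
const history = [100, 110, 90, 105, 95, 100, 110, 90];
console.log(shouldPage([...history, 300, 310, 305])); // sustained regression
console.log(shouldPage([...history, 100, 300, 100])); // single transient blip
```

Requiring consecutive breaches trades a slightly later page for far fewer false alarms, which is usually the right trade for on-call.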

Point your Datadog and Sentry alert webhooks at the Slack channel where Struct listens. From that moment, every alert Struct receives triggers an automatic investigation, without a human prompt. With tracing and correlation in place, these alerts kick off structured investigations instead of noisy pages, and Struct separates transient blips from customer-impacting outages before your phone rings.

Step 5: Connect GitHub and Encode Runbooks in Struct

Struct integrates with tools like Slack, GitHub, and observability platforms to enrich every investigation with code context. Connect your GitHub organization in Struct settings so Struct can correlate recent commits and pull requests with the incident timeline and surface the deployment that introduced a regression.

    # Struct runbook input: paste directly into Struct's runbook editor
    alert_name: checkout-api-high-latency
    correlation_id_field: x-request-id
    investigation_steps:
      - Check Datadog APM for p95 latency by endpoint
      - Pull CloudWatch logs filtered by correlation ID
      - Cross-reference Sentry for concurrent exceptions
      - Review last 3 deploys in GitHub for checkout-api
    escalation_contact: oncall-payments@company.com

Struct follows these exact steps automatically when the alert fires. Senior engineers encode their tribal knowledge once, and every on-call engineer, including new hires, benefits from it on every incident. Encode your team’s expertise into Struct’s runbook system and remove the knowledge bottleneck from incident response.
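The deploy-review step in a runbook like this boils down to a time-window filter over recent merges. The sketch below shows that idea under stated assumptions: the deploy records (`sha`, `mergedAt`) and the `suspectDeploys` helper are hypothetical, and this is not how Struct itself is implemented.

```javascript
// Return deploys merged within `windowMinutes` before the alert, newest first.
function suspectDeploys(deploys, alertTime, windowMinutes = 60) {
  const alertMs = Date.parse(alertTime);
  const windowMs = windowMinutes * 60 * 1000;
  return deploys
    .filter((deploy) => {
      const mergedMs = Date.parse(deploy.mergedAt);
      return mergedMs <= alertMs && alertMs - mergedMs <= windowMs;
    })
    .sort((a, b) => Date.parse(b.mergedAt) - Date.parse(a.mergedAt));
}

// Usage with made-up deploy records for checkout-api.
const deploys = [
  { sha: 'a1b2c3', mergedAt: '2026-01-10T01:00:00Z' }, // too old
  { sha: 'd4e5f6', mergedAt: '2026-01-10T02:30:00Z' },
  { sha: '0718ab', mergedAt: '2026-01-10T02:50:00Z' },
  { sha: 'cdef01', mergedAt: '2026-01-10T03:05:00Z' }, // after the alert
];
console.log(suspectDeploys(deploys, '2026-01-10T03:00:00Z').map((d) => d.sha));
```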

Step 6: Validate the Full Pipeline in Production

Run a controlled synthetic incident to confirm the full pipeline works end to end. Trigger an intentional error in a non-critical service, then verify that Struct’s investigation appears in Slack within the expected timeframe.

    // Synthetic incident trigger (Node.js / Express)
    app.get('/test-error', (req, res) => {
      throw new Error('Synthetic APM validation error (safe to ignore)');
    });

    // Expected Struct output in Slack:
    // ✅ Root cause: Unhandled exception in /test-error
    // ✅ Blast radius: 0 affected users (non-critical path)
    // ✅ Correlated trace ID: abc123 → CloudWatch log line 4821
    // ✅ Suggested fix: Add error boundary middleware

Confirm that the Struct dashboard shows the correlated trace, the relevant log lines, and a suggested fix. If the investigation does not appear, check that your Slack channel is configured as a Struct trigger source and that your observability integrations are authenticated. Once validated, MTTR for real incidents drops quickly, and large-scale customers report an 80% reduction in triage time.

Step 7: Measure Incident Outcomes and Iterate

Track three metrics from day one: triage time (time from alert fire to root cause identified), MTTR (time from alert fire to resolution), and onboarding speed (time for a new engineer to handle on-call independently). Weekly APM reviews should tackle top performance regressions and closely monitor new deployments, using these metrics as the guide.
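If your incident tracker can export timestamps, the first two metrics are straightforward to compute. This is a minimal sketch that assumes each incident record carries three ISO 8601 timestamps. The field names (`alertFiredAt`, `rootCauseAt`, `resolvedAt`) are hypothetical.

```javascript
function minutesBetween(startIso, endIso) {
  return (Date.parse(endIso) - Date.parse(startIso)) / 60000;
}

function average(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Mean triage time (alert to root cause) and MTTR (alert to resolution), in minutes.
function summarizeIncidents(incidents) {
  return {
    meanTriageMinutes: average(
      incidents.map((i) => minutesBetween(i.alertFiredAt, i.rootCauseAt))
    ),
    mttrMinutes: average(
      incidents.map((i) => minutesBetween(i.alertFiredAt, i.resolvedAt))
    ),
  };
}

// Usage with two made-up incidents.
const incidents = [
  {
    alertFiredAt: '2026-01-10T03:00:00Z',
    rootCauseAt: '2026-01-10T03:10:00Z',
    resolvedAt: '2026-01-10T03:30:00Z',
  },
  {
    alertFiredAt: '2026-01-12T14:00:00Z',
    rootCauseAt: '2026-01-12T14:20:00Z',
    resolvedAt: '2026-01-12T15:00:00Z',
  },
];
console.log(summarizeIncidents(incidents));
```

Capture a baseline with the same calculation before rollout so later reviews compare like with like.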

    # Monthly review checklist
    - [ ] Triage time baseline vs. current (target: <10 min)
    - [ ] MTTR baseline vs. current (target: >50% reduction)
    - [ ] Alert volume: signal-to-noise ratio improving?
    - [ ] New engineers handling on-call without escalation?
    - [ ] Runbooks updated for new services added this month?
    - [ ] Tail-based sampling enabled for error and slow traces?

Tail-based sampling is recommended for troubleshooting and incident response because it retains traces after the request outcome is known, preserving rare critical signals such as errors and slow calls. Enable it for your highest-severity services after your first monthly review so you keep the most valuable traces for future investigations.
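The decision logic behind tail-based sampling can be sketched in a few lines: wait until the trace is complete, always keep errors and slow requests, and keep only a small deterministic share of everything else. The trace shape, thresholds, and `keepTrace` helper below are assumptions for illustration. In production this policy normally lives in your collector's sampling configuration rather than in application code.

```javascript
// Decide whether to retain a completed trace.
// Assumed trace shape: { traceId: hex string, durationMs, spans: [{ error }] }
function keepTrace(trace, { slowMs = 300, baselineRate = 0.05 } = {}) {
  if (trace.spans.some((span) => span.error)) return true; // always keep errors
  if (trace.durationMs >= slowMs) return true; // always keep slow requests
  // Deterministic baseline sample: map the last 8 hex digits of the trace ID to [0, 1)
  const bucket = parseInt(trace.traceId.slice(-8), 16) / 0x100000000;
  return bucket < baselineRate;
}

// Usage with made-up traces.
const failed = { traceId: 'aa11ffffffff', durationMs: 40, spans: [{ error: true }] };
const slow = { traceId: 'bb22ffffffff', durationMs: 900, spans: [{ error: false }] };
const healthy = { traceId: 'cc33ffffffff', durationMs: 40, spans: [{ error: false }] };
console.log(keepTrace(failed), keepTrace(slow), keepTrace(healthy));
```

Hashing on the trace ID rather than calling a random number generator means every service that sees the same trace makes the same keep-or-drop decision.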

Turning APM Data into Automated Investigations

A correctly configured APM stack generates the telemetry, and Struct converts that telemetry into answers. In 2026, observability platforms are integrating agentic AI that ingests observability data to analyze logs, extract patterns, find anomalies, and improve mean time to repair, and Struct provides that layer for Seed-to-Series C engineering teams.

When an alert fires, Struct automatically root-causes engineering alerts by pulling and analyzing metrics, logs, traces, monitors, and code. The dynamically generated dashboard appears in Slack within minutes and contains a unified timeline, blast radius summary, correlated evidence, and suggested fixes. Engineers ask follow-up questions directly in the Slack thread, with no tool-switching, no prompt engineering, and no copy-pasting logs into a generic AI. Struct is SOC 2 and HIPAA compliant, and setup takes under 10 minutes, so Struct can handle your next investigation before you finish reading the alert.

FAQ

What is the minimum observability maturity required before Struct adds value?

Struct delivers the most value when a team already emits structured logs with correlation IDs, sends alerts to a Slack channel or PagerDuty, and has at least one observability tool connected, such as Datadog, Sentry, AWS CloudWatch, or GCP Logs. If your services produce no logs and fire no alerts, Struct cannot infer system state from code analysis alone. The practical minimum is alerts firing somewhere, logs queryable somewhere, and a GitHub repository connected. Most Seed-to-Series C teams already meet this bar.

How does Struct handle data residency and compliance requirements?

Struct is fully SOC 2 and HIPAA compliant. Logs and telemetry are accessed and processed ephemerally, and they are not stored permanently on Struct infrastructure. For the vast majority of U.S.-based Seed-to-Series C companies, this compliance posture satisfies security review. If your organization mandates that zero logs leave your VPC and requires full on-premise deployment, Struct’s Enterprise tier includes sidecar and on-prem support options, and you can contact the team to discuss your specific architecture.

What happens when Struct receives an alert but telemetry is sparse or malformed?

Struct investigation quality scales with the quality of the telemetry it receives. If logs are unstructured, correlation IDs are missing, or trace coverage is low, Struct surfaces whatever signal is available and flags the gaps explicitly in the investigation output. This still helps, because it identifies exactly which services need better instrumentation. Teams with sparse telemetry should treat the first two weeks with Struct as a telemetry audit, where Struct investigation gaps become a prioritized instrumentation backlog.

Can Struct follow our team’s specific on-call procedures rather than a generic investigation flow?

Yes. Struct’s composable runbook system lets teams paste their internal on-call runbooks directly into Struct configuration. You can specify correlation ID formats, define which log sources to query first, set escalation contacts, and configure composable widgets that guarantee specific charts always appear for specific alert types. The AI follows your operational procedures and encodes the institutional knowledge of your most experienced engineers into every automated investigation.

How quickly can a new engineer take on-call after Struct is set up?

Most teams report that new engineers can handle on-call independently within their first week after Struct is deployed. Because Struct performs the first-pass investigation automatically, correlating logs, mapping the blast radius, identifying root cause, and suggesting fixes, new engineers review a structured summary instead of starting from a blank terminal. They can ask Struct follow-up questions in Slack without needing deep systemic context. This removes the dependency on senior engineers for every incident and removes the onboarding bottleneck that keeps experienced engineers stuck in permanent on-call rotation.