Written by: Nimesh Chakravarthi, Co-founder & CTO, Struct
Key Takeaways for Your SigNoz Setup
- Manual log correlation across services at 3 a.m. consumes 40–60% of total MTTR, so faster correlation gives on-call teams the biggest win.
- A six-step checklist converts a raw Kubernetes cluster into a production-grade SigNoz log analysis pipeline with clear owners and outputs for each step.
- Deploying the OpenTelemetry Collector as a DaemonSet with early filtering and structured JSON logging creates queryable fields and cuts noise before data reaches SigNoz.
- Trace-log correlation via injected trace_id values lets you jump from any log line to the full distributed trace in SigNoz with one click, removing manual search.
- Connect SigNoz to Struct to receive automated root-cause dashboards in Slack before you finish reading this article.
Kubernetes DaemonSet Setup for the OpenTelemetry Collector
Goal: Collect container logs from every node, enrich them with Kubernetes metadata, and drop noise before it reaches SigNoz.
Owner: Platform or SRE team.
Required inputs: Kubernetes cluster with RBAC, SigNoz OTLP endpoint, and a namespace for the Collector.
Expected outputs: Structured log stream tagged with k8s.pod.name, k8s.namespace.name, and k8s.deployment.name.
Production trade-offs: A DaemonSet adds one Collector pod per node. The k8sattributes processor enriches logs with Kubernetes metadata by querying the Kubernetes API, which requires a ClusterRole. Bind OTLP receivers to the pod IP via the Downward API rather than 0.0.0.0 to limit network exposure.
```yaml
receivers:
  filelog:
    include: [/var/log/containers/*.log]
    operators:
      - type: json_parser

processors:
  k8sattributes:
    auth_type: serviceAccount
  filter/drop_debug:
    logs:
      exclude:
        match_type: strict
        severity_texts: ["DEBUG"]
  batch:
    send_batch_size: 512
    timeout: 5s

exporters:
  otlphttp:
    endpoint: https://<signoz-host>:4318
    compression: gzip

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [k8sattributes, filter/drop_debug, batch]
      exporters: [otlphttp]
```
Place the batch processor after any filter processors so batching runs only on the final retained data set. Configure retry_on_failure with exponential backoff (initial_interval: 5s, max_interval: 30s) for transient backend errors.
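To get a feel for what those retry settings imply, the sketch below computes the wait before each retry with initial_interval at 5s and max_interval at 30s. The growth multiplier of 1.5 and the function name backoff_schedule are assumptions for illustration only. Real retry logic typically adds random jitter as well, which is left out here so the numbers stay deterministic.

```python
def backoff_schedule(initial=5.0, max_interval=30.0, multiplier=1.5, attempts=7):
    """Return the wait in seconds before each retry, without jitter."""
    waits = []
    wait = initial
    for _ in range(attempts):
        # Each wait grows by the multiplier but never exceeds max_interval.
        waits.append(min(wait, max_interval))
        wait *= multiplier
    return waits


schedule = backoff_schedule()
print(schedule)       # per-retry waits in seconds
print(sum(schedule))  # total time spent waiting across all retries
```

Under these assumptions the cap is reached on the sixth retry, so a backend outage longer than about two minutes needs a queue or a larger retry budget rather than a longer interval.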
With the Collector streaming raw logs from every node into SigNoz, you can now focus on making those logs easy to query through consistent JSON structure.
Structured JSON Logging and SigNoz Logs Pipelines
Goal: Normalize every log line into a queryable JSON schema inside SigNoz.
Owner: Application engineers, with SRE review.
Required inputs: Application logging library (e.g., zap, logrus, winston), SigNoz Logs Pipelines UI.
Expected outputs: Indexed fields for level, service.name, trace_id, span_id, http.status_code, and duration_ms.
Production trade-offs: Structured JSON enables easier querying and filtering by fields such as endpoint, error code, or latency. Unstructured logs rely on fragile regex parsing that breaks when formats change.
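As a minimal sketch of what structured output can look like on the application side, the example below uses only Python's standard logging module to emit one JSON object per line with the fields listed under expected outputs. The class name JsonFormatter, the service name checkout, and the sample values are hypothetical. A library such as zap, logrus, or winston would produce an equivalent shape with its own configuration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    # Optional fields copied from the record when the caller supplies them.
    EXTRA_FIELDS = ("trace_id", "span_id", "http.status_code", "duration_ms")

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        entry = {
            # ISO 8601 with a UTC offset, which is valid RFC3339.
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service.name": self.service_name,
            "message": record.getMessage(),
        }
        for field in self.EXTRA_FIELDS:
            value = record.__dict__.get(field)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service_name="checkout"))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Hypothetical request outcome; trace and span IDs are sample hex strings.
logger.error(
    "payment failed",
    extra={
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "span_id": "00f067aa0ba902b7",
        "http.status_code": 502,
        "duration_ms": 183,
    },
)
```

Keeping the field names identical across services is what makes a single saved query work for every team.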
Use SigNoz Logs Pipelines with these steps:
- Add a JSON Parser processor to extract top-level fields.
- Add a Severity Parser to map level strings to OpenTelemetry severity numbers.
- Add a Timestamp Parser targeting the timestamp field in RFC3339 format.
- Add a Move processor to promote trace_id and span_id to log record attributes.
- Add a Remove processor to drop high-cardinality fields like request_body that inflate index size.
```yaml
- type: json_parser
  parse_from: body
- type: severity_parser
  parse_from: attributes.level
- type: move
  from: attributes.trace_id
  to: attributes["trace_id"]
- type: remove
  field: attributes.request_body
```
The Transform processor supports conditional logic and nested field access for complex JSON structures.
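The processor chain above can be mimicked in plain Python to check what a record should look like when it leaves the pipeline. This is an illustrative sketch, not how SigNoz implements its processors. The function name run_pipeline and the sample log line are hypothetical, and the severity numbers are the base values from the OpenTelemetry log data model.

```python
import json
from datetime import datetime

# Base severity numbers from the OpenTelemetry log data model.
SEVERITY_NUMBERS = {"TRACE": 1, "DEBUG": 5, "INFO": 9, "WARN": 13, "ERROR": 17, "FATAL": 21}


def run_pipeline(raw_line):
    """Apply JSON parsing, severity and timestamp mapping, and field removal."""
    attributes = json.loads(raw_line)  # JSON Parser: body fields become attributes

    level = str(attributes.get("level", "")).upper()
    if level == "WARNING":  # common alias used by application loggers
        level = "WARN"

    timestamp = None
    if "timestamp" in attributes:
        # Timestamp Parser: accept RFC3339 with a trailing Z or a numeric offset.
        timestamp = datetime.fromisoformat(attributes["timestamp"].replace("Z", "+00:00"))

    attributes.pop("request_body", None)  # Remove: drop the bulky field

    return {
        "body": raw_line,
        "timestamp": timestamp,
        "severity_text": level,
        "severity_number": SEVERITY_NUMBERS.get(level, 0),  # 0 means unspecified
        "attributes": attributes,
    }


sample = (
    '{"level": "error", "timestamp": "2024-05-01T03:00:00Z", '
    '"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "request_body": "..."}'
)
record = run_pipeline(sample)
print(record["severity_number"], sorted(record["attributes"]))
```

Running a few representative log lines through a check like this before editing the pipeline in the UI makes it easier to spot a field that is named differently from what the processors expect.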
Trace-Log Correlation for Faster Root-Cause Diagnosis
Goal: Link every log line to its parent trace so a single click in SigNoz Logs Explorer jumps to the full distributed trace.
Owner: Application engineers instrumenting with OpenTelemetry SDKs.
Required inputs: OpenTelemetry SDK injecting trace_id and span_id into log context, SigNoz receiving both logs and traces.
Expected outputs: A “View Trace” button on every log line in the SigNoz Logs Explorer.
Production trade-offs: Correlating logs with traces using shared identifiers improves observability and enables faster incident investigation by linking alerts directly to specific log events. Without trace_id propagation, correlation stays manual and error-prone.
Inject trace context at the SDK layer:
```python
# Python example using OTel logging bridge
import logging
from opentelemetry.instrumentation.logging import LoggingInstrumentor

LoggingInstrumentor().instrument(set_logging_format=True)
# Injects trace_id and span_id into every log record automatically
```
In SigNoz Logs Explorer, save a view with the filter trace_id IS NOT NULL AND level = ERROR and pin it to your team's on-call dashboard. A saved trace-correlated view removes the manual search phase that dominates incident response and replaces it with a single click into the right trace.
See how Struct auto-correlates traces with commits. Struct ingests these same trace_id values and connects them with GitHub commits and runbooks before you open the Logs Explorer.
Retention Strategy and Cost Controls in SigNoz
Goal: Keep hot query performance for recent incidents while meeting compliance minimums at the lowest practical storage cost.
Owner: Platform engineering and FinOps or engineering leadership.
Required inputs: Compliance requirements (SOC 2, HIPAA, PCI), estimated daily log volume in GB, SigNoz storage backend (ClickHouse).
Expected outputs: A tiered retention policy with documented cost per tier.
Production trade-offs: Keeping recent data in a hot tier on fast storage enables low-latency queries, while moving older data to cheaper tiers reduces storage costs. For a sense of scale on managed platforms, Azure Monitor's analytics plan supports up to 730 days of analytics retention and up to 4,383 days (about 12 years) in total, including long-term archive.
| Tier | Age Range | Storage Type | Query Method |
|---|---|---|---|
| Hot | 0–7 days | SSD / ClickHouse primary | Full Logs Explorer, real-time |
| Warm | 8–30 days | HDD / object storage | Full Logs Explorer, slightly slower |
| Cold | 31–90 days | Object storage (S3/GCS) | Search jobs, batch queries |
| Archive | 91–365+ days | Compressed object storage | Restore-on-demand for compliance |
An example Elasticsearch ILM (index lifecycle management) policy rolls an index over at 50 GB, moves it to warm after 7 days and to cold after 30 days, and deletes it after 90 days. You can replicate this tiered approach in SigNoz by applying the same age thresholds to ClickHouse TTL settings. Retention settings affect storage cost only: reducing analytics retention below 31 days does not lower ingestion costs on most platforms. Treat one week as the minimum practical hot tier.
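The age thresholds in the table can be written down as a small lookup, which is handy when documenting the policy or checking where a given day's logs should live. This is a sketch under stated assumptions: the TIERS list and the tier_for function are hypothetical names, and age is counted in whole days.

```python
from datetime import datetime, timedelta, timezone

# (tier name, last whole day of age that still belongs to the tier)
TIERS = [("hot", 7), ("warm", 30), ("cold", 90)]


def tier_for(log_time, now):
    """Return the storage tier for a log written at log_time."""
    age_days = (now - log_time).days
    for name, max_age_days in TIERS:
        if age_days <= max_age_days:
            return name
    return "archive"  # anything older than the last threshold


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
for days in (3, 8, 31, 120):
    print(days, tier_for(now - timedelta(days=days), now))
```

Keeping the thresholds in one place makes it harder for the documented policy and the configured TTL values to drift apart.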
Error-Rate and Latency Alert Rules in SigNoz
Goal: Page on real user impact, not transient spikes.
Owner: SRE team with input from product on SLO targets.
Required inputs: Defined SLOs (e.g., 99.9% success rate, p99 latency < 500 ms), SigNoz alert rules UI or API.
Expected outputs: Two alert tiers: a PagerDuty page for acute burn and a Slack ticket for slow burn.
Production trade-offs: Google's SRE Workbook recommends the multiwindow, multi-burn-rate technique to balance precision, recall, detection time, and reset time. Single-window alerts often create false positives from short spikes.
| Alert Tier | Burn Rate | Windows | Routing |
|---|---|---|---|
| Page (critical) | 14.4× | 1 h + 5 min | PagerDuty |
| Page (high) | 6× | 6 h + 30 min | PagerDuty |
| Ticket (warning) | 3× | 24 h + 2 h | Slack #on-call |
| Ticket (low) | 1× | 3 d + 6 h | Slack #alerts |
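```yaml
alert: HighErrorBurnRate
# 14.4x burn rate against a 99.9% SLO (error budget 0.001), checked over 1h and 5m.
# sum() aggregates across series so the 5xx rate is divided by the total request rate.
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[1h]))
    /
    sum(rate(http_requests_total[1h]))
  ) > (14.4 * 0.001)
  and
  (
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
  ) > (14.4 * 0.001)
for: 2m
labels:
  severity: page
```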
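The arithmetic behind the burn-rate thresholds is easier to audit when written out. The sketch below assumes a 99.9% SLO measured over a 30-day period, and the function names are hypothetical. It shows how a burn rate becomes an error-ratio threshold, how much of the error budget a sustained burn consumes, and why both windows must agree before paging.

```python
def error_rate_threshold(burn_rate, slo_target=0.999):
    """Error ratio above which the given burn rate is exceeded."""
    return burn_rate * (1 - slo_target)


def budget_consumed(burn_rate, window_hours, slo_period_days=30):
    """Fraction of the period's error budget spent if burn_rate holds for window_hours."""
    return burn_rate * window_hours / (slo_period_days * 24)


def should_page(long_window_ratio, short_window_ratio, burn_rate=14.4, slo_target=0.999):
    """Fire only when both the long and the short window exceed the threshold."""
    threshold = error_rate_threshold(burn_rate, slo_target)
    return long_window_ratio > threshold and short_window_ratio > threshold


# Burn rate and long window for each row of the table above.
for burn_rate, hours in ((14.4, 1), (6, 6), (3, 24), (1, 72)):
    print(
        burn_rate,
        round(error_rate_threshold(burn_rate), 4),
        round(budget_consumed(burn_rate, hours), 3),
    )

# A spike that has already subsided in the short window does not page.
print(should_page(long_window_ratio=0.02, short_window_ratio=0.001))
```

Under these assumptions the 14.4× tier corresponds to roughly 2% of the monthly budget burned in one hour, and the short window acts as the reset condition that stops the alert once errors stop.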
Google SRE defines the four golden signals as latency, traffic, errors, and saturation, and recommends prioritizing these when only four metrics can be measured. Save a SigNoz Logs Explorer view filtered to level = ERROR AND http.status_code >= 500 and link it directly in the alert annotation so on-call engineers land on the right query instantly.
See automated alert investigation in action. When these alerts fire, Struct auto-investigates and delivers a root-cause dashboard to Slack in under five minutes.
From Reactive Queries to Automated Root-Cause Analysis
Goal: Remove the manual log-hunting phase so engineers review conclusions rather than gather evidence.
Owner: Engineering leadership and on-call rotation leads.
Required inputs: SigNoz structured logs with trace_id, alert routing to Slack or PagerDuty, GitHub repository access, internal runbooks.
Expected outputs: A Slack-native root-cause dashboard with blast radius, correlated log lines, relevant commits, and suggested fixes, delivered within five minutes of alert fire.
Production trade-offs: AI-assisted observability can significantly reduce root-cause diagnosis time. The trade-off is granting an automated system read access to logs and code; Struct is SOC 2 and HIPAA compliant and processes logs ephemerally.
The handoff works in a predictable sequence. When a SigNoz alert fires and routes to Slack, Struct intercepts the alert and queries the same structured logs and trace_id values already flowing through the OpenTelemetry Collector. It then cross-references the correlated trace with recent GitHub commits and applies your team's encoded runbook logic. AI-driven investigation compresses the diagnosis phase from hours to minutes by automatically correlating signals across tools. By the time the on-call engineer acknowledges the page, Struct has already posted a dynamically generated dashboard containing the impact scope, the offending log cluster, the probable root cause, and a suggested fix or PR.
A Series A fintech with 40+ engineers reduced triage time by 80% after connecting Struct, turning a 30–45-minute manual investigation into a five-minute review. Elite-performing engineering teams maintain MTTR below 60 minutes. Automating the diagnosis phase gives teams the fastest path to that benchmark. Junior engineers benefit equally because Struct acts as an automated senior engineer for the first pass, digesting company-specific runbooks so new hires can safely own on-call shifts without escalating every alert.
Frequently Asked Questions
Minimum Team Maturity for Using SigNoz with Struct
Teams need basic observability already in place. Applications should emit structured logs, ideally JSON, and alerts should route to Slack or PagerDuty. A code repository in GitHub must be available. SigNoz handles log ingestion and querying, while Struct needs those same logs plus alert triggers to run automated investigations. Teams still on unstructured plain-text logs should complete the Logs Pipelines step in this guide first. Once trace IDs appear in log lines and alerts route to Slack, Struct can be connected in under 10 minutes.
Actual Struct Setup Time
Struct setup typically takes 5–10 minutes. You authenticate three integrations: your issue source such as Slack or PagerDuty, your code repository such as GitHub, and your observability context such as SigNoz via OTLP or a cloud log provider. Once connected, auto-investigations activate immediately. Seed-to-Series C teams avoid lengthy enterprise deployments, indexing phases, or professional services engagements.
Struct Compliance with SOC 2 and HIPAA
Struct is fully SOC 2 and HIPAA compliant. Log data is accessed and processed ephemerally, and Struct does not store it permanently. For most Seed-to-Series C companies, these compliance standards cover contractual and regulatory requirements. Organizations that mandate full on-premise deployment with zero data leaving the VPC can contact the Struct team to discuss sidecar or on-prem support options on the Enterprise plan.
How Junior Engineers Use Struct Safely
Junior engineers can safely act on Struct's automated first-pass output. Struct encodes your team's on-call runbooks and correlation logic so every investigation output follows the same methodology a senior engineer would apply. Junior engineers receive a structured starting point that includes blast radius, root cause hypothesis, relevant log lines, and suggested next steps. They avoid facing a blank SigNoz query window at 3 a.m. The Slack-native conversational interface also lets them ask follow-up questions or test alternative hypotheses without leaving the alert thread. Struct reports an 85–90%+ helpful investigation rate across automated first-pass outputs.
Handling Poorly Structured Logs or Missing Trace IDs
Struct's accuracy depends on the quality of the telemetry it receives. Missing trace IDs reduce the depth of cross-service correlation, and unstructured log bodies limit field-level filtering. The Logs Pipelines and trace-log correlation sections of this guide address both gaps directly. Teams should treat the OpenTelemetry Collector configuration and JSON parser pipeline as prerequisites before enabling Struct auto-investigations. Struct still functions with partial telemetry, but the root-cause confidence score stays lower until structured fields and trace propagation are in place.
Next Configuration Change or Struct Pilot
The six steps in this guide move a Kubernetes team from raw container logs to a production-grade SigNoz pipeline with structured JSON, trace-log correlation, tiered retention, and multi-window burn-rate alerts. Each step compounds in value. The OpenTelemetry Collector DaemonSet feeds the Logs Pipelines, which produce the trace_id-enriched records that make alert saved views actionable. Those saved views then become the input Struct uses to deliver automated root-cause dashboards.
The gap between a well-configured SigNoz instance and a team that sleeps through the night is the automation layer. Good SLOs that measure reliability as experienced by customers provide the highest-quality indication for when an on-call engineer should respond, but responding still requires investigation. Struct closes that loop by performing the investigation automatically so engineers respond to conclusions rather than raw signals.
Stop burning senior engineers on 3 a.m. log-hunting expeditions. Pick the next YAML block from this guide to deploy today, then connect Struct to handle the investigation layer. Start reducing triage time by 80% with your next alert.