Written by: Nimesh Chakravarthi, Co-founder & CTO, Struct
Key Takeaways
- Monitor critical Kafka metrics like consumer lag (>100k), under-replicated partitions (>0), and offline partitions for real-time issue detection.
- Set up alerts in AWS MSK with CloudWatch, in Confluent Cloud with the UI, and in Datadog or Grafana with Prometheus for multi-cloud coverage.
- Route notifications to Slack, PagerDuty, or webhooks using SNS integrations so the right team sees alerts without alert fatigue.
- Reduce false positives by using evaluation periods (5-15 minutes), rate conditions, and by accounting for consumer rebalances and idle lag.
- Automate your on-call runbook with Struct to cut Kafka triage from 45 minutes to 5 minutes across CloudWatch, Datadog, and Slack.
Kafka Metrics That Deserve Immediate Alerts
Alert on Kafka metrics that signal real user impact or data loss risk. Critical Kafka metrics for real-time alerting include UnderReplicatedPartitions (>0 for 5 min), OfflinePartitionsCount (>0 page immediately), ActiveControllerCount (≠1 critical), Consumer Lag (>100k or rising rapidly), and ISR Shrink Rate (>1/sec sustained).
| Metric | Why Alert | Threshold Example (2026) | Tools |
| --- | --- | --- | --- |
| Consumer Lag | Consumers fall behind | >100k msgs/5 min or rate >50k/min | CloudWatch/Datadog |
| Under-Replicated Partitions | Risk of data loss | >0 for 5 min | Prometheus/Grafana |
| High Latency/CPU | Broker overload | P99 >500 ms, CPU >80% | Confluent/Datadog |
| Offline Partitions | Availability impact | >0, page immediately | MSK Alerts |
Treat offline partitions and controller issues as P0 with 15-second check intervals, and treat under-replication, ISR shrink, and consumer lag as P1 with 30-second intervals. Track throughput metrics like MessagesInPerSec, BytesInPerSec, BytesOutPerSec, and produce/fetch requests per second using 5-minute rates, along with latency metrics such as P99 produce and fetch request latency, request queue time, and log flush latency.
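The P0/P1 split above can be sketched as a simple classification function. This is an illustrative sketch only: the metric names are used as plain dictionary keys mirroring the table, not an official CloudWatch or Prometheus schema.

```python
# Illustrative: classify a snapshot of Kafka metric values into P0/P1
# pages using the thresholds described above. Metric names are
# hypothetical dict keys, not a real monitoring API.

P0_RULES = {
    "OfflinePartitionsCount": lambda v: v > 0,    # page immediately
    "ActiveControllerCount": lambda v: v != 1,    # exactly one controller expected
}

P1_RULES = {
    "UnderReplicatedPartitions": lambda v: v > 0,  # sustained for 5 min
    "ConsumerLag": lambda v: v > 100_000,          # messages behind
    "IsrShrinksPerSec": lambda v: v > 1,           # sustained shrink rate
}

def classify(metrics: dict):
    """Return 'P0', 'P1', or None for a snapshot of metric values."""
    for name, breached in P0_RULES.items():
        if name in metrics and breached(metrics[name]):
            return "P0"
    for name, breached in P1_RULES.items():
        if name in metrics and breached(metrics[name]):
            return "P1"
    return None
```

For example, `classify({"OfflinePartitionsCount": 2})` returns "P0", while a lag of 50k messages triggers nothing.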
Quick Start Checklist: 5 Steps to Kafka Alerts
- Identify critical metrics such as consumer lag >100k messages and offline partitions >0.
- Choose a monitoring tool, for example CloudWatch for MSK or Datadog for multi-cloud environments.
- Set thresholds with rate conditions, for example rate(lag[5m])>0 for sustained lag growth.
- Route notifications to Slack or PagerDuty for fast on-call response.
- Test alerts and add AI triage automation to speed up investigations.
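The rate condition in step 3 amounts to checking that lag grew across the whole evaluation window, not just at one instant. A rough sketch of that logic, with hypothetical helper names and the thresholds from the checklist:

```python
def lag_rate_per_min(samples):
    """Approximate rate(lag[5m]) from (timestamp_sec, lag) samples:
    change in lag divided by elapsed minutes across the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    minutes = (t1 - t0) / 60.0
    return (v1 - v0) / minutes if minutes > 0 else 0.0

def sustained_lag_growth(samples, threshold_per_min=50_000):
    """Alert only if lag is both high at the end of the window and
    still growing faster than the threshold (e.g. >50k msgs/min)."""
    return samples[-1][1] > 100_000 and lag_rate_per_min(samples) > threshold_per_min
```

A group sitting at a constant 200k lag would not fire here, while one climbing from 100k to 400k over five minutes would.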
Kafka Alert Setup Across AWS MSK, Confluent Cloud, Datadog, and Grafana
Each cloud platform exposes Kafka metrics differently, so you configure alerts slightly differently on each one. Use the table below as a quick reference for real-time alert setup across major providers.
| Cloud | Metrics Source | Alert Steps | Notification |
| --- | --- | --- | --- |
| AWS MSK | CloudWatch | Enable enhanced monitoring, then create alarms | SNS/Slack |
| Confluent Cloud | Metrics API | Use UI rules or ksqlDB streams | Webhook |
| Datadog | Agent | Configure YAML and dashboards | Slack |
| Grafana | Prometheus | Create panel queries with Alertmanager | Multiple |
AWS MSK Alerts with CloudWatch Alarms
Create CloudWatch alarms in AWS MSK for real-time monitoring on metrics such as UnderReplicatedPartitions > 0, KafkaDataLogsDiskUsed > 80%, ActiveControllerCount != 1, and OfflinePartitionsCount > 0.
```json
{
  "AlarmName": "MSK-ConsumerLag",
  "MetricName": "ConsumerLag",
  "Namespace": "AWS/Kafka",
  "Threshold": 100000,
  "ComparisonOperator": "GreaterThanThreshold",
  "EvaluationPeriods": 2,
  "Period": 300,
  "AlarmActions": ["arn:aws:sns:us-east-1:123456789:kafka-alerts"]
}
```
Harden your MSK cluster with settings such as auto.create.topics.enable=false, default.replication.factor=3, and min.insync.replicas=2 for reliability before you rely on alerts.
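The alarm JSON above can also be created programmatically. A sketch of the parameters for boto3's `put_metric_alarm` call, with the statistic choice added as an assumption (Maximum surfaces the worst consumer group rather than the average); the SNS ARN is the placeholder from the example:

```python
# Sketch: build CloudWatch alarm parameters matching the JSON example above.
def msk_consumer_lag_alarm(sns_topic_arn):
    return {
        "AlarmName": "MSK-ConsumerLag",
        "Namespace": "AWS/Kafka",
        "MetricName": "ConsumerLag",
        "Statistic": "Maximum",          # assumption: worst group, not the average
        "Period": 300,                   # 5-minute datapoints
        "EvaluationPeriods": 2,          # two breaching periods before alarming
        "Threshold": 100_000,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

# To create the alarm (requires AWS credentials and boto3):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **msk_consumer_lag_alarm("arn:aws:sns:us-east-1:123456789:kafka-alerts"))
```

Using two 5-minute evaluation periods means a single spiky datapoint will not page anyone.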
Confluent Cloud Alert Rules for Kafka
Open the Confluent Cloud Console, then go to Alerts and select Create Alert Rule. Configure consumer lag alerts with thresholds that match your traffic patterns and recovery expectations. Tableflow and Stream Designer support direct writes to Iceberg or Delta Lake from Kafka with exactly-once guarantees, so you should review and update monitoring as your data paths evolve.
Datadog and Grafana Kafka Alert Examples
Track consumer lag with metrics such as kafka_consumergroup_lag and lag in seconds, watch the maximum lag by group or topic, and catch a rising trend with expressions like rate(kafka_consumergroup_lag[5m]) > 0.
```yaml
- alert: KafkaConsumerLagHigh
  expr: kafka_consumergroup_lag > 100000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Kafka consumer lag is high"
```
Automate your on-call runbook so these alerts trigger automatic investigation instead of manual digging.
Routing Kafka Alerts to Slack, PagerDuty, and Observability Tools
Alert routing should send Kafka incidents to the right people on the right channels. Configure CloudWatch SNS topics to send webhook notifications directly into Slack channels for your on-call teams. For PagerDuty integration, create a service integration and route Kafka events through the PagerDuty Events API.
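Routing through the PagerDuty Events API means posting a JSON event to its enqueue endpoint. A sketch of a v2 trigger payload; the routing key is a placeholder that comes from your PagerDuty service integration:

```python
import json

def kafka_pagerduty_event(routing_key, summary, severity="critical"):
    """Build a PagerDuty Events API v2 'trigger' payload as JSON.
    POST the result to https://events.pagerduty.com/v2/enqueue."""
    event = {
        "routing_key": routing_key,    # from your service integration
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": "kafka-monitoring",
            "severity": severity,      # critical, error, warning, or info
        },
    }
    return json.dumps(event)
```

Resolving the incident later is the same payload with `"event_action": "resolve"` and the matching dedup key.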
Datadog and Grafana Kafka Dashboards That Teams Use
Cloud providers expanded their notification systems in 2025, adding Slack integrations for billing events, scaling operations, service version upgrades, and automatic failure alerts. Modern Kafka dashboards should highlight consumer lag trends, broker health metrics, and partition distribution so engineers can spot problems quickly.
Struct connects directly with Slack, Datadog, CloudWatch, and GitHub to join war rooms automatically when alerts fire, and it provides instant context and investigation results.
Fixing Noisy Kafka Alerts and Scaling Your Setup
Kafka alert noise often comes from false positives. A high consumer group join rate signals instability that can trigger false lag alerts, usually from consumer crashes, aggressive session timeouts, or long-running message processing that exceeds max.poll.interval.ms.
Use a hold duration of 5 to 15 minutes (the for clause in Prometheus alert rules) to cut noise from short lag spikes. Remember that consumer lag metrics freeze during rebalances, which produces stale readings that look like high lag, and lag often jumps briefly after the rebalance finishes.
Common pitfalls include:
- Alerting on temporary rebalance lag spikes
- Ignoring idle consumers in lag calculations
- Using session timeouts that are too aggressive
- Missing partition-level lag skew
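Most of these pitfalls reduce to filtering which samples you alert on. A hypothetical sketch of lag evaluation that skips rebalancing groups and idle consumers; the snapshot dicts are illustrative, not a real client API:

```python
def alertable_lag(groups, threshold=100_000):
    """Return consumer group IDs whose lag deserves an alert.
    Skips groups mid-rebalance (lag metrics freeze and go stale) and
    idle groups (lag is static, not a growing backlog).
    Each group is a hypothetical snapshot dict, not a real Kafka API."""
    firing = []
    for g in groups:
        if g["state"] == "rebalancing":
            continue   # stale readings during rebalance
        if g["idle"]:
            continue   # idle consumer: constant lag, no real backlog
        if g["lag"] > threshold:
            firing.append(g["group_id"])
    return firing
```

Partition-level skew needs the same treatment one level down: evaluate per-partition lag rather than only the group total, so one hot partition cannot hide behind a healthy average.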
Why Struct Handles Kafka Alerts Better
Struct correlates Kafka metrics with logs, code changes, and deployment events so you see root causes quickly. Teams that usually spend 45 minutes jumping between CloudWatch and Datadog can get investigation results in under 5 minutes with Struct, while keeping SOC2 and HIPAA compliance and using composable runbooks that match their Kafka setup.
Reduce triage time by 80% – Connect Struct in 10 minutes
Next Steps for Reliable Kafka Alerting
Reliable Kafka alerting depends on tracking the right metrics with clear thresholds, routing notifications intelligently, and automating triage to limit alert fatigue. Start with consumer lag and partition health alerts, then add throughput and latency monitoring as your system grows.
The next stage uses AI to automate investigations. Alerts tell you that something broke, and tools like Struct investigate the root cause, correlate related events, and present actionable dashboards, which frees your team from 3 AM firefights.
Set Up Struct Free – Automate Kafka Triage Now
Kafka Alerts FAQ
How do I alert on Kafka consumer lag in AWS MSK?
Enable enhanced monitoring in your MSK cluster, then create CloudWatch alarms on the ConsumerLag metric. Set thresholds based on your message volume, and treat >100,000 messages for 5 minutes as a likely problem. Use SNS to route alerts to Slack or PagerDuty. Configure the alarm with evaluation periods so you avoid false positives during normal consumer rebalancing.
How long does Confluent Cloud real-time alerts setup take?
Basic consumer lag and throughput alerts in Confluent Cloud usually take under 10 minutes to configure in the web UI. Go to the Alerts section, create rules for your topics and consumer groups, set thresholds that match your workloads, and configure webhook notifications. For more complex setups that use ksqlDB streams, plan for 20 to 30 minutes to test and validate behavior.
How does Struct handle poor Kafka logging and correlation?
Struct correlates sparse Kafka logs with metrics from CloudWatch, Datadog, and other observability tools to build a complete picture. It identifies patterns across consumer groups, topics, and brokers even when individual log entries lack context. The AI fills gaps by analyzing code changes, deployment events, and infrastructure metrics so it can surface root causes that manual investigation might miss.
Is Struct secure for sensitive Kafka logs and data?
Struct maintains SOC2 and HIPAA compliance with ephemeral log processing, so your data is accessed and analyzed temporarily without persistent storage. All integrations use secure APIs with proper authentication, and the system automatically redacts sensitive information. For most Seed to Series C companies, this compliance level meets security requirements without needing on-premise deployment.
Can I customize Struct for my specific Kafka runbooks and procedures?
Struct supports customization for your Kafka runbooks and operating procedures. You can define custom correlation IDs, investigation steps, and company-specific runbooks. You also control how the AI investigates different Kafka alerts, which metrics it should prioritize, and what context it must collect. This approach keeps investigations aligned with your team’s process while automating the manual work.