Imagine a large-scale IoT deployment where hundreds of thousands of devices are operating in the field. In our customer's case, each device sends a "heartbeat" signal every 10 minutes to indicate that it is alive and connected. These signals are essential for monitoring the overall health of the fleet and detecting any devices that may have gone offline.
In this blog, we will walk through how we built a fully serverless, real-time processing pipeline, from device heartbeat ingestion to alerting and historical storage, using SNS, SQS, Kinesis, Flink, Redis, Firehose, and OpenSearch on AWS.
A customer-managed EC2 app, called Ingress, acts as the initial receiver of the heartbeats from the IoT devices. While we treat it as a black box, meaning we do not manage or interact with its internals, its role is to collect the incoming heartbeats and forward them to other AWS services that are part of our Heartbeats architecture.
We will start with a high-level walkthrough of how each AWS service is connected to form the pipeline. After that, we will dive deeper into the rationale behind specific design choices, as well as the challenges and trade-offs we encountered along the way.
Ingress publishes heartbeats to an SNS topic — a messaging service that enables decoupling and fan-out patterns in AWS. Currently, the only subscriber is an SQS queue.
Why SNS? Based on the customer’s need for future extensibility, we agreed that introducing an SNS topic would offer the most flexibility, allowing additional consumers to be added later without requiring changes to Ingress.
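Since Ingress is a black box to us, we never see its publishing code, but for readers less familiar with SNS, the sketch below shows what publishing a single heartbeat to the topic could look like using the AWS SDK for Java v2. The topic ARN and the payload shape (a JSON body carrying the SN and TM fields described later) are assumptions made purely for illustration.

```java
import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.PublishRequest;

public class HeartbeatPublisher {

    // Hypothetical topic ARN; the real value comes from the deployment.
    private static final String TOPIC_ARN =
            "arn:aws:sns:eu-west-1:123456789012:device-heartbeats";

    public static void main(String[] args) {
        try (SnsClient sns = SnsClient.create()) {
            // A heartbeat carrying the device serial number (SN) and a timestamp (TM).
            String heartbeat =
                    "{\"SN\":\"device-0001\",\"TM\":\"" + java.time.Instant.now() + "\"}";

            // Publish to the topic; SNS fans the message out to every subscriber.
            sns.publish(PublishRequest.builder()
                    .topicArn(TOPIC_ARN)
                    .message(heartbeat)
                    .build());
        }
    }
}
```

Because the topic decouples the producer from its consumers, nothing on the Ingress side has to change when we later attach additional subscribers.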
We use an EventBridge pipe to move messages from SQS to Kinesis. The pipe filters out malformed heartbeats and trims each message down to just the fields the rest of the pipeline needs; both steps are covered in detail later in this post.
The heartbeats flow into a Kinesis stream configured in on-demand mode: a fitting choice that avoids overprovisioning, removes the need for manual shard management, and keeps the system ready for future scaling.
A Java-based Apache Flink application running on Amazon Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) processes the stream. This is a managed stream processing environment that supports real-time analytics at scale. Its key responsibilities are to keep track of the latest heartbeat for every device, decide when a device has gone offline or come back online, and emit a state change event whenever that happens.
Since Flink is multi-threaded and each thread processes only a portion of the stream, we cannot store state in local memory: doing so would prevent us from maintaining a consistent view of the latest status for each device across the entire fleet. To reliably track whether a device is online or offline based on its latest heartbeat, we need a centralized state store accessible to all threads. Instead of thread-local state, we therefore keep the latest status of every device in an external store, Redis, shared by all threads; the reasoning behind this choice is covered in more detail further down.
When a device's connectivity status changes, Flink updates the device's record in Redis and publishes a state change event to an SNS topic, which feeds the alerting and storage paths described next.
We wired this SNS topic directly to a Firehose delivery stream, with no intermediate compute in between. Processed state change events are delivered by Firehose into OpenSearch, where they can be queried in near real time and retained as a historical record of each device's connectivity.
Although technically out of scope for the pipeline, we extended visibility by layering Grafana dashboards on top of OpenSearch. This helps stakeholders track device availability, identify trends, and detect anomalies in real time.
Now that we have walked through the overall structure of the pipeline, we will take a closer look at specific design choices and the challenges that led us to adapt certain parts of our architecture.
Early in the design, we needed to decide whether Ingress should publish heartbeat messages to an SQS queue, an SNS topic, or chain one to the other. Each option carried different trade-offs in terms of flexibility and operational simplicity.
After the customer expressed a desire for future fan-out capabilities, we recommended using an SNS topic as the initial publishing target. This approach offered the flexibility to add more consumers down the line without requiring any changes to Ingress.
At the same time, introducing an SQS queue downstream of the SNS topic brought several practical benefits that SNS alone could not offer: durable buffering when downstream processing slows down or fails, clear visibility into the backlog through queue metrics, and built-in retry behaviour for messages that cannot be processed immediately.
Message ordering was also not a concern in this setup. Since each device sends a heartbeat once every 10 minutes, any previous message from the same device will have already been processed by the time a new one arrives.
Lastly, the customer was already familiar with SQS and had positive experiences with it in past projects. Given that they were expected to eventually take over the management of the project, this familiarity made the decision even more natural.
In short, chaining SNS to SQS gave us the flexibility the customer needed and the reliability, observability, and resilience our pipeline required — with minimal complexity.
Once messages are published to the SNS topic and delivered to the SQS queue, they are picked up by an EventBridge pipe that forwards them into a Kinesis Data Stream. This pipe is more than just a connector: it plays an active role in controlling the quality and shape of the data flowing through the pipeline.
First, the pipe applies a filter that ensures only well-formed heartbeats are forwarded. Specifically, it checks for the presence of both the SN (serial number) and TM (timestamp) fields in the message body. Any message that does not include both fields is automatically discarded. This gives us a clean enforcement point for schema expectations without requiring additional validation logic elsewhere.
Next, the pipe applies a transformation step. It simplifies the payload by extracting and passing along only the SN and TM fields, discarding any unrelated or verbose metadata. This results in a smaller, more focused message format downstream, which improves performance and reduces parsing effort for the Flink application.
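To make the filtering and transformation more concrete, here is a rough sketch of the two pieces of pipe configuration, kept as Java string constants so it matches the other examples in this post. The exact JSON paths are assumptions: they presume raw message delivery is enabled on the SNS subscription, so the SQS message body is the heartbeat JSON itself with top-level SN and TM fields. Treat this as illustrative rather than the production configuration.

```java
/**
 * Illustrative EventBridge Pipes settings for the heartbeat pipe.
 * Field paths assume the SQS message body is the raw heartbeat JSON
 * with string-valued SN and TM fields.
 */
public final class HeartbeatPipeConfig {

    /** Filter criterion: only forward messages whose body contains both SN and TM. */
    public static final String FILTER_PATTERN = """
            {
              "body": {
                "SN": [ { "exists": true } ],
                "TM": [ { "exists": true } ]
              }
            }
            """;

    /** Input transformer: strip the payload down to the two fields Flink needs. */
    public static final String INPUT_TEMPLATE = """
            {
              "SN": "<$.body.SN>",
              "TM": "<$.body.TM>"
            }
            """;

    private HeartbeatPipeConfig() { }
}
```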
The transformed messages are then sent into a Kinesis Data Stream, which is configured in on-demand mode. This allows the stream to automatically scale its number of shards based on throughput, eliminating the need to preallocate or monitor shard count manually. Since traffic arrives in a predictable, bursty pattern every 10 minutes, this dynamic scaling is both cost-effective and operationally lightweight.
The partition key for the stream is set to the device's serial number (SN), meaning that all heartbeats from the same device are routed to the same shard and therefore arrive in order.
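The pipe performs the writes for us, so there is no producer code to maintain, but conceptually each write boils down to a PutRecord call with the serial number as the partition key. The sketch below uses the AWS SDK for Java v2; the stream name and payload are made up for illustration.

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

public class HeartbeatToKinesis {

    public static void main(String[] args) {
        try (KinesisClient kinesis = KinesisClient.create()) {
            String sn = "device-0001"; // device serial number
            String payload = "{\"SN\":\"" + sn + "\",\"TM\":\"2024-01-01T00:00:00Z\"}";

            kinesis.putRecord(PutRecordRequest.builder()
                    .streamName("device-heartbeats")   // hypothetical stream name
                    .partitionKey(sn)                  // same device -> same shard
                    .data(SdkBytes.fromUtf8String(payload))
                    .build());
        }
    }
}
```

Keeping the partition key stable per device is what guarantees that a device's heartbeats stay in order within their shard.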
Overall, this design allowed us to enforce message quality, reduce noise, streamline data, and scale ingestion — all without writing or maintaining custom logic.
When implementing the heartbeat processor in Flink, we initially believed that the partitioning strategy would naturally map each device to a consistent Java thread. Since the partition key was the device's serial number (SN), and that key determines the Kinesis shard, we assumed that each Java thread would always process the same set of devices.
Had that been the case, we could safely use Java’s in-memory state per thread to track device status (e.g., last known heartbeat, online/offline status), since each device’s updates would always go to the same thread.
That assumption turned out to be incorrect.
While Flink preserves event ordering within each partition (Kinesis shard), it makes no guarantee that a given shard will always be handled by the same Java thread: after a restart, a change in parallelism, or a reshard of the stream, devices mapped to that shard can end up being processed by a different thread. As a result, state stored in one thread's memory would not be accessible to another, leading to fragmented, inconsistent tracking of heartbeat status across the system.
To address this, we moved the heartbeat state into an external, centralized data store.
Redis (and specifically Valkey) was chosen as that store for several reasons: it offers very low-latency reads and writes, it is simple to operate, and with replication enabled it provides enough fault tolerance for this workload.
We also considered using DynamoDB as an alternative, since it offers high availability, persistence, and native AWS integration. However, for our use case, the higher latency, throughput cost, and operational complexity (e.g., handling rate limits and provisioned capacity) were not justified. Redis, by contrast, offered the low-latency access and operational simplicity we needed. With replication enabled, it also provided sufficient fault tolerance for our requirements, despite not being a persistent store.
By moving the state out of the Flink runtime and into Redis, we gained a globally consistent view of all devices' latest status, regardless of which thread processed their events. This shift was critical to achieving accurate and reliable state transitions across the fleet.
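The sketch below illustrates the core idea under a few assumptions: a Jedis client opened per subtask, one Redis hash per device keyed as "hb:<SN>", and a simple ONLINE status flag. It only shows the online direction (a heartbeat arriving for a device that was not already marked online); detecting missed heartbeats, replay handling, and error handling are left out, and all names are illustrative rather than taken from the production code.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;
import redis.clients.jedis.JedisPooled;

/**
 * Simplified sketch: every subtask talks to the same Redis endpoint, so the
 * latest status of each device stays consistent no matter which thread
 * happens to process its heartbeats.
 */
public class HeartbeatTracker extends RichFlatMapFunction<Heartbeat, StateChange> {

    private transient JedisPooled redis;

    @Override
    public void open(Configuration parameters) {
        // Hypothetical endpoint; in practice this comes from the job configuration.
        redis = new JedisPooled("heartbeat-redis.example.internal", 6379);
    }

    @Override
    public void flatMap(Heartbeat hb, Collector<StateChange> out) {
        String key = "hb:" + hb.serialNumber();
        String previousStatus = redis.hget(key, "status");

        // Always record the latest heartbeat timestamp for this device.
        redis.hset(key, "lastSeen", String.valueOf(hb.timestamp()));

        // Emit a state change only when the device was not already marked online.
        if (!"ONLINE".equals(previousStatus)) {
            redis.hset(key, "status", "ONLINE");
            out.collect(new StateChange(hb.serialNumber(), "ONLINE", hb.timestamp()));
        }
    }

    @Override
    public void close() {
        if (redis != null) {
            redis.close();
        }
    }
}

// Minimal event types used by the sketch above.
record Heartbeat(String serialNumber, long timestamp) { }
record StateChange(String serialNumber, String status, long timestamp) { }
```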
In a long-running stream processing application, restarts are inevitable, whether due to deployment changes, scaling, or failure recovery. In our setup, the Flink application is configured to restart from the beginning of the stream (the TRIM_HORIZON position), which means replaying up to the last 24 hours of data retained in the Kinesis stream. As a result, on every restart the app reprocesses heartbeats that may have already been handled.
Without additional safeguards, this behaviour could lead to false alerts or duplicate state transitions being detected from historical data. To prevent that, we introduced a presentTime flag in the application logic. This flag remains disabled while old data is being replayed and only activates once the system detects that it is receiving current (i.e., fresh) heartbeats.
Each heartbeat object also inherits contextual metadata, such as the time of the last state change and whether a disconnection has already been reported. Out-of-order or outdated heartbeats are silently dropped. Redis serves not only as shared memory, but also as a control point for determining whether a device's state has truly changed. Only real transitions (e.g., from connected to disconnected or vice versa) result in publishing to SNS or writing to S3, keeping downstream systems clean and focused.
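A minimal sketch of that replay guard, under the assumption that heartbeat timestamps are epoch milliseconds and that anything older than roughly one heartbeat interval counts as replayed history, could look like this (the class and constant names are illustrative, not the production implementation):

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Illustrative replay guard: state change publishing stays suppressed until
 * the stream has caught up and we are seeing "present" heartbeats again.
 */
public class ReplayGuard {

    // Roughly one heartbeat interval; anything older is treated as replayed history.
    private static final Duration FRESHNESS_WINDOW = Duration.ofMinutes(10);

    private boolean presentTime = false;

    /** Flips to true (and stays true) once incoming data has caught up with the clock. */
    public boolean isPresentTime(long heartbeatEpochMillis) {
        if (!presentTime) {
            Instant heartbeat = Instant.ofEpochMilli(heartbeatEpochMillis);
            if (Duration.between(heartbeat, Instant.now()).compareTo(FRESHNESS_WINDOW) < 0) {
                presentTime = true; // from here on, real transitions may be published
            }
        }
        return presentTime;
    }

    /** Out-of-order or outdated heartbeats are dropped before they touch shared state. */
    public boolean isStale(long heartbeatEpochMillis, long lastSeenEpochMillis) {
        return heartbeatEpochMillis <= lastSeenEpochMillis;
    }
}
```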
Designing a reliable, real-time telemetry pipeline for hundreds of thousands of IoT devices is not just about choosing the right AWS services: it is about making those services work together thoughtfully under real-world conditions.
What began as a relatively straightforward data ingestion problem quickly turned into a deeper engineering challenge: how to ensure message integrity at scale, how to maintain consistent state in a distributed system, and how to avoid flooding downstream systems with redundant or stale data.
Along the way, we learned that message quality is best enforced as early in the pipeline as possible, that consistent state in a distributed system never comes for free, and that protecting downstream systems from redundant or stale data matters just as much as processing the data itself.
The end result is a system that is flexible, resilient, and operationally lean. It gives our customer a complete picture of device activity, including real-time status and historical trends, without requiring constant manual intervention.
This pipeline isn't just scalable on paper — it has already handled hundreds of millions of heartbeats in production with high accuracy and low noise.
Thank you for taking the time to read this deep dive into our heartbeat processing pipeline! I hope you found it insightful and maybe even picked up an idea or two. If you have any questions, feedback, or just want to chat about similar challenges, feel free to reach out — I would be glad to continue the conversation!