Imagine a large-scale IoT deployment where hundreds of thousands of devices are operating in the field. In our customer's case, each device sends a "heartbeat" signal every 10 minutes to indicate that it is alive and connected. These signals are essential for monitoring the overall health of the fleet and detecting any devices that may have gone offline.
In this blog, we will walk through how we built a fully serverless, real-time processing pipeline, from device heartbeat ingestion to alerting and historical storage, using SNS, SQS, Kinesis, Flink, Redis, Firehose, and OpenSearch on AWS.
A customer-managed EC2 app, called Ingress, acts as the initial receiver of the heartbeats from the IoT devices. While we treat it as a black box, meaning we do not manage or interact with its internals, its role is to collect the incoming heartbeats and forward them to other AWS services that are part of our Heartbeats architecture.
We will start with a high-level walkthrough of how each AWS service is connected to form the pipeline. After that, we will dive deeper into the rationale behind specific design choices, as well as the challenges and trade-offs we encountered along the way.
Ingress publishes heartbeats to an SNS topic — a messaging service that enables decoupling and fan-out patterns in AWS. Currently, the only subscriber is an SQS queue.
Why SNS? Based on the customer’s need for future extensibility, we agreed that introducing an SNS topic would offer the most flexibility, allowing additional consumers to be added later without requiring changes to Ingress.
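Since Ingress is a black box to us, we never see its publishing code, but for readers less familiar with SNS, the sketch below shows what publishing a single heartbeat to the topic could look like using the AWS SDK for Java v2. The topic ARN and the payload shape (a JSON body carrying the SN and TM fields described later) are assumptions made purely for illustration.

```java
import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.PublishRequest;

public class HeartbeatPublisher {

    // Hypothetical topic ARN; the real value comes from the deployment.
    private static final String TOPIC_ARN =
            "arn:aws:sns:eu-west-1:123456789012:device-heartbeats";

    public static void main(String[] args) {
        try (SnsClient sns = SnsClient.create()) {
            // A heartbeat carrying the device serial number (SN) and a timestamp (TM).
            String heartbeat =
                    "{\"SN\":\"device-0001\",\"TM\":\"" + java.time.Instant.now() + "\"}";

            // Publish to the topic; SNS fans the message out to every subscriber.
            sns.publish(PublishRequest.builder()
                    .topicArn(TOPIC_ARN)
                    .message(heartbeat)
                    .build());
        }
    }
}
```

Because the topic decouples the producer from its consumers, nothing on the Ingress side has to change when we later attach additional subscribers.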
We use an EventBridge pipe to move messages from SQS to Kinesis. The pipe filters out malformed heartbeats and trims each message down to just the fields the rest of the pipeline needs; both steps are covered in detail later in this post.
The heartbeats flow into a Kinesis stream configured in on-demand mode: a fitting choice that avoids overprovisioning, removes the need for manual shard management, and keeps the system ready for future scaling.
A Java-based Apache Flink application running on Amazon Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) processes the stream. This is a managed stream processing environment that supports real-time analytics at scale. Its key responsibilities are to keep track of the latest heartbeat for every device, decide when a device has gone offline or come back online, and emit a state change event whenever that happens.
Since Flink is multi-threaded and each thread processes only a portion of the stream, we cannot store state in local memory: doing so would prevent us from maintaining a consistent view of the latest status for each device across the entire fleet. To reliably track whether a device is online or offline based on its latest heartbeat, we need a centralized state store accessible to all threads. Instead of thread-local state, we therefore keep the latest status of every device in an external store, Redis, shared by all threads; the reasoning behind this choice is covered in more detail further down.
When a device's connectivity status changes, Flink updates the device's record in Redis and publishes a state change event to an SNS topic, which feeds the alerting and storage paths described next.
We wired this SNS topic directly to a Firehose delivery stream, with no intermediate compute in between. Processed state change events are delivered by Firehose into OpenSearch, where they can be queried in near real time and retained as a historical record of each device's connectivity.
Although technically out of scope for the pipeline, we extended visibility by layering Grafana dashboards on top of OpenSearch. This helps stakeholders track device availability, identify trends, and detect anomalies in real time.
Now that we have walked through the overall structure of the pipeline, we will take a closer look at specific design choices and the challenges that led us to adapt certain parts of our architecture.
Early in the design, we needed to decide whether Ingress should publish heartbeat messages to an SQS queue, an SNS topic, or chain one to the other. Each option carried different trade-offs in terms of flexibility and operational simplicity.
After the customer expressed a desire for future fan-out capabilities, we recommended using an SNS topic as the initial publishing target. This approach offered the flexibility to add more consumers down the line without requiring any changes to Ingress.
At the same time, introducing an SQS queue downstream of the SNS topic brought several practical benefits that SNS alone could not offer: durable buffering when downstream processing slows down or fails, clear visibility into the backlog through queue metrics, and built-in retry behaviour for messages that cannot be processed immediately.
Message ordering was also not a concern in this setup. Since each device sends a heartbeat once every 10 minutes, any previous message from the same device will have already been processed by the time a new one arrives.
Lastly, the customer was already familiar with SQS and had positive experiences with it in past projects. Given that they were expected to eventually take over the management of the project, this familiarity made the decision even more natural.
In short, chaining SNS to SQS gave us the flexibility the customer needed and the reliability, observability, and resilience our pipeline required — with minimal complexity.
Once messages are published to the SNS topic and delivered to the SQS queue, they are picked up by an EventBridge pipe that forwards them into a Kinesis Data Stream. This pipe is more than just a connector: it plays an active role in controlling the quality and shape of the data flowing through the pipeline.
First, the pipe applies a filter that ensures only well-formed heartbeats are forwarded. Specifically, it checks for the presence of both the SN (serial number) and TM (timestamp) fields in the message body. Any message that does not include both fields is automatically discarded. This gives us a clean enforcement point for schema expectations without requiring additional validation logic elsewhere.
Next, the pipe applies a transformation step. It simplifies the payload by extracting and passing along only the SN and TM fields, discarding any unrelated or verbose metadata. This results in a smaller, more focused message format downstream, which improves performance and reduces parsing effort for the Flink application.
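To make the filtering and transformation more concrete, here is a rough sketch of the two pieces of pipe configuration, kept as Java string constants so it matches the other examples in this post. The exact JSON paths are assumptions: they presume raw message delivery is enabled on the SNS subscription, so the SQS message body is the heartbeat JSON itself with top-level SN and TM fields. Treat this as illustrative rather than the production configuration.

```java
/**
 * Illustrative EventBridge Pipes settings for the heartbeat pipe.
 * Field paths assume the SQS message body is the raw heartbeat JSON
 * with string-valued SN and TM fields.
 */
public final class HeartbeatPipeConfig {

    /** Filter criterion: only forward messages whose body contains both SN and TM. */
    public static final String FILTER_PATTERN = """
            {
              "body": {
                "SN": [ { "exists": true } ],
                "TM": [ { "exists": true } ]
              }
            }
            """;

    /** Input transformer: strip the payload down to the two fields Flink needs. */
    public static final String INPUT_TEMPLATE = """
            {
              "SN": "<$.body.SN>",
              "TM": "<$.body.TM>"
            }
            """;

    private HeartbeatPipeConfig() { }
}
```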
The transformed messages are then sent into a Kinesis Data Stream, which is configured in on-demand mode. This allows the stream to automatically scale its number of shards based on throughput, eliminating the need to preallocate or monitor shard count manually. Since traffic arrives in a predictable, bursty pattern every 10 minutes, this dynamic scaling is both cost-effective and operationally lightweight.
The partition key for the stream is set to the device's serial number (SN), meaning that all heartbeats from the same device are routed to the same shard and therefore arrive in order.
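The pipe performs the writes for us, so there is no producer code to maintain, but conceptually each write boils down to a PutRecord call with the serial number as the partition key. The sketch below uses the AWS SDK for Java v2; the stream name and payload are made up for illustration.

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

public class HeartbeatToKinesis {

    public static void main(String[] args) {
        try (KinesisClient kinesis = KinesisClient.create()) {
            String sn = "device-0001"; // device serial number
            String payload = "{\"SN\":\"" + sn + "\",\"TM\":\"2024-01-01T00:00:00Z\"}";

            kinesis.putRecord(PutRecordRequest.builder()
                    .streamName("device-heartbeats")   // hypothetical stream name
                    .partitionKey(sn)                  // same device -> same shard
                    .data(SdkBytes.fromUtf8String(payload))
                    .build());
        }
    }
}
```

Keeping the partition key stable per device is what guarantees that a device's heartbeats stay in order within their shard.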
Overall, this design allowed us to enforce message quality, reduce noise, streamline data, and scale ingestion — all without writing or maintaining custom logic.
When implementing the heartbeat processor in Flink, we initially believed that the partitioning strategy would naturally map each device to a consistent Java thread. Since the partition key was the device's serial number (SN), and that key determines the Kinesis shard, we assumed that each Java thread would always process the same set of devices.
Had that been the case, we could safely use Java’s in-memory state per thread to track device status (e.g., last known heartbeat, online/offline status), since each device’s updates would always go to the same thread.
That assumption turned out to be incorrect.
While Flink preserves event ordering within each partition (Kinesis shard), it makes no guarantee that a given shard will always be handled by the same Java thread: after a restart, a change in parallelism, or a reshard of the stream, devices mapped to that shard can end up being processed by a different thread. As a result, state stored in one thread's memory would not be accessible to another, leading to fragmented, inconsistent tracking of heartbeat status across the system.
To address this, we moved the heartbeat state into an external, centralized data store.
Redis (and specifically Valkey) was chosen as that store for several reasons: it offers very low-latency reads and writes, it is simple to operate, and with replication enabled it provides enough fault tolerance for this workload.
We also considered using DynamoDB as an alternative, since it offers high availability, persistence, and native AWS integration. However, for our use case, the higher latency, throughput cost, and operational complexity (e.g., handling rate limits and provisioned capacity) were not justified. Redis, by contrast, offered the low-latency access and operational simplicity we needed. With replication enabled, it also provided sufficient fault tolerance for our requirements, despite not being a persistent store.
By moving the state out of the Flink runtime and into Redis, we gained a globally consistent view of all devices' latest status, regardless of which thread processed their events. This shift was critical to achieving accurate and reliable state transitions across the fleet.
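The sketch below illustrates the core idea under a few assumptions: a Jedis client opened per subtask, one Redis hash per device keyed as "hb:<SN>", and a simple ONLINE status flag. It only shows the online direction (a heartbeat arriving for a device that was not already marked online); detecting missed heartbeats, replay handling, and error handling are left out, and all names are illustrative rather than taken from the production code.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;
import redis.clients.jedis.JedisPooled;

/**
 * Simplified sketch: every subtask talks to the same Redis endpoint, so the
 * latest status of each device stays consistent no matter which thread
 * happens to process its heartbeats.
 */
public class HeartbeatTracker extends RichFlatMapFunction<Heartbeat, StateChange> {

    private transient JedisPooled redis;

    @Override
    public void open(Configuration parameters) {
        // Hypothetical endpoint; in practice this comes from the job configuration.
        redis = new JedisPooled("heartbeat-redis.example.internal", 6379);
    }

    @Override
    public void flatMap(Heartbeat hb, Collector<StateChange> out) {
        String key = "hb:" + hb.serialNumber();
        String previousStatus = redis.hget(key, "status");

        // Always record the latest heartbeat timestamp for this device.
        redis.hset(key, "lastSeen", String.valueOf(hb.timestamp()));

        // Emit a state change only when the device was not already marked online.
        if (!"ONLINE".equals(previousStatus)) {
            redis.hset(key, "status", "ONLINE");
            out.collect(new StateChange(hb.serialNumber(), "ONLINE", hb.timestamp()));
        }
    }

    @Override
    public void close() {
        if (redis != null) {
            redis.close();
        }
    }
}

// Minimal event types used by the sketch above.
record Heartbeat(String serialNumber, long timestamp) { }
record StateChange(String serialNumber, String status, long timestamp) { }
```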
In a long-running stream processing application, restarts are inevitable, whether due to deployment changes, scaling, or failure recovery. In our setup, the Flink application is configured to restart from the beginning of the stream (the TRIM_HORIZON position), which means replaying up to the last 24 hours of data retained in the Kinesis stream. As a result, on every restart the app reprocesses heartbeats that may have already been handled.
Without additional safeguards, this behaviour could lead to false alerts or duplicate state transitions being detected from historical data. To prevent that, we introduced a presentTime flag in the application logic. This flag remains disabled while old data is being replayed and only activates once the system detects that it is receiving current (i.e., fresh) heartbeats.
Each heartbeat object also inherits contextual metadata, such as the time of the last state change and whether a disconnection has already been reported. Out-of-order or outdated heartbeats are silently dropped. Redis serves not only as shared memory, but also as a control point for determining whether a device's state has truly changed. Only real transitions (e.g., from connected to disconnected or vice versa) result in publishing to SNS or writing to S3, keeping downstream systems clean and focused.
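A minimal sketch of that replay guard, under the assumption that heartbeat timestamps are epoch milliseconds and that anything older than roughly one heartbeat interval counts as replayed history, could look like this (the class and constant names are illustrative, not the production implementation):

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Illustrative replay guard: state change publishing stays suppressed until
 * the stream has caught up and we are seeing "present" heartbeats again.
 */
public class ReplayGuard {

    // Roughly one heartbeat interval; anything older is treated as replayed history.
    private static final Duration FRESHNESS_WINDOW = Duration.ofMinutes(10);

    private boolean presentTime = false;

    /** Flips to true (and stays true) once incoming data has caught up with the clock. */
    public boolean isPresentTime(long heartbeatEpochMillis) {
        if (!presentTime) {
            Instant heartbeat = Instant.ofEpochMilli(heartbeatEpochMillis);
            if (Duration.between(heartbeat, Instant.now()).compareTo(FRESHNESS_WINDOW) < 0) {
                presentTime = true; // from here on, real transitions may be published
            }
        }
        return presentTime;
    }

    /** Out-of-order or outdated heartbeats are dropped before they touch shared state. */
    public boolean isStale(long heartbeatEpochMillis, long lastSeenEpochMillis) {
        return heartbeatEpochMillis <= lastSeenEpochMillis;
    }
}
```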
Designing a reliable, real-time telemetry pipeline for hundreds of thousands of IoT devices is not just about choosing the right AWS services: it is about making those services work together thoughtfully under real-world conditions.
What began as a relatively straightforward data ingestion problem quickly turned into a deeper engineering challenge: how to ensure message integrity at scale, how to maintain consistent state in a distributed system, and how to avoid flooding downstream systems with redundant or stale data.
Along the way, we learned that message quality is best enforced as early in the pipeline as possible, that consistent state in a distributed system never comes for free, and that protecting downstream systems from redundant or stale data matters just as much as processing the data itself.
The end result is a system that is flexible, resilient, and operationally lean. It gives our customer a complete picture of device activity, including real-time status and historical trends, without requiring constant manual intervention.
This pipeline isn't just scalable on paper — it has already handled hundreds of millions of heartbeats in production with high accuracy and low noise.
Thank you for taking the time to read this deep dive into our heartbeat processing pipeline! I hope you found it insightful and maybe even picked up an idea or two. If you have any questions, feedback, or just want to chat about similar challenges, feel free to reach out — I would be glad to continue the conversation!