0. Introduction
Previously, the user notification logic was embedded directly in our business logic, with a failover sequence like PUSH → SMS → Alimtalk. Each stage was blocking and sequential, so the entire notification process could take up to 15 seconds to complete. In other words, a user might have to wait 15 seconds after placing an order.
To address this, our team decided to offload time-consuming tasks to a separate server and handle notification dispatching through a distributed and failover-capable MQ system.
We first looked into Amazon MQ. It offered easy integration with AWS Client VPN and IAM, but the pricing? A single broker was projected to cost about 1,000,000 KRW/month.
Too expensive. So we pivoted: “Let’s build an MQ ourselves using Redis Stream, which we were already using in production.”
Here’s what we needed to address:
- Event Rebalancing
- Fair Event Distribution
- DLQ (Dead Letter Queue)
- Consumer Lifecycle Management
Let’s go over each one.
1. Event Rebalancing
When a consumer reads an event using XREADGROUP, Redis Stream records it in the Pending Entries List (PEL), meaning the message was delivered but not yet acknowledged (ACKed). If the consumer dies or fails to process the message, it stays in the PEL. After a certain idle time, another healthy consumer must take it over using XAUTOCLAIM or XCLAIM. This is known as event rebalancing.
To implement this, consumers periodically scan the PEL and reprocess failed messages that have been idle too long.
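A minimal sketch of that reclaim loop in Kotlin, using the Lettuce client (the stream and group names, the 30-second idle threshold, and the handle() stub are all illustrative; on Redis 6.2+, XAUTOCLAIM folds the scan and claim into a single command):

```kotlin
import io.lettuce.core.Consumer
import io.lettuce.core.Limit
import io.lettuce.core.Range
import io.lettuce.core.RedisClient

const val STREAM = "notification:stream"   // illustrative stream key
const val GROUP = "notification:group"     // illustrative group name
const val IDLE_THRESHOLD_MS = 30_000L      // illustrative "idle too long" cutoff

fun handle(body: Map<String, String>) { /* actual notification dispatch */ }

fun reclaimStaleEvents(myConsumerId: String) {
    val redis = RedisClient.create("redis://localhost").connect().sync()

    // Inspect the group's PEL: entries that were delivered but never ACKed.
    val pending = redis.xpending(STREAM, GROUP, Range.unbounded(), Limit.from(100))

    // Entries idle past the threshold likely belong to a dead consumer.
    val staleIds = pending
        .filter { it.msSinceLastDelivery > IDLE_THRESHOLD_MS }
        .map { it.id }
    if (staleIds.isEmpty()) return

    // Take over ownership (XCLAIM), then process and ACK as usual.
    val reclaimed = redis.xclaim(
        STREAM, Consumer.from(GROUP, myConsumerId),
        IDLE_THRESHOLD_MS, *staleIds.toTypedArray()
    )
    for (msg in reclaimed) {
        handle(msg.body)
        redis.xack(STREAM, GROUP, msg.id)
    }
}
```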
2. Fair Event Distribution
Each of our consumers polls Redis every second using XREADGROUP. The problem? If three consumers (C1, C2, C3) are running and their polling intervals are slightly offset, say:
- C1 at 1.0s
- C2 at 1.1s
- C3 at 1.2s
Then C1 ends up consuming most of the traffic; charting consumption per consumer made the imbalance obvious.
To resolve this, we explored two strategies:
Use a randomized polling interval per consumer (e.g., 1–1.3s):
- Easy to implement
- Doesn’t guarantee perfect balance

Use deterministic sharding:
aliveConsumerIds.minByOrNull { hash(eventId + it) } == myConsumerId
- First, fetch events via XRANGE
- Calculate the hash for each event ID + consumer ID
- If I’m the assigned consumer, proceed with XREADGROUP
- Prevents duplicates when combined with locking
The general process looks like this:
- Reprocessing: XPENDING → filter by hash → XREADGROUP → XACK
- New consumption: XINFO GROUPS → XRANGE → filter by hash → XREADGROUP → XACK
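Both flows share the "filter by hash" step, which is just the one-liner above wrapped around the fetched event IDs. A self-contained sketch (CRC32 is an illustrative choice; any hash that every consumer computes identically works):

```kotlin
import java.util.zip.CRC32

// Illustrative hash; all consumers must compute it identically.
fun hash(s: String): Long = CRC32().apply { update(s.toByteArray()) }.value

// An event is "mine" if my ID minimizes hash(eventId + consumerId) over the
// current alive-consumer list. Every consumer evaluates the same expression,
// so they agree on ownership without any extra coordination.
fun isMine(eventId: String, myConsumerId: String, aliveConsumerIds: List<String>): Boolean =
    aliveConsumerIds.minByOrNull { hash(eventId + it) } == myConsumerId

// The filter step used in both flows: keep only the events this consumer
// owns before issuing XREADGROUP.
fun filterByHash(eventIds: List<String>, myConsumerId: String, alive: List<String>): List<String> =
    eventIds.filter { isMine(it, myConsumerId, alive) }
```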
Preventing Duplicates
Let’s say C1 is about to process event-a because the hash-based filter selected it. Just before it locks and processes the event, a new consumer C2 joins. C2 also gets added to aliveConsumerIds, and now, when the hash is recalculated, it too becomes eligible for event-a.
To avoid such duplication, C1 must acquire a lock before consuming, ensuring only one consumer processes an event.
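One common way to take that lock is SET NX with a TTL, which Redis executes atomically. A sketch (the key layout and 10-second TTL are illustrative):

```kotlin
import io.lettuce.core.RedisClient
import io.lettuce.core.SetArgs

fun tryLockAndProcess(eventId: String, myConsumerId: String) {
    val redis = RedisClient.create("redis://localhost").connect().sync()
    val lockKey = "notification:lock:$eventId"   // illustrative key layout

    // SET ... NX EX 10: exactly one consumer gets "OK", the rest get null.
    val acquired = redis.set(lockKey, myConsumerId, SetArgs.Builder.nx().ex(10)) == "OK"
    if (!acquired) return  // another consumer won the race; skip this event

    // Safe to XREADGROUP, process, and XACK the event here.
}
```

The TTL matters: if the lock holder crashes mid-processing, the lock expires on its own and the event becomes claimable again through the rebalancing path.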
3. Dead Letter Queue (DLQ)
The Redis PEL tracks how many times each entry has been delivered (the delivery count reported by XPENDING). If it exceeds a certain retry threshold, we move the event to a Dead Letter Queue (DLQ).
You can either ACK the event and log it, or store it elsewhere for manual intervention. This ensures the system doesn’t loop forever on broken events.
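A sketch of that sweep (the threshold of 5 and the :dlq stream name are illustrative; here the DLQ is simply another stream):

```kotlin
import io.lettuce.core.Limit
import io.lettuce.core.Range
import io.lettuce.core.RedisClient

const val MAX_DELIVERIES = 5L   // illustrative retry threshold

fun sweepToDlq() {
    val redis = RedisClient.create("redis://localhost").connect().sync()

    val pending = redis.xpending(
        "notification:stream", "notification:group",
        Range.unbounded(), Limit.from(100)
    )

    pending.filter { it.redeliveryCount >= MAX_DELIVERIES }.forEach { p ->
        // Copy the original entry's fields into a separate DLQ stream.
        redis.xrange("notification:stream", Range.create(p.id, p.id))
            .forEach { entry -> redis.xadd("notification:stream:dlq", entry.body) }

        // ACK so the group stops redelivering the poison event.
        redis.xack("notification:stream", "notification:group", p.id)
    }
}
```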
4. Consumer Lifecycle Management
Unlike Kafka, Redis doesn’t offer built-in consumer lifecycle tracking.
We needed a synchronized list of alive consumers to use in our hash-based routing logic.
Option 1: XINFO CONSUMERS
You can use XINFO CONSUMERS to check consumer activity, but this has caveats: if no messages are flowing, every consumer appears idle, and you may mistakenly remove healthy consumers.
Option 2: Custom Heartbeats
So we implemented a custom mechanism:
- Each consumer registers its ID as a key with a TTL (e.g., SET {streamKey}:alive:nodes:{consumerId})
- It also adds its ID to a Redis Set (e.g., SADD {streamKey}:alive:nodes:set)
- We periodically scan the set and check whether the corresponding key still exists:
  - If not, we remove the ID from the set
  - If yes, the consumer is alive
Note: Redis Sets don’t support TTLs per value, so TTLs were tracked separately via keys.
Also, a wildcard lookup like {streamKey}:alive:nodes:* isn’t indexed; it would require scanning the whole keyspace. That’s why we kept the Set.
We also built a sync job to periodically clean up stale consumers from both Redis and the Stream Group.
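Putting heartbeat and sweep together, a minimal sketch (Lettuce again; the key layout mirrors the earlier examples and the 10-second TTL is illustrative):

```kotlin
import io.lettuce.core.RedisClient
import io.lettuce.core.SetArgs
import io.lettuce.core.api.sync.RedisCommands

const val STREAM_KEY = "notification:stream"   // illustrative

// Every consumer calls this on a schedule shorter than the TTL.
fun heartbeat(redis: RedisCommands<String, String>, consumerId: String) {
    redis.set("$STREAM_KEY:alive:nodes:$consumerId", "1", SetArgs.Builder.ex(10))
    redis.sadd("$STREAM_KEY:alive:nodes:set", consumerId)
}

// The sync job: a set member whose TTL key has expired is a dead consumer.
fun sweepDeadConsumers(redis: RedisCommands<String, String>): Set<String> {
    for (id in redis.smembers("$STREAM_KEY:alive:nodes:set")) {
        if (redis.exists("$STREAM_KEY:alive:nodes:$id") == 0L) {
            redis.srem("$STREAM_KEY:alive:nodes:set", id)
            // Also evict it from the consumer group, e.g. XGROUP DELCONSUMER.
        }
    }
    // What remains is the aliveConsumerIds list used for hash routing.
    return redis.smembers("$STREAM_KEY:alive:nodes:set")
}

fun main() {
    val redis = RedisClient.create("redis://localhost").connect().sync()
    heartbeat(redis, "C1")
    println(sweepDeadConsumers(redis))
}
```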
Conclusion
Kafka and Amazon MQ are powerful — but heavy and expensive. Redis, which we already used, turned out to be a solid lightweight alternative.
This project was especially fun because I enjoy raw, low-level architecture. I had full control over the design, automation, and reliability mechanisms — everything from hash-based load distribution to consumer heartbeat tracking.
Sure, Redis doesn’t match Kafka’s disk-based durability guarantees, partitioning, or out-of-the-box rebalancing. But for lightweight events like user notifications, Redis Stream was a fast and reliable fit.
Most importantly, I got to deeply understand how a messaging system should work — and that made all the effort worth it.