0. Introduction

Previously, the user notification logic was embedded directly in our business logic, with a failover sequence like PUSH → SMS → Alimtalk. Each stage was blocking and sequential, which meant the entire notification process could take up to 15 seconds to complete.

In other words, the user might have to wait 15 seconds after placing an order.

To address this, our team decided to offload time-consuming tasks to a separate server and handle notification dispatching through a distributed and failover-capable MQ system.

We initially looked into Amazon MQ. It offered easy integration with AWS Client VPN and IAM, but the pricing? A single broker was projected to cost about 1,000,000 won per month.

Too expensive. So we pivoted: “Let’s build an MQ ourselves using Redis Stream, which we were already using in production.”

Here’s what we needed to address:

  1. Event Rebalancing
  2. Fair Event Distribution
  3. DLQ (Dead Letter Queue)
  4. Consumer Lifecycle Management

Let’s go over each one.

1. Event Rebalancing

When a consumer reads an event using XREADGROUP, Redis Stream stores it in the Pending Entries List (PEL), which means the message was delivered but not yet acknowledged (XACK).

If the consumer dies or fails to process the message, it stays in the PEL. After a certain idle_time, other healthy consumers must take over using XAUTOCLAIM or XCLAIM. This is known as event rebalancing.

To implement this, consumers periodically scan the PEL and reprocess failed messages that have been idle too long.
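
As a rough sketch of that scan (assuming Spring Data Redis as the client; the function and its streamKey, group, and maxIdle parameters are illustrative, not our exact production code):

    import java.time.Duration
    import org.springframework.data.domain.Range
    import org.springframework.data.redis.connection.stream.RecordId
    import org.springframework.data.redis.core.StringRedisTemplate

    /**
     * Periodically scan the group's PEL and claim entries that have been idle too long.
     * streamKey / group / maxIdle are illustrative parameters, not production values.
     */
    fun rebalance(redis: StringRedisTemplate, streamKey: String, group: String, myConsumerId: String,
                  maxIdle: Duration = Duration.ofSeconds(30)) {
        val ops = redis.opsForStream<String, String>()

        // XPENDING: look at up to 100 pending (delivered but unacknowledged) entries.
        val pending = ops.pending(streamKey, group, Range.unbounded<String>(), 100L)

        val staleIds: Array<RecordId> = pending
            .filter { it.elapsedTimeSinceLastDelivery >= maxIdle } // idle_time threshold
            .map { it.id }
            .toTypedArray()
        if (staleIds.isEmpty()) return

        // XCLAIM: take over the stale entries so this consumer can reprocess them.
        val claimed = ops.claim(streamKey, group, myConsumerId, maxIdle, *staleIds)
        claimed.forEach { record ->
            // ... resend the notification here ...
            ops.acknowledge(streamKey, group, record.id) // XACK once reprocessing succeeds
        }
    }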

2. Fair Event Distribution

Each of our consumers polls Redis every second using XREADGROUP. The problem? If three consumers (C1, C2, C3) are running and their polling intervals are slightly offset, say:

  • C1 at 1.0s
  • C2 at 1.1s
  • C3 at 1.2s

Then C1 ends up consuming most of the traffic; visualizing per-consumer throughput made the imbalance obvious.

To resolve this, we explored two strategies:

  1. Use a randomized polling interval per consumer (e.g., 1–1.3s)

    • Easy to implement
    • Doesn’t guarantee perfect balance
  2. Use deterministic sharding:

    aliveConsumerIds.minByOrNull { hash(eventId + it) } == myConsumerId
    
    • First, fetch events via XRANGE
    • Calculate the hash for each event + consumer ID
    • If I’m the assigned consumer, proceed with XREADGROUP
    • Prevents duplicates when combined with locking

The general process looks like this:

  • Reprocessing: XPENDING → filter by hash → XREADGROUP → XACK
  • New consumption: XINFO GROUPS → XRANGE → filter by hash → XREADGROUP → XACK
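
A minimal sketch of the new-consumption path under the same assumption (Spring Data Redis). The function names, parameters, and the use of hashCode() as the hash are illustrative; lastDeliveredId is the group's last-delivered-id obtained from XINFO GROUPS, and exclusive XRANGE bounds assume Redis 6.2+:

    import org.springframework.data.domain.Range
    import org.springframework.data.redis.connection.stream.Consumer
    import org.springframework.data.redis.connection.stream.ReadOffset
    import org.springframework.data.redis.connection.stream.StreamOffset
    import org.springframework.data.redis.connection.stream.StreamReadOptions
    import org.springframework.data.redis.core.StringRedisTemplate

    // Same idea as the expression above: every consumer computes the same owner for an event.
    // hashCode() stands in for whatever stable hash function is actually used.
    fun isAssignedToMe(eventId: String, aliveConsumerIds: List<String>, myConsumerId: String) =
        aliveConsumerIds.minByOrNull { (eventId + it).hashCode() } == myConsumerId

    /**
     * New-consumption path: XRANGE to peek at the next undelivered entry, filter by hash,
     * and only the assigned consumer calls XREADGROUP (then XACK).
     */
    fun consumeNewEvents(
        redis: StringRedisTemplate,
        streamKey: String,
        group: String,
        lastDeliveredId: String,
        aliveConsumerIds: List<String>,
        myConsumerId: String
    ) {
        val ops = redis.opsForStream<String, String>()

        // XRANGE: peek at entries after last-delivered-id without consuming them.
        val next = ops.range(streamKey, Range.rightUnbounded(Range.Bound.exclusive(lastDeliveredId)))
            ?.firstOrNull() ?: return

        // Hash filter: skip if this event belongs to another alive consumer.
        if (!isAssignedToMe(next.id.value, aliveConsumerIds, myConsumerId)) return

        // XREADGROUP COUNT 1 with ">": moves the entry into my PEL.
        val records = ops.read(
            Consumer.from(group, myConsumerId),
            StreamReadOptions.empty().count(1L),
            StreamOffset.create(streamKey, ReadOffset.lastConsumed())
        ) ?: return

        records.forEach { record ->
            // ... dispatch the notification here ...
            ops.acknowledge(streamKey, group, record.id) // XACK
        }
    }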

Preventing Duplicates

Let’s say C1 is about to process event-a because the hash-based filter selected it. Just before it locks and processes the event, a new consumer C2 joins. C2 is added to aliveConsumerIds, and with the updated list the hash may now assign event-a to C2 instead, so for a moment both C1 and C2 believe they own the same event.

To avoid such duplication, C1 must acquire a lock before consuming, ensuring only one consumer processes an event.
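
A sketch of that lock, again assuming Spring Data Redis. The lock key format and TTL are illustrative; SET with NX and an expiry is enough for a simple per-event lock:

    import java.time.Duration
    import org.springframework.data.redis.core.StringRedisTemplate

    /**
     * Per-event lock: SET key value NX EX. Whoever wins the lock is the only consumer
     * allowed to XREADGROUP and process the event, even if aliveConsumerIds changes mid-flight.
     * The TTL just needs to outlive one processing attempt.
     */
    fun tryLockEvent(redis: StringRedisTemplate, streamKey: String, eventId: String,
                     myConsumerId: String, ttl: Duration = Duration.ofSeconds(10)): Boolean =
        redis.opsForValue().setIfAbsent("$streamKey:lock:$eventId", myConsumerId, ttl) == true

    // Usage in the consumption path (sketch):
    //   if (isAssignedToMe(eventId, aliveConsumerIds, myConsumerId)
    //       && tryLockEvent(redis, streamKey, eventId, myConsumerId)) {
    //       // XREADGROUP → process → XACK
    //   }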

3. Dead Letter Queue (DLQ)

The Redis PEL tracks how many times each event has been delivered (the delivery count reported by XPENDING). If that count exceeds a retry threshold, we move the event to a Dead Letter Queue (DLQ).

You can either ACK the event and log it, or store it elsewhere for manual intervention. This ensures the system doesn’t loop forever on broken events.
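
A sketch of the DLQ sweep under the same assumption (Spring Data Redis; the DLQ stream key and retry threshold are illustrative):

    import org.springframework.data.domain.Range
    import org.springframework.data.redis.core.StringRedisTemplate

    /**
     * Move poisoned events to a DLQ stream. XPENDING exposes each entry's delivery count;
     * anything over the threshold is copied to the DLQ, then XACKed so the group stops
     * redelivering it.
     */
    fun moveToDeadLetterQueue(
        redis: StringRedisTemplate,
        streamKey: String,
        group: String,
        dlqKey: String = "$streamKey:dlq",
        maxDeliveries: Long = 5
    ) {
        val ops = redis.opsForStream<String, String>()

        // XPENDING: inspect up to 100 pending entries for the group.
        val pending = ops.pending(streamKey, group, Range.unbounded<String>(), 100L)
        pending
            .filter { it.totalDeliveryCount >= maxDeliveries }
            .forEach { poisoned ->
                // Re-read the original payload so it can be stored for manual inspection.
                val record = ops.range(streamKey, Range.closed(poisoned.id.value, poisoned.id.value))
                    ?.firstOrNull()
                if (record != null) {
                    ops.add(dlqKey, record.value)              // XADD to the DLQ stream
                }
                ops.acknowledge(streamKey, group, poisoned.id) // XACK: stop redelivering it
            }
    }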

4. Consumer Lifecycle Management

Unlike Kafka, Redis doesn’t offer built-in consumer lifecycle tracking.

We needed a synchronized list of alive consumers to use in our hash-based routing logic.

Option 1: XINFO CONSUMERS

You can use XINFO CONSUMERS to check consumer activity, but this has caveats. If no messages exist, all consumers appear idle, and you may mistakenly remove healthy consumers.

Option 2: Custom Heartbeats

So we implemented a custom mechanism:

  • Each consumer registers its ID as a key with a TTL (e.g., SET {streamKey}:alive:nodes:{consumerId})
  • It also adds its ID to a Redis Set (e.g., SADD {streamKey}:alive:nodes:set)
  • We periodically scan the set and check if the corresponding key still exists

    • If not, we remove the ID from the set
    • If yes, the consumer is alive

Note: Redis Sets don’t support TTLs per value, so TTLs were tracked separately via keys.

Also, a wildcard lookup like {streamKey}:alive:nodes:* isn’t indexed, so finding consumers that way (KEYS or SCAN) would require a full keyspace scan. That’s why we kept the Set as the index of registered consumer IDs.

We also built a sync job to periodically clean up stale consumers from both Redis and the Stream Group.
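
Putting the heartbeat and the sync job together, here is a sketch under the same assumption (Spring Data Redis; key names follow the pattern above, and the TTL, intervals, and helper names are illustrative; deleteConsumer is assumed to wrap XGROUP DELCONSUMER):

    import java.time.Duration
    import org.springframework.data.redis.connection.stream.Consumer
    import org.springframework.data.redis.core.StringRedisTemplate

    // Key pattern from the list above: one TTL key per consumer plus a Set acting as the index.
    fun aliveKey(streamKey: String, consumerId: String) = "$streamKey:alive:nodes:$consumerId"
    fun aliveSetKey(streamKey: String) = "$streamKey:alive:nodes:set"

    /** Heartbeat: refresh my TTL key and make sure my ID is in the Set (run every few seconds). */
    fun heartbeat(redis: StringRedisTemplate, streamKey: String, myConsumerId: String) {
        redis.opsForValue().set(aliveKey(streamKey, myConsumerId), "alive", Duration.ofSeconds(10))
        redis.opsForSet().add(aliveSetKey(streamKey), myConsumerId)
    }

    /** Alive-consumer list used by the hash-based routing: Set members whose TTL key still exists. */
    fun aliveConsumerIds(redis: StringRedisTemplate, streamKey: String): List<String> =
        redis.opsForSet().members(aliveSetKey(streamKey)).orEmpty()
            .filter { redis.hasKey(aliveKey(streamKey, it)) } // TTL key gone => consumer is dead
            .sorted()

    /** Sync job: drop stale IDs from the Set and from the stream's consumer group. */
    fun sweepDeadConsumers(redis: StringRedisTemplate, streamKey: String, group: String) {
        redis.opsForSet().members(aliveSetKey(streamKey)).orEmpty()
            .filterNot { redis.hasKey(aliveKey(streamKey, it)) }
            .forEach { deadId ->
                redis.opsForSet().remove(aliveSetKey(streamKey), deadId)
                // XGROUP DELCONSUMER: also remove it from the Redis Stream consumer group.
                redis.opsForStream<String, String>().deleteConsumer(streamKey, Consumer.from(group, deadId))
            }
    }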

Conclusion

Kafka and Amazon MQ are powerful — but heavy and expensive. Redis, which we already used, turned out to be a solid lightweight alternative.

This project was especially fun because I enjoy raw, low-level architecture. I had full control over the design, automation, and reliability mechanisms — everything from hash-based load distribution to consumer heartbeat tracking.

Sure, Redis Stream lacks Kafka-style durable log storage, partitioning, and out-of-the-box consumer rebalancing. But for lightweight events like user notifications, it was a fast and reliable fit.

Most importantly, I got to deeply understand how a messaging system should work — and that made all the effort worth it.