Why Redis BigKey Is Dangerous

Redis BigKey should not be understood as simply “one large key uses a lot of memory.”

Memory is obviously part of the problem. But in practice, the more annoying part is that one large key can strain Redis’s single-threaded event loop, network transfer volume, cluster slot distribution, eviction, and replication all at the same time.

Redis is fast. But it is not fast because it magically handles every operation. It is fast because most commands finish quickly in memory and do not occupy the event loop for long. BigKey breaks that assumption.

So the BigKey problem is less about “do not put large data into Redis” and closer to “do not make one command hold Redis for too long.”

What is a BigKey?

A BigKey is literally a key that is too large.

But “large” does not only mean a case like a 10MB String value.

A String value itself is large
One Hash contains too many fields
One List, Set, or Sorted Set contains too many elements
One Stream key keeps accumulating entries

All of these can become BigKeys.

For example, user:1:profile being 2KB may not matter. But if a key like user:all:sessions stores hundreds of thousands of sessions in a single Hash, that is a BigKey.

This is more dangerous because it can look fine from the outside. DBSIZE may show only a small number of keys, so it looks okay. But in reality, data is concentrated inside one key, so one specific command can take a long time.

Why does it become a problem?

1. It occupies the Redis event loop for too long

Redis is strong when commands finish quickly. But if it reads an entire large Hash, fetches a large range from a big List, or deletes a large Set at once, other requests can be delayed while Redis processes that work.

For example:

HGETALL user:sessions
LRANGE feed:global 0 -1
SMEMBERS online:users
DEL huge:key

With small data, these may not matter. But if the internal element count is hundreds of thousands, the story changes. From the application side, Redis suddenly looks slow. In reality, the event loop is occupied by processing one large key.

So when looking at Redis latency, average CPU alone can miss the problem. Average CPU may look fine while only p99 latency spikes.
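To make the "median looks fine, p99 spikes" pattern concrete, here is a toy model of a single FIFO worker. It is not Redis itself, and the numbers (0.1 ms per normal command, one 50 ms BigKey read, one arrival per millisecond) are assumptions picked only for illustration.

```python
def simulate_fifo(service_times_ms, arrival_gap_ms=1.0):
    """Latency (queue wait + service time) of each command on one FIFO worker."""
    latencies = []
    free_at = 0.0  # time at which the single worker becomes idle
    for i, service in enumerate(service_times_ms):
        arrival = i * arrival_gap_ms
        start = max(arrival, free_at)  # wait if the worker is still busy
        free_at = start + service
        latencies.append(free_at - arrival)
    return latencies


def percentile(values, pct):
    """Simple rank-based percentile, good enough for a toy model."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]


# 1,000 commands at 0.1 ms each, except one 50 ms BigKey read in the middle.
service_times = [0.1] * 1000
service_times[500] = 50.0
latencies = simulate_fifo(service_times)

print("median ms:", round(percentile(latencies, 50), 2))
print("p99 ms:   ", round(percentile(latencies, 99), 2))
```

In this model only the commands queued behind the slow one are affected, so the median stays near the normal service time while the tail grows by tens of milliseconds.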

2. Network transfer suddenly increases

Reading a BigKey is not only slow inside Redis. Redis also has to send the result back to the client.

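Fetching one 1MB value is different from fetching one thousand 1KB values, even though the total byte count is the same: a single request has to wait for the whole 1MB to be transferred and deserialized. Especially if a BigKey is read in an API request path, Redis -> WAS network transfer increases, and deserialization cost on the WAS side also increases.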

In other words, Redis is not the only thing that becomes slow. Application threads are also consumed.
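A quick back-of-the-envelope calculation shows how fast this adds up. The read rate of 500 requests per second is a hypothetical number, and the helper only counts reply payload bytes (no protocol or TCP overhead).

```python
def reply_bandwidth_mbps(value_bytes, reads_per_second):
    """Megabits per second needed just to ship reply payloads to clients."""
    return value_bytes * reads_per_second * 8 / 1_000_000


# Hypothetical read rate of 500 requests per second on one key.
small = reply_bandwidth_mbps(1_000, 500)      # 1 KB value
big = reply_bandwidth_mbps(1_000_000, 500)    # 1 MB value
print(f"1 KB value: {small:.0f} Mbps, 1 MB value: {big:.0f} Mbps")
```

Under these assumptions the 1KB key needs a few Mbps, while the 1MB key alone needs several Gbps between Redis and the WAS.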

3. Slot imbalance occurs in Cluster

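Redis Cluster maps each key to one of 16,384 hash slots, and each slot is owned by one node. A single key cannot be split across slots, so if one key becomes too large, memory and traffic are concentrated on the node that owns that slot.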

For example, if a stream is used as a single key, that stream key belongs to one specific slot. Even if multiple consumers are attached, writes and reads eventually concentrate on the node that owns that key. I covered a similar issue in the Redis Stream Sharding post. BigKey creates the same kind of problem.

It starts as “one key should be easy to manage,” and in production it becomes “only one node is burning.”
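The key-to-slot mapping can be reproduced outside Redis. The sketch below follows the documented rule (CRC16 of the key modulo 16384, hashing only the hash tag when the key contains a non-empty `{...}` section). The sharded key names `user:events:0` through `user:events:3` are hypothetical and only there to show that separate keys can land on different slots.

```python
import binascii


def hash_slot(key: str) -> int:
    """Redis Cluster slot for a key: CRC16 (XMODEM) of the key modulo 16384."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # A hash tag only counts when there is at least one byte between the braces.
        if end != -1 and end > start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


# One stream key always maps to exactly one slot, no matter how big it gets.
print("user:events ->", hash_slot("user:events"))

# Hypothetical sharded names: each shard key gets its own slot.
for shard in range(4):
    name = f"user:events:{shard}"
    print(name, "->", hash_slot(name))
```

On a live cluster, CLUSTER KEYSLOT returns the same number, so this is mostly useful for checking a key naming scheme before deploying it.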

4. Deletion is also a problem

Large keys are not only a problem when reading. They are also a problem when deleting.

DEL huge:key has to release the memory used by the key. With small data this is fine, but if the internal element count is huge, deletion itself can hold Redis for a long time.

So when deleting a large key, consider UNLINK before DEL. UNLINK removes the key from keyspace first and moves actual memory freeing to an asynchronous path, which can reduce blocking caused by large key deletion.

Of course, this is not a silver bullet. The best option is to avoid creating large keys in the first place.

How do we find it?

First, do not run KEYS * in production. Redis documentation also mentions slow commands like KEYS as a common cause of production latency, and recommends scanning keyspace or large collections incrementally using SCAN, SSCAN, HSCAN, and ZSCAN.

To find BigKeys, I usually start with this flow:

redis-cli --bigkeys
redis-cli --memkeys
redis-cli --keystats

--bigkeys is useful for finding large keys by type, and --memkeys is useful when looking by memory usage. In recent Redis CLI documentation, --keystats groups information similar to --bigkeys and --memkeys, including size distribution.

In production, you can add sleep between scans to reduce load.

redis-cli --keystats -i 0.1

This sleeps for 0.1 seconds every 100 SCAN calls. Of course, this should not be run casually during peak traffic. It should be used carefully during monitoring or inspection windows.

If a specific key looks suspicious, check actual memory usage with MEMORY USAGE.

MEMORY USAGE user:sessions
HLEN user:sessions
LLEN feed:global
SCARD online:users
XLEN user:events

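The important point is that for a String the byte size is enough, but for Hash/List/Set/Stream you need to check the internal element count as well. For those nested types, MEMORY USAGE samples only a few elements by default (SAMPLES 5) and extrapolates, so treat its result as an estimate.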
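The same check can be scripted. This is a minimal sketch, assuming a client object that exposes redis-py style `scan(cursor=..., count=...)` and `memory_usage(key)` methods. The function name and the 1MB default threshold are my own choices for illustration, not something redis-cli provides.

```python
def find_big_keys(client, threshold_bytes=1_000_000, scan_count=100):
    """Walk the keyspace with SCAN and collect keys at or above threshold_bytes.

    `client` is assumed to expose redis-py style scan() and memory_usage().
    """
    sizes = {}  # dict, because SCAN may return the same key more than once
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, count=scan_count)
        for key in keys:
            size = client.memory_usage(key)  # None if the key vanished meanwhile
            if size is not None and size >= threshold_bytes:
                sizes[key] = size
        if cursor == 0:  # SCAN signals completion by returning cursor 0
            break
    # Largest keys first.
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)
```

With redis-py this would be called as `find_big_keys(redis.Redis(...))`. It issues one MEMORY USAGE per key, so the same caution about peak traffic applies.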

At what size should we start being careful?

There is no single correct answer. Redis documentation does not define one fixed threshold like “from X MB it is a BigKey.” The risk of the same 5MB key changes depending on whether it sits in a read path or a batch path, whether it is a cluster slot migration target, and how much client output buffer it consumes.

Still, without any operational standard, keys keep growing under the assumption that “this much should be fine.” So it is useful to look at references.

Tencent Cloud Redis documentation treats a String value over 10MB as a BigKey and gives 10,000 members as an example for Set/List.
Alibaba Cloud Tair/Redis parameter documentation sets the default BigKey element threshold for Top Key Statistics to 2,000 elements, and the memory threshold to 512MB. This is a detection parameter, not a statement that “512MB is safe.” It is closer to a default for flagging statistically large keys in a managed service.
AWS ElastiCache documentation does not recommend large composite items such as multi-GB Hashes, and explains that in cluster mode a slot containing an item with serialized size over 256MB is not migrated.

My own operating guideline is:

Type	Warning line	Danger line
String	1MB or more	10MB or more
Hash	1,000+ fields	10,000+ fields or requires HGETALL
List	5,000+ elements	10,000+ elements or requires full LRANGE
Set/ZSet	5,000+ elements	10,000+ elements or requires full SMEMBERS/ZRANGE
Stream	No trim policy	XLEN keeps increasing

This is intentionally conservative. It does not mean a 1MB String is always an incident. But if Redis is in an API read path and p99 latency matters, 1MB is already large enough. On the other hand, a 5MB key read once a day in batch may not be a problem.
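The table can be turned into a small check. This is a sketch of my guideline, not an official rule. Two things are assumptions on top of the table: MB is treated as 1024 * 1024 bytes, and a full-read command only escalates a collection to danger once it is already past its warning line.

```python
MB = 1024 * 1024  # assumption: the article's "MB" is treated as MiB here

# Element counts at which each collection type reaches the warning line.
WARNING_ELEMENTS = {"hash": 1_000, "list": 5_000, "set": 5_000, "zset": 5_000}
DANGER_ELEMENTS = 10_000


def classify(key_type, size_bytes=0, elements=0, full_read=False,
             has_trim_policy=True, growing=False):
    """Return 'ok', 'warning', or 'danger' for one key."""
    if key_type == "string":
        if size_bytes >= 10 * MB:
            return "danger"
        return "warning" if size_bytes >= 1 * MB else "ok"
    if key_type == "stream":
        if growing:  # XLEN observed to keep increasing
            return "danger"
        return "ok" if has_trim_policy else "warning"
    warning_at = WARNING_ELEMENTS[key_type]
    if elements >= DANGER_ELEMENTS:
        return "danger"
    if elements >= warning_at:
        # Assumption: a full read (HGETALL, SMEMBERS, ...) only escalates
        # keys that are already past the warning line.
        return "danger" if full_read else "warning"
    return "ok"


print(classify("hash", elements=1_500, full_read=True))
print(classify("string", size_bytes=2 * MB))
```

The inputs would come from MEMORY USAGE, HLEN/LLEN/SCARD/ZCARD/XLEN, and from knowing which commands the application actually runs against the key.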

So I usually look at it like this:

Any String over 1MB is suspicious.
For collection types, if it exceeds 5,000 elements, check whether it can be split.
If a collection exceeds 10,000 elements and has a full-read command, I almost always treat it as a fix target.
Any key over 100MB is an operational incident candidate and needs a removal/splitting plan.
A key near 256MB can block slot migration in cluster operations, so it should not be left alone.

The standard is not “X MB means incident.” It should be “is this key ever read, written, deleted, or moved in one shot?”

How do we prevent it?

1. Split keys

The simplest and most effective method is to split keys.

For example, instead of putting all sessions into one Hash, split them by user.

Bad:
user:sessions

Better:
user:{userId}:sessions

List/feed data can also be split by page or bucket.

feed:global:2026-05-21:0
feed:global:2026-05-21:1
feed:global:2026-05-21:2

Then reads and deletes can both be handled in smaller chunks.
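Two ways of deriving the smaller key names are sketched here. `feed_page_key` mirrors the date/page layout above. `bucket_key` is an extra assumption of mine for data with no natural page, such as a big Set: it spreads members over a fixed number of buckets with a stable hash. The bucket count of 16 is arbitrary.

```python
import zlib


def feed_page_key(day, page):
    """Date/page-based key in the same shape as feed:global:2026-05-21:0."""
    return f"feed:global:{day}:{page}"


def bucket_key(base, member, buckets=16):
    """Pick one of `buckets` smaller keys for a member.

    zlib.crc32 is used instead of hash() because Python's built-in string
    hash is randomized per process, which would scatter the same member
    across different buckets after a restart.
    """
    index = zlib.crc32(member.encode("utf-8")) % buckets
    return f"{base}:{index}"


print(feed_page_key("2026-05-21", 0))
print(bucket_key("online:users", "user-42"))
```

The trade-off with hashed buckets is that counting or listing everything now touches every bucket, so it fits best when most operations target a single member.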

2. Avoid full-read commands

Be careful with these commands on large collections:

HGETALL
SMEMBERS
LRANGE key 0 -1
ZRANGE key 0 -1

Fetch only what is needed.

HSCAN user:sessions 0 COUNT 100
SSCAN online:users 0 COUNT 100
LRANGE feed:global 0 49
ZRANGE ranking 0 99

If you use Redis as a cache but read the entire dataset at once, you are no longer using it like a cache. You are using it like a small DB. Then Redis gradually loses its advantage.
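One detail the HSCAN and SSCAN lines above do not show: cursor 0 is only the first call. The caller has to keep passing the returned cursor back until Redis returns 0 again. A minimal sketch of that loop, assuming a client with a redis-py style `hscan(name, cursor=..., count=...)` method that returns `(next_cursor, dict_of_fields)`:

```python
def iter_hash_in_batches(client, key, count=100):
    """Yield field/value batches from one Hash by following the HSCAN cursor."""
    cursor = 0
    while True:
        cursor, batch = client.hscan(key, cursor=cursor, count=count)
        if batch:
            yield batch  # process a small chunk, then come back for more
        if cursor == 0:  # a returned cursor of 0 means the iteration is done
            break
```

COUNT is only a hint, so batch sizes can vary, and a field may show up more than once if the Hash changes during iteration. Whatever consumes the batches should tolerate both.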

3. Always define a trimming policy for Stream

A Redis Stream easily becomes a BigKey if left alone.

XADD user:events * userId 1 event login

If you keep adding entries without trimming, one stream key keeps growing. So Stream should have a MAXLEN policy from the beginning.

XADD user:events MAXLEN ~ 100000 * userId 1 event login

The exact limit depends on the service. The important point is that capacity control should be designed at produce time, not left as “we will delete it someday.”
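One way to pick the number is to work backwards from how much history consumers need. The sketch below is only arithmetic plus argument building. The traffic figures (50 events per second, 30 minutes of retention, 20% headroom) are hypothetical, and both helper names are mine.

```python
import math


def stream_maxlen(events_per_second, retention_seconds, headroom_percent=20):
    """Entries to keep so that roughly `retention_seconds` of traffic survives trimming."""
    entries = events_per_second * retention_seconds
    return math.ceil(entries * (100 + headroom_percent) / 100)


def xadd_args(key, fields, maxlen):
    """Argument list for XADD with approximate trimming (MAXLEN ~)."""
    args = ["XADD", key, "MAXLEN", "~", str(maxlen), "*"]
    for name, value in fields.items():
        args += [name, str(value)]
    return args


limit = stream_maxlen(events_per_second=50, retention_seconds=30 * 60)
print(limit)
print(" ".join(xadd_args("user:events", {"userId": 1, "event": "login"}, limit)))
```

With redis-py the same intent is usually written as `xadd(name, fields, maxlen=limit, approximate=True)` rather than building the argument list by hand.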

4. Prefer UNLINK for deleting large keys

If you find a BigKey in production, running DEL immediately can be risky.

UNLINK huge:key

Prefer asynchronous memory release. Of course, even this should be done after checking impact, not casually during peak traffic.

My criteria

When I look at Redis BigKey, I use these questions:

Is this structure designed so that one key keeps growing?
Can a full-read command be executed against this key?
Does deletion/expiration release a large amount of memory at once?
Does it concentrate load on a specific slot/node in cluster mode?
Does this data really need to exist in Redis in this shape?

If the first two are both yes, I assume it will eventually break.

Redis is fast, but to use it fast, data should be split small and commands should finish quickly. BigKey is the opposite. It concentrates data into one key and makes one command do too much work.

So BigKey is not simply a memory optimization issue. It is closer to whether Redis is being used like Redis.

Summary

Redis BigKey is not just about one large key consuming a lot of memory.

It spikes command latency
It increases network transfer
It creates cluster slot/node imbalance
It creates blocking risk during deletion/expiration
It can increase replication and persistence cost

The solution is simple.

Do not create large keys. Split keys. Avoid full reads. For stream/list/set/hash, design capacity limits from the beginning.

Redis is strong when it processes small jobs quickly. So the most important thing is not to throw large jobs at Redis. One reason I want to shard Streams across multiple keys is exactly this issue.