I'm investigating whether to cache large datasets in Redis.
The largest of the datasets holds approximately 5 million objects. Although each object has a unique identifier, the client never uses them individually; aggregate and join operations are performed on the whole dataset.
The target environment is 4 servers, each with 144 GB of RAM, 24 cores and gigabit network cards, running Windows Server 2008 R2 Enterprise. To that end I've installed 10 instances of Redis-64.2.6.12.1 from Microsoft Open Technologies on each box, and I'm using ServiceStack's Redis client.
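For context, here's a minimal sketch of how the 40 instances get wired into ServiceStack's client manager. The hostnames and ports are placeholders, and using one ShardedConnectionPool per Redis instance is my assumption about how to get the hashing to span all 40 caches:

```csharp
using System.Linq;
using ServiceStack.Redis;

// Placeholder hostnames; each box runs 10 Redis instances on consecutive ports.
var boxes = new[] { "redis-box1", "redis-box2", "redis-box3", "redis-box4" };

// One ShardedConnectionPool per Redis instance (40 in total) so the manager's
// consistent hashing spreads chunk ids across every cache.
var pools = boxes
    .SelectMany(host => Enumerable.Range(0, 10)
        .Select(n =>
        {
            var endpoint = host + ":" + (6379 + n);
            return new ShardedConnectionPool(endpoint, 1, endpoint);
        }))
    .ToArray();

var shardedManager = new ShardedRedisClientManager(pools);
```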
I've sharded the data into chunks of 1000 objects (this seems to give the best performance) and used ShardedRedisClientManager to hash each chunk id, distributing the data across the 40 caches. An object map is also persisted so that the client application can retrieve all the objects using just the dataset id. Redis lists are used for both the objects and the object map.
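To make that layout concrete, the write path looks roughly like the sketch below. MyObject, the key naming and the GetProxy call used to pick a shard are placeholders/assumptions on my part; ToJson comes from ServiceStack.Text:

```csharp
using System.Collections.Generic;
using System.Linq;
using ServiceStack.Redis;
using ServiceStack.Text;

public class DatasetCache
{
    private readonly ShardedRedisClientManager manager;

    public DatasetCache(ShardedRedisClientManager manager)
    {
        this.manager = manager;
    }

    // Each chunk of 1000 objects becomes one Redis list keyed by its chunk id;
    // GetProxy(chunkKey) resolves the shard that key hashes to.
    public void StoreChunk(string chunkKey, IEnumerable<MyObject> chunk)
    {
        using (var client = manager.GetProxy(chunkKey).GetClient())
        {
            client.AddRangeToList(chunkKey, chunk.Select(o => o.ToJson()).ToList());
        }
    }

    // The object map is another Redis list, keyed by the dataset id, containing
    // the chunk keys so everything can be recovered from the dataset id alone.
    public void StoreObjectMap(string datasetId, List<string> chunkKeys)
    {
        using (var client = manager.GetProxy(datasetId).GetClient())
        {
            client.AddRangeToList(datasetId, chunkKeys);
        }
    }
}

public class MyObject { /* placeholder for the real object type */ }
```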
Transactions didn't improve performance, but grouping the chunks by connection and processing them in parallel did. The performance is still unsatisfactory, however: the best time to set and then get 5 million objects plus the object map is 268,055 ms (roughly 4.5 minutes).
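For reference, the parallel read path is roughly the following: group chunk keys by the shard they hash to, then pipeline the list reads on each connection. The GetProxy call, the stability of grouping on the pool it returns, and the QueueCommand overload with a result callback are my assumptions about the exact API shape:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using ServiceStack.Redis;

public static class DatasetReader
{
    // Reads every chunk back: one degree of parallelism per shard, with the
    // list reads on each connection pipelined so round-trips overlap.
    public static List<string> GetAllChunks(ShardedRedisClientManager manager, IEnumerable<string> chunkKeys)
    {
        var results = new ConcurrentBag<string>();

        // Group chunk keys by the shard (connection pool) they hash to.
        var byShard = chunkKeys.GroupBy(key => manager.GetProxy(key));

        Parallel.ForEach(byShard, group =>
        {
            using (var client = group.Key.GetClient())
            using (var pipeline = client.CreatePipeline())
            {
                foreach (var key in group)
                {
                    pipeline.QueueCommand(
                        r => r.GetAllItemsFromList(key),
                        items => { foreach (var item in items) results.Add(item); });
                }
                pipeline.Flush();
            }
        });

        return results.ToList();
    }
}
```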
So, is there a better approach to caching large datasets with Redis? Is it even reasonable to cache datasets of this size? Or should I make do with serializing to disk and move the processing to the data, à la Hadoop?