I'm investigating whether to cache large datasets in Redis.
The largest of the datasets holds approximately 5 million objects. Although each object has a unique identifier, the client never uses them individually; aggregate and join operations are performed on the whole dataset.
The target environment is 4 servers, each with 144 GB of RAM, 24 cores and gigabit network cards, running Windows Server 2008 R2 Enterprise. To that end I've installed 10 instances of Redis-64.2.6.12.1 from Microsoft Open Technologies on each box, and I'm using ServiceStack's Redis client.
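For context, here's a minimal sketch of how the 40 instances get wired into ServiceStack's client manager. The hostnames and ports are placeholders, and using one ShardedConnectionPool per Redis instance is my assumption about how to get the hashing to span all 40 caches:

```csharp
using System.Linq;
using ServiceStack.Redis;

// Placeholder hostnames; each box runs 10 Redis instances on consecutive ports.
var boxes = new[] { "redis-box1", "redis-box2", "redis-box3", "redis-box4" };

// One ShardedConnectionPool per Redis instance (40 in total) so the manager's
// consistent hashing spreads chunk ids across every cache.
var pools = boxes
    .SelectMany(host => Enumerable.Range(0, 10)
        .Select(n =>
        {
            var endpoint = host + ":" + (6379 + n);
            return new ShardedConnectionPool(endpoint, 1, endpoint);
        }))
    .ToArray();

var shardedManager = new ShardedRedisClientManager(pools);
```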
I've sharded the data into chunks of 1000 objects (this seems to give the best performance) and used ShardedRedisClientManager to hash each chunk id, distributing the data across the 40 caches. An object map is also persisted so that the client application can retrieve all the objects using just the dataset id. Redis lists are used for both the objects and the object map.
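To make that layout concrete, the write path looks roughly like the sketch below. MyObject, the key naming and the GetProxy call used to pick a shard are placeholders/assumptions on my part; ToJson comes from ServiceStack.Text:

```csharp
using System.Collections.Generic;
using System.Linq;
using ServiceStack.Redis;
using ServiceStack.Text;

public class DatasetCache
{
    private readonly ShardedRedisClientManager manager;

    public DatasetCache(ShardedRedisClientManager manager)
    {
        this.manager = manager;
    }

    // Each chunk of 1000 objects becomes one Redis list keyed by its chunk id;
    // GetProxy(chunkKey) resolves the shard that key hashes to.
    public void StoreChunk(string chunkKey, IEnumerable<MyObject> chunk)
    {
        using (var client = manager.GetProxy(chunkKey).GetClient())
        {
            client.AddRangeToList(chunkKey, chunk.Select(o => o.ToJson()).ToList());
        }
    }

    // The object map is another Redis list, keyed by the dataset id, containing
    // the chunk keys so everything can be recovered from the dataset id alone.
    public void StoreObjectMap(string datasetId, List<string> chunkKeys)
    {
        using (var client = manager.GetProxy(datasetId).GetClient())
        {
            client.AddRangeToList(datasetId, chunkKeys);
        }
    }
}

public class MyObject { /* placeholder for the real object type */ }
```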
Transactions didn't improve performance, but grouping the chunks by connection and processing them in parallel did. The performance is still unsatisfactory, however: the best time to set and then get 5 million objects plus the object map is 268,055 ms (roughly 4.5 minutes).
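For reference, the parallel read path is roughly the following: group chunk keys by the shard they hash to, then pipeline the list reads on each connection. The GetProxy call, the stability of grouping on the pool it returns, and the QueueCommand overload with a result callback are my assumptions about the exact API shape:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using ServiceStack.Redis;

public static class DatasetReader
{
    // Reads every chunk back: one degree of parallelism per shard, with the
    // list reads on each connection pipelined so round-trips overlap.
    public static List<string> GetAllChunks(ShardedRedisClientManager manager, IEnumerable<string> chunkKeys)
    {
        var results = new ConcurrentBag<string>();

        // Group chunk keys by the shard (connection pool) they hash to.
        var byShard = chunkKeys.GroupBy(key => manager.GetProxy(key));

        Parallel.ForEach(byShard, group =>
        {
            using (var client = group.Key.GetClient())
            using (var pipeline = client.CreatePipeline())
            {
                foreach (var key in group)
                {
                    pipeline.QueueCommand(
                        r => r.GetAllItemsFromList(key),
                        items => { foreach (var item in items) results.Add(item); });
                }
                pipeline.Flush();
            }
        });

        return results.ToList();
    }
}
```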
So, is there a better approach to caching large datasets with Redis? Is it even reasonable to cache datasets of this size? Or should I make do with serializing to disk and move the processing to the data, à la Hadoop?