I have a huge amount of data with the following structure:
(ID1, ID2) -> value
(12, 243) -> 7712
(311, 63) -> 123
...
The raw, uncompressed binary data is only about 2.5 GiB, so it should basically fit into my 16 GiB of RAM. But I'm running into trouble with the allocation overhead of many small objects.
Previously I stored all values with the same ID1 in one object, so the map inside the object only had to store ID2 and the value. This grouping was also useful for the context the data is actually used in.
class DataStore {
    int ID1;
    std::unordered_map<int, int> data; // ID2 -> value
};
The problem is that there is seemingly a huge overhead when loading millions of those objects... Simply loading all the data this way exceeds my RAM, no matter whether I use
DataStore* ds = new DataStore();
or just
DataStore ds = DataStore();
for storing them inside a vector or another sort of container.
Storing them in a nested hash map reduces the overhead (now it fits into my RAM), but it's still about 9.2 GiB - I wonder if that isn't reducible...
So, I tried
google::sparse_hash_map<int, google::sparse_hash_map<int, int>>
as well as std::map and std::unordered_map, but google::sparse_hash_map is currently the best one, with the mentioned 9.2 GiB...
I also tried
std::map<std::pair<int, int>, int>
with a worse result.
I tried combining both keys into a single 64-bit integer with bit shifting, but insert performance was incredibly slow for some reason... So I have no idea what the memory usage would have been; it was far too slow, even with google::sparse_hash_map...
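For reference, the bit-shifting approach I mean looks roughly like this (a sketch, assuming both IDs fit in 32 bits; the function names are just for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Pack two 32-bit IDs into one 64-bit key: ID1 in the high half,
// ID2 in the low half.
inline std::uint64_t pack_key(std::uint32_t id1, std::uint32_t id2) {
    return (static_cast<std::uint64_t>(id1) << 32) | id2;
}

// Recover the original IDs from a packed key.
inline std::uint32_t unpack_id1(std::uint64_t key) {
    return static_cast<std::uint32_t>(key >> 32);
}
inline std::uint32_t unpack_id2(std::uint64_t key) {
    return static_cast<std::uint32_t>(key & 0xFFFFFFFFu);
}
```

The packing itself is just two cheap bit operations, so the slowness presumably came from how the hash table handled those keys, not from the packing.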
I read some things about pool allocators, but I'm not sure whether multiple containers can share the same pool, which I think would be the only useful scenario... I'm also not sure which pool allocator to use - if that would be a good idea at all - suggestions?
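To clarify what I mean by "multiple containers in the same pool", here is a minimal sketch using std::pmr from C++17 (assuming C++17 is an option; the struct name is made up):

```cpp
#include <cassert>
#include <memory_resource>
#include <unordered_map>

// Two separate maps drawing their nodes from the same pool resource,
// so node allocations are batched into large upstream blocks instead
// of one heap allocation per node.
struct PooledStore {
    std::pmr::unsynchronized_pool_resource pool;
    std::pmr::unordered_map<int, int> by_id2{&pool};  // ID2 -> value
    std::pmr::unordered_map<int, int> other{&pool};   // another container, same pool
};
```

Both maps keep using the pool for every rehash and insert, so the per-node malloc overhead should largely disappear - if I understand pmr correctly.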
Lookup performance is pretty important, so a plain vector isn't optimal... Insert performance shouldn't be too bad either...
EDIT:
boost::container::flat_map could be as memory-efficient as I need, but inserting is incredibly slow... I tried both
boost::container::flat_map<std::pair<unsigned int, unsigned int>, unsigned int>
boost::container::flat_map<unsigned int, boost::container::flat_map<unsigned int, unsigned int>>
Maybe I can only have either good insert & lookup performance or memory efficiency, not both...
EDIT2:
google::sparse_hash_map<std::pair<unsigned int, unsigned int>, unsigned int, boost::hash<std::pair<unsigned int, unsigned int>>>
This uses about 6.3 GiB, and with some optimizations only 4.9 GiB (the values of (123, 45) and (45, 123) should be equal, so I only need to store one of them :) Really good - thanks!
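The (123, 45) == (45, 123) optimization amounts to normalizing the key order before every insert and lookup - a sketch, assuming the pairs are unordered (the helper name is made up):

```cpp
#include <cassert>
#include <utility>

// Normalize so the smaller ID always comes first; (45, 123) and
// (123, 45) then map to the same key, halving the stored entries.
inline std::pair<unsigned, unsigned> make_key(unsigned a, unsigned b) {
    return (a <= b) ? std::make_pair(a, b) : std::make_pair(b, a);
}
```

Every insert and lookup goes through make_key, so the map never sees both orderings of the same pair.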
EDIT3:
In my context, because I now use pairs as keys instead of nested hashmaps, this is useless for most of my purposes: I often have to do things like "get all pairs which contain ID 12231", so I have to iterate through all entries... Looking up one specific entry is nice, but searching for a "half pair" is not. Another index would consume almost as much memory as the nested-hashmap solution... So I will simply stick to my old approach (serialize the data and save it in some LevelDB), because it's incredibly fast, at about 50 ms for a specific part I need, compared to about two seconds for a whole iteration :) But thanks for all the advice :)