
I have a huge amount of data with the following structure:

(ID1, ID2) -> value  
(12, 243) -> 7712  
(311, 63) -> 123  
...

The pure, binary, uncompressed data is only about 2.5 GiB, so it should basically fit into my 16 GiB of RAM. But I'm running into trouble with the allocation overhead of many small objects.
Previously I stored all values with the same ID1 in the same object, so the map inside the object only stores ID2 and the value. This grouping is also useful for the actual context it's used in.

class DataStore {
    int ID1;
    std::unordered_map<int, int> data; // ID2 -> value
};

The problem is that there is seemingly a huge per-object overhead when loading millions of those objects... Simply loading all the data this way clearly exceeds my RAM, no matter whether I use

DataStore* ds = new DataStore();

or just

DataStore ds = DataStore();

for storing them inside a vector or another sort of container.

Storing them in a nested hash map reduces the overhead (now it fits into my RAM), but it still takes about 9.2 GiB, and I wonder whether that can't be reduced further...

So, I tried

google::sparse_hash_map<int, google::sparse_hash_map<int, int>> 

as well as std::map and std::unordered_map, but google::sparse_hash_map is currently the best one, at the 9.2 GiB mentioned above...
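
For reference, this is roughly how I fill and query the nested variant (just a sketch; the sparsehash include path may differ between versions):

    #include <sparsehash/sparse_hash_map> // may be <google/sparse_hash_map> in older versions

    using Inner = google::sparse_hash_map<int, int>;   // ID2 -> value
    using Outer = google::sparse_hash_map<int, Inner>; // ID1 -> (ID2 -> value)

    void insert(Outer& t, int id1, int id2, int value) {
        t[id1][id2] = value; // creates the inner map on first access
    }

    // Returns nullptr if the pair is not stored.
    const int* lookup(const Outer& t, int id1, int id2) {
        auto outer = t.find(id1);
        if (outer == t.end()) return nullptr;
        auto inner = outer->second.find(id2);
        if (inner == outer->second.end()) return nullptr;
        return &inner->second;
    }

Every inner map carries its own bookkeeping, which is presumably where a good part of the 9.2 GiB goes.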

I also tried

std::map<std::pair<int, int>, int> 

with a worse result.
I also tried combining both keys into a single 64-bit integer by shifting, but insert performance was incredibly slow for some reason... So I have no idea how the memory usage would have turned out; it was simply far too slow, even with google sparsehash...
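
For clarity, this is the kind of key packing I mean (a sketch with std::unordered_map; my actual test also used google sparsehash):

    #include <cstdint>
    #include <unordered_map>

    // Pack both 32-bit IDs into one 64-bit key: ID1 in the high bits, ID2 in the low bits.
    inline std::uint64_t pack(std::uint32_t id1, std::uint32_t id2) {
        return (static_cast<std::uint64_t>(id1) << 32) | id2;
    }

    inline std::uint32_t unpack_id1(std::uint64_t key) { return static_cast<std::uint32_t>(key >> 32); }
    inline std::uint32_t unpack_id2(std::uint64_t key) { return static_cast<std::uint32_t>(key); }

    std::unordered_map<std::uint64_t, int> flat; // packed (ID1, ID2) -> value

    void example() {
        flat[pack(12, 243)] = 7712; // same entry as (12, 243) -> 7712 above
    }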

I've read a bit about pool allocators, but I'm not sure whether multiple containers can allocate from the same pool, which I think is the only scenario where it would help... I'm also not sure which pool allocator to use, if that is a good idea at all. Suggestions?
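
To make the question concrete: what I have in mind is something like the C++17 std::pmr facilities (not benchmarked, just a sketch), where several containers share one pool resource:

    #include <memory_resource>
    #include <unordered_map>

    int main() {
        // One pool that hands out memory in large chunks; every container
        // constructed with it allocates its nodes from the same pool.
        std::pmr::unsynchronized_pool_resource pool;

        std::pmr::unordered_map<int, int> a{&pool}; // ID2 -> value for one ID1
        std::pmr::unordered_map<int, int> b{&pool}; // another ID1, same pool

        a[243] = 7712;
        b[63]  = 123;
    }

If a dedicated pool library like Boost.Pool would be a better fit here, that's exactly the kind of suggestion I'm asking for.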

Lookup performance is pretty important, so a plain vector is not ideal... Insert performance shouldn't be too bad either...

EDIT:

boost::flat_map could be as memory-efficient as I need, but inserting is incredibly slow... I tried both

boost::container::flat_map<std::pair<unsigned int, unsigned int>, unsigned int>
boost::container::flat_map<unsigned int, boost::container::flat_map<unsigned int, unsigned int>>

Maybe I can only have either good insert & lookup performance or memory efficiency, not both...
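
If I revisit this, the only realistic way seems to be the bulk-build approach suggested in the comments: append everything to a plain vector first, sort once, and look up via binary search (which is essentially what flat_map stores internally). A rough sketch:

    #include <algorithm>
    #include <utility>
    #include <vector>

    using Key   = std::pair<unsigned int, unsigned int>; // (ID1, ID2)
    using Entry = std::pair<Key, unsigned int>;          // key -> value

    std::vector<Entry> entries;

    // Phase 1: just append, no sorted insertion.
    void load(unsigned int id1, unsigned int id2, unsigned int value) {
        entries.push_back({{id1, id2}, value});
    }

    // Phase 2: sort once after everything is loaded.
    void finalize() {
        std::sort(entries.begin(), entries.end(),
                  [](const Entry& a, const Entry& b) { return a.first < b.first; });
    }

    // Lookup by binary search, O(log n).
    const unsigned int* lookup(unsigned int id1, unsigned int id2) {
        const Key k{id1, id2};
        auto it = std::lower_bound(entries.begin(), entries.end(), k,
                                   [](const Entry& e, const Key& key) { return e.first < key; });
        return (it != entries.end() && it->first == k) ? &it->second : nullptr;
    }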

EDIT2:

google::sparse_hash_map<std::pair<unsigned int, unsigned int>, unsigned int, boost::hash<std::pair<unsigned int, unsigned int>>>

Uses about 6.3 GiB, and with some optimizations only 4.9 GiB (the values of (123,45) and (45,123) are supposed to be equal, so only one of the two orderings needs to be stored :) Really good - thanks!
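
For reference, a sketch of how that symmetry can be exploited: normalize the pair before every insert and lookup so only one of the two orderings is ever stored (the include path may again differ between sparsehash versions):

    #include <utility>
    #include <boost/functional/hash.hpp>
    #include <sparsehash/sparse_hash_map> // may be <google/sparse_hash_map> in older versions

    using Key = std::pair<unsigned int, unsigned int>;

    // Store each unordered pair only once by putting the smaller ID first.
    inline Key normalize(unsigned int a, unsigned int b) {
        return (a <= b) ? Key{a, b} : Key{b, a};
    }

    google::sparse_hash_map<Key, unsigned int, boost::hash<Key>> table;

    void insert(unsigned int id1, unsigned int id2, unsigned int value) {
        table[normalize(id1, id2)] = value;
    }

    const unsigned int* lookup(unsigned int id1, unsigned int id2) {
        auto it = table.find(normalize(id1, id2));
        return it == table.end() ? nullptr : &it->second;
    }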

EDIT3:

In my context, because I now use pairs as keys instead of nested hash maps, this is useless for most of my purposes: I often have to do things like "get all pairs which contain ID 12231", which means iterating through all entries... Looking up one specific entry is fine, but searching by a "half pair" is not. An additional index consumes almost as much memory as the nested hashmap solution... So I will simply stick with my old approach (serialize the data and store it in a LevelDB), because it's incredibly fast: about 50 ms for a specific part I need, compared to about two seconds for a whole iteration :) But thanks for all the advice :)
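
To illustrate the problem with the pair-keyed layout for my access pattern, a sketch (using std::unordered_map to keep it standard): the "all pairs containing one ID" query degenerates to a full scan, while the nested layout answers the ID1 half of it with a single outer lookup:

    #include <unordered_map>
    #include <utility>
    #include <vector>
    #include <boost/functional/hash.hpp>

    using Key       = std::pair<unsigned int, unsigned int>;
    using PairTable = std::unordered_map<Key, unsigned int, boost::hash<Key>>; // flat, pair-keyed
    using InnerMap  = std::unordered_map<unsigned int, unsigned int>;          // ID2 -> value
    using Nested    = std::unordered_map<unsigned int, InnerMap>;              // ID1 -> inner map

    // Pair-keyed layout: "all pairs containing this ID" scans every entry.
    std::vector<Key> pairs_containing(const PairTable& t, unsigned int id) {
        std::vector<Key> result;
        for (const auto& kv : t) // O(total number of entries)
            if (kv.first.first == id || kv.first.second == id)
                result.push_back(kv.first);
        return result;
    }

    // Nested layout: the same query restricted to ID1 == id is one hash lookup.
    const InnerMap* pairs_with_id1(const Nested& t, unsigned int id) {
        auto it = t.find(id);
        return it == t.end() ? nullptr : &it->second;
    }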

  • Something like `std::vector<std::pair<std::pair<int, int>, int>>` (or the data wrapped in some appropriate custom type) should have next to no memory overhead. How are you using the `vector`? – Baum mit Augen Jan 14 '18 at 15:32
  • That's good for storing the data, but has a terrible lookup performance... And I heavily depend on that... I should add that to the question :) – dvs23 Jan 14 '18 at 15:34
  • Are you sure? If you sort the vector, binary search should easily beat `std::map`. – Baum mit Augen Jan 14 '18 at 15:35
  • `boost::flat_map` ? – rustyx Jan 14 '18 at 15:35
  • ^ Yeah, that's an implementation of the above idea. – Baum mit Augen Jan 14 '18 at 15:36
  • If you need lookup performance then use unordered_maps or sorted vectors. – Öö Tiib Jan 14 '18 at 15:37
  • Btw, if you want to go down the `unordered_map` path (which *might* win in lookup speed), use `std::vector<std::pair<int, int>, int>`, not that weird list of maps. – Baum mit Augen Jan 14 '18 at 15:39
  • flat_map looks good, I'll check that – dvs23 Jan 14 '18 at 15:47
  • Have you tried `std::unordered_map<std::pair<int, int>, int>`? (Wrote `vector` in my previous comment on accident.) – Baum mit Augen Jan 14 '18 at 16:04
  • flat_map is really, really slow at inserting the data... Previously, with bad memory usage, I was able to insert about 1,000,000 per minute; now it's only a few 10,000, maybe 100,000... (not single entries, but counted in blocks grouped by ID1) – dvs23 Jan 14 '18 at 16:05
  • @BaummitAugen For that one I need a hash function, don't I? I'm not sure what to use there... – dvs23 Jan 14 '18 at 16:06
  • Searching "c++ unordered_map with pair as key" might help with that. Pretty sure I even posted one myself somewhere on SO. – Baum mit Augen Jan 14 '18 at 16:07
  • https://stackoverflow.com/questions/32685540/unordered-map-with-pair-as-key-not-compiling?lq=1 – dvs23 Jan 14 '18 at 16:11
  • sparse_hash_map with boost::hash seems to be pretty great :) I'll post the final memory usage, insert speed is fine :) – dvs23 Jan 14 '18 at 16:23
  • @BaummitAugen BTW, funny coincidence - I applied for an internship at IAIS some days ago, so maybe (if I get a positive response) I'll be pretty close to your workspace in some weeks :D – dvs23 Jan 14 '18 at 16:50
  • Hah, neat. =D Good luck! – Baum mit Augen Jan 14 '18 at 16:58
  • Thank you :) Maybe we'll run over each other - I'll look for a @BaummitAugen ;) – dvs23 Jan 14 '18 at 17:34
  • What range of values can `ID1` have? Is it like zero to a billion with sparsely distributed integers, or fairly packed (ex: zero to one million, with all numbers in that range being used)? –  Jan 15 '18 at 13:06
  • sparsely distributed integers – dvs23 Jan 15 '18 at 13:23
  • I see. In that case, what I recommend just as a general guideline is not to think of what you're storing as `map<pair<int, int>, int>`, but rather `multimap<int, pair<int, int>>`. That is, think of your *value* as a pair of integers (ID2 and value), and think of your *key* as a single integer (ID1). That's not to say to use map and multimap; an RB-tree is probably the most explosive in memory here. But think of it like a multimap problem (unordered_multimap, e.g.), where you are looking for multiple matches to a single key (ID1 match) and then linearly traversing... – Jan 15 '18 at 13:27
  • ... the matches to look for an ID2 match on top, and then you have the value. That should generally lead to the tightest memory use. It does mean you do have to loop through the ID1 matches for an ID2 match, but just a teeny bit of grunt work for half the key size. –  Jan 15 '18 at 13:27
  • The other is just that if you store like a million full-fledged containers (as in ones that own their memory, keep track of their size, etc) of any sort, that will tend to be explosive in memory use no matter what. Also not sure I explained so clearly in the above about halving the key size in exchange for doubling the value size. Many containers tend to use more memory for the keys and possible keys than the values, e.g. You'll tend to inflate their memory usage more with bigger keys than bigger values. –  Jan 15 '18 at 13:30
  • Hm, sounds good - maybe I'll try that, too :) – dvs23 Jan 15 '18 at 13:34
  • Cheers. The other benefit of treating just ID1 as key for the container is just that you can more easily make the search times more predictable.. like it might be constant-time to retrieve ID1 key matches and then linear-time among the matches to look for an ID2 sub-match... but linear-time might be dirt cheap here and could be cheaper and more predictable than, say, a hash table using a paired key which has a boatload of collisions. So I often find it's easier to control memory use and also get more predictable performance out of this kind of "half-key" approach when dealing with paired keys. –  Jan 15 '18 at 13:37
  • I moved [this discussion to a chat](http://chat.stackoverflow.com/rooms/163190/discussion-between-dvs23-and-team-upvote) - there are really many comments :D – dvs23 Jan 15 '18 at 13:39
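
A minimal sketch of the "half-key" layout suggested in the comments just above, assuming std::unordered_multimap (the commenter explicitly warned that a tree-based multimap would be worse for memory): the key is ID1 alone, the value carries (ID2, value), and the ID2 match is a short linear walk over the entries sharing that ID1:

    #include <unordered_map>
    #include <utility>

    // Key: ID1 only. Value: (ID2, value). A full (ID1, ID2) lookup is one hash
    // probe plus a linear walk over the entries that share this ID1.
    using HalfKeyed = std::unordered_multimap<unsigned int, std::pair<unsigned int, unsigned int>>;

    void insert(HalfKeyed& t, unsigned int id1, unsigned int id2, unsigned int value) {
        t.emplace(id1, std::make_pair(id2, value));
    }

    const unsigned int* lookup(const HalfKeyed& t, unsigned int id1, unsigned int id2) {
        auto range = t.equal_range(id1); // all entries with this ID1
        for (auto it = range.first; it != range.second; ++it)
            if (it->second.first == id2)
                return &it->second.second;
        return nullptr;
    }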
