
I need to store around 200,000 SHA256 hashes in binary form in memory.
My requirements are:

  • The data structure should be as memory efficient as possible.
  • I will be reading the hashes back in sorted order (insertion order is NOT important), so a data structure that supports lexicographical traversal is preferable.
  • It would be a plus (though not mandatory) if two structures of the same type could be compared to find the hashes common to both.

Here are the data structures I have considered:

Arrays:

Arrays seem to be the simplest and most memory-efficient option, but I cannot use plain arrays because:

  1. I would have to sort the data myself before reading it back; the structure itself does not keep anything in order.
  2. Since 200K hashes is not a hard limit (the count can grow beyond that), I don't know the size beforehand to allocate the array length. This means I may sometimes need to resize the array by copying its entire contents into a new array, keeping both the old and the new array in memory at the same time. (A rough sketch of what I mean is below.)
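For concreteness, this is roughly the array-based approach I mean; a minimal sketch, assuming each hash arrives as a 32-byte `byte[]` (the class name `HashArray` and its growth policy are just for illustration):

```java
import java.util.Arrays;

// Rough sketch of the plain-array idea: append hashes into one flat byte[]
// (32 bytes per hash), grow by copying, and sort lexicographically on read.
class HashArray {
    private static final int HASH_LEN = 32;            // SHA-256 digest size
    private byte[] data = new byte[HASH_LEN * 1024];   // grows by copying
    private int count = 0;

    void add(byte[] hash) {
        if ((count + 1) * HASH_LEN > data.length) {
            // Issue 2: old and new backing arrays are both live during this copy.
            data = Arrays.copyOf(data, data.length * 2);
        }
        System.arraycopy(hash, 0, data, count * HASH_LEN, HASH_LEN);
        count++;
    }

    // Issue 1: sorting has to be done explicitly before reading back.
    // For simplicity this copies each hash out; a real version would sort
    // the fixed-width records inside the flat array in place.
    byte[][] sorted() {
        byte[][] view = new byte[count][];
        for (int i = 0; i < count; i++) {
            view[i] = Arrays.copyOfRange(data, i * HASH_LEN, (i + 1) * HASH_LEN);
        }
        Arrays.sort(view, HashArray::compareUnsigned);
        return view;
    }

    private static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < HASH_LEN; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return 0;
    }
}
```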

Compressed Radix Trie (Patricia Trie?)

A compressed radix trie seemed to be the most promising data structure for my implementation. But a quick Google search turned up this link: https://news.ycombinator.com/item?id=101987 which says radix tries are not very memory efficient.

Quoting from the link:

Radix tries are nice. Use them when

...

(4) you don't care about memory usage that much.

I compared a simple 8-bit radix tree with some standard hash table implementation - the former took roughly ten times more memory. I then changed my radix to be based on 4 bits (each char is just split into 2 parts) and the memory usage improved twice. Now I'm wondering if radix tries have more room for improvement.
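To make the overhead concrete, an uncompressed 4-bit (nibble) trie node along the lines the quote describes might look like the sketch below; this is just to show where the memory goes, not a finished Patricia trie (which would additionally collapse single-child chains):

```java
// Sketch of a 4-bit trie node: each byte of the hash is split into two
// nibbles, so a node has up to 16 children. The memory cost the quote
// complains about comes from the per-node object header plus the 16-slot
// child reference array, paid at every level of the trie.
class NibbleTrieNode {
    NibbleTrieNode[] children;   // 16 slots, allocated lazily
    boolean isTerminal;          // a full 32-byte hash ends at this node

    void insert(byte[] hash, int nibbleIndex) {
        if (nibbleIndex == hash.length * 2) {
            isTerminal = true;
            return;
        }
        int b = hash[nibbleIndex / 2] & 0xFF;
        int nibble = (nibbleIndex % 2 == 0) ? (b >>> 4) : (b & 0x0F);
        if (children == null) {
            children = new NibbleTrieNode[16];
        }
        if (children[nibble] == null) {
            children[nibble] = new NibbleTrieNode();
        }
        children[nibble].insert(hash, nibbleIndex + 1);
    }
}
```

Even with lazy allocation, each node seems to cost on the order of a few dozen bytes of overhead, which is what pushes the trie well past the raw 32 bytes per hash.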

Hash Table?

I know hash tables don't support sorted reading the way radix tries do, but are they really that much more memory efficient (10 times better than radix tries)?
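For comparison, the straightforward hash-table version in Java would be something like the sketch below. Note that `byte[]` has identity-based equals/hashCode, so I would have to wrap each hash (e.g. in a `ByteBuffer`), which adds its own per-entry overhead:

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

// Straightforward hash-table storage. byte[] uses identity-based equals/hashCode,
// so each hash is wrapped in a ByteBuffer to get content-based equality.
// Per entry this costs the 32 hash bytes plus the byte[] header, the ByteBuffer
// wrapper, the HashMap node behind the HashSet, and a table slot -- and the
// iteration order is not sorted.
class HashTableStore {
    private final Set<ByteBuffer> hashes = new HashSet<>();

    void add(byte[] sha256) {
        hashes.add(ByteBuffer.wrap(sha256));
    }

    boolean contains(byte[] sha256) {
        return hashes.contains(ByteBuffer.wrap(sha256));
    }
}
```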


I still don't understand / am not convinced: is a compressed radix trie really not a memory-efficient data structure? If not, which data structure would best suit my needs?

If a radix trie is the best known option, is there an efficient algorithm for comparing two radix tries to find the hashes common to both?
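What I have in mind for the comparison is a merge-style intersection over the sorted output of the two structures; a rough sketch, assuming each structure can expose its hashes as a sorted `Iterator<byte[]>`:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Merge-style intersection: walk two sorted streams of 32-byte hashes in
// lockstep and keep the ones that appear in both. O(n + m) comparisons,
// no extra structure beyond the output list.
class SortedIntersection {
    static List<byte[]> common(Iterator<byte[]> a, Iterator<byte[]> b) {
        List<byte[]> result = new ArrayList<>();
        byte[] x = a.hasNext() ? a.next() : null;
        byte[] y = b.hasNext() ? b.next() : null;
        while (x != null && y != null) {
            int cmp = compareUnsigned(x, y);
            if (cmp == 0) {
                result.add(x);
                x = a.hasNext() ? a.next() : null;
                y = b.hasNext() ? b.next() : null;
            } else if (cmp < 0) {
                x = a.hasNext() ? a.next() : null;
            } else {
                y = b.hasNext() ? b.next() : null;
            }
        }
        return result;
    }

    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```

This would work the same whether the sorted stream comes from a trie traversal or from a sorted array.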


P.S.: I found the following similar questions on SO, but they didn't solve my problem:

Storing 1 million phone numbers: This didn't have much information (it was closed as "Not Constructive"), and the answers are about storing deltas of the phone numbers. But deltas would not be helpful for hashes, would they?

Most memory efficient way to store 8M+ sha256 hashes: This was about storing a key-value mapping, and the answers suggest using a database.

  • Where is `ArrayList`? Your issues with arrays seem relatively easily addressed. Additionally, arrays contain references, not the actual strings, so you're not actually storing two of every hash in memory at once. – Louis Wasserman May 13 '16 at 18:19
  • `ArrayList` is the same as arrays, only the implementation is abstracted; array copies still have to happen. And yes, only references are copied, but even that seems too huge to me: 200K references in the old array plus 300K in the new one seems a little too much compared to other structures. – Codebender May 13 '16 at 18:24
  • How about a [`TreeSet`](https://docs.oracle.com/javase/7/docs/api/java/util/TreeSet.html)? – shmosel May 13 '16 at 18:29
  • 1
    How frequently do you expect the array to be updated? A sha256 binary string is 32 bytes. 32*300,000 would be around 9.6m. Even if you consider 8 additional bytes for reference (64 bit) its 40*300,000=12m which does not seem much to me. Have you considered [In-place Radix Sort](https://en.wikipedia.org/wiki/Radix_sort#In-place_MSD_radix_sort_implementations) while using ArrayList (which is why asked how frequently you expected it to be updated)? – Ravindra HV May 13 '16 at 18:34
  • @Codebender that's not actually very much memory at all. – Louis Wasserman May 13 '16 at 18:51
  • I think RadixTrees (CritBit for binary data: https://cr.yp.to/critbit.html / https://github.com/tzaeschke/critbit) should be best. But they do have a bit of a memory overhead from having a lot of 'node' objects. With a 'trie' structure, that could be optimized by splitting into 4, 8, 16 or more sub-nodes in each node, which should reduce the memory overhead. – TilmannZ May 17 '16 at 12:59
