I need to store around 200,000 SHA256 hashes in binary form in memory.
My requirements are:
- The data structure should be as memory efficient as possible.
- I will be reading the hashes back in sorted order (insertion order is NOT important), so a data structure that supports lexicographical traversal is preferable.
- It would be a plus (though not mandatory) if two structures of the same type could be compared to find the hashes common to both.
Here are the data structures I have considered:
Arrays:
An array seems to be the simplest and most memory-efficient option, but I cannot use one because:
- I would have to sort the data when reading it; the array itself does not keep the hashes sorted.
- Since 200K hashes is not a hard limit and the count may grow beyond that, I won't know the size beforehand to allocate the array's length. This means I may sometimes need to resize the array by copying its entire contents into a new array (keeping both the old and the new one in memory at the same time); a sketch of this approach follows the list.
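For concreteness, here is roughly the array approach I am ruling out, as a minimal C sketch (the names and layout are mine, purely illustrative): a flat buffer of 32-byte records grown by doubling with `realloc`, then sorted with `qsort`/`memcmp` before reading back.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HASH_LEN 32  /* SHA-256 digest size in bytes */

typedef struct {
    uint8_t *data;   /* flat buffer holding count * HASH_LEN bytes */
    size_t   count;  /* number of hashes stored */
    size_t   cap;    /* capacity in hashes, not bytes */
} hash_array;

/* Append one hash, doubling the buffer when full. This is the
 * resize-by-copying cost mentioned above: realloc may briefly need
 * the old and new buffers in memory at the same time. */
static int hash_array_push(hash_array *a, const uint8_t hash[HASH_LEN])
{
    if (a->count == a->cap) {
        size_t new_cap = a->cap ? a->cap * 2 : 1024;
        uint8_t *p = realloc(a->data, new_cap * HASH_LEN);
        if (!p) return -1;
        a->data = p;
        a->cap = new_cap;
    }
    memcpy(a->data + a->count * HASH_LEN, hash, HASH_LEN);
    a->count++;
    return 0;
}

/* Compare two 32-byte records in lexicographic byte order. */
static int cmp_hash(const void *x, const void *y)
{
    return memcmp(x, y, HASH_LEN);
}

/* Sort once before reading the hashes back in order. */
static void hash_array_sort(hash_array *a)
{
    qsort(a->data, a->count, HASH_LEN, cmp_hash);
}
```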
Compressed Radix Trie (Patricia Trie?)
A compressed radix trie seemed to be the most promising data structure for my use case, but a quick Google search turned up this link: https://news.ycombinator.com/item?id=101987 which says radix tries are not very memory efficient.
Quoting from the link:
Radix tries are nice. Use them when
...
(4) you don't care about memory usage that much.
I compared a simple 8-bit radix tree with some standard hash table implementation - the former took roughly ten times more memory. I then changed my radix to be based on 4 bits (each char is just split into 2 parts) and the memory usage improved twice. Now I'm wondering if radix tries have more room for improvement.
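To make the quoted numbers concrete, here is my own back-of-the-envelope sketch (not the commenter's code) of why a naive 8-bit radix trie is so heavy: every node carries 256 child pointers, which is 2 KB per node on a 64-bit machine before any actual key material is stored.

```c
#include <stdint.h>

/* Naive 8-bit radix trie node: one child slot per possible byte value.
 * sizeof(struct node8) == 256 * 8 = 2048 bytes on a 64-bit machine,
 * against only 32 bytes of actual hash payload per entry, which is
 * where a ~10x blowup like the one quoted above can come from. */
struct node8 {
    struct node8 *child[256];
};

/* The 4-bit variant from the quote: 16 pointers = 128 bytes per node,
 * but twice as many levels per key, hence only a ~2x improvement. */
struct node4 {
    struct node4 *child[16];
};
```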
Hash Table?
I know hash tables don't support sorted reading the way radix tries do, but are they really that much more memory efficient (10 times better than radix tries)?
I still don't understand / am not convinced: is a compressed radix trie really not a memory-efficient data structure? If not, which data structure would best suit my needs?
If a radix trie is the best known option, is there an efficient algorithm for comparing two radix tries to find the common hashes in them? (A sketch of the kind of comparison I mean is below.)
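To clarify what I mean by comparing two structures, here is the operation I'm after, sketched over two already-sorted flat arrays rather than tries (illustrative code of my own): a linear two-pointer walk that reports every hash present in both inputs in O(n + m) comparisons.

```c
#include <stdint.h>
#include <string.h>

#define HASH_LEN 32  /* SHA-256 digest size in bytes */

/* Walk two sorted hash arrays in lockstep and invoke the callback
 * once for every hash that appears in both. 'na' and 'nb' are the
 * number of hashes (not bytes) in each array. */
static void intersect(const uint8_t *a, size_t na,
                      const uint8_t *b, size_t nb,
                      void (*on_common)(const uint8_t hash[HASH_LEN]))
{
    size_t i = 0, j = 0;
    while (i < na && j < nb) {
        int c = memcmp(a + i * HASH_LEN, b + j * HASH_LEN, HASH_LEN);
        if (c < 0) {
            i++;                          /* a's hash is smaller: advance a */
        } else if (c > 0) {
            j++;                          /* b's hash is smaller: advance b */
        } else {
            on_common(a + i * HASH_LEN);  /* present in both */
            i++;
            j++;
        }
    }
}
```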
P.S.: I found the following similar questions on SO, but they didn't solve my problem:
Storing 1 million phone numbers: This didn't have much information, was closed as "Not Constructive", and the answers are about storing deltas between the phone numbers. But deltas wouldn't be helpful for hashes, would they?
Most memory efficient way to store 8M+ sha256 hashes: This was about storing a key-value mapping, and the answers suggest using a database.