
I’m interested in using a radix tree (or Patricia trie) to store a hash/dict/array of strings -> values. However, I'm finding that I have too many strings to fit into memory.

I found an article by Algolia about how they solved this problem with their search index and they talk about doing what I’m trying to do: flushing a radix tree to disk as each branch is constructed and only reading back the branches you need.

However, they don’t mention how they do this. The only ways I can think of to store a radix tree are either as one full (serialized) object or as a flat hash/array in a simple key/value store.

For example, using a key/value store:

SET smile:  [...values...]
SET smiled: [...values...]
SET smiles:  [...values...]
SET smiling: [...values...]

Then doing a prefix scan to pull out the keys/values that match smil*. However, this loses the space-saving benefits of a radix tree, and it would still require reconstructing at least part of the radix tree on load.
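One alternative (a sketch of my own, not something from the Algolia article) is to store each radix-tree *node* as its own key/value record, keyed by the prefix that leads to it. A lookup then reads only the handful of nodes along one path, rather than prefix-scanning every key. Here a plain dict stands in for the disk-backed store (Redis, LevelDB, etc.), and all names are illustrative:

```python
# Sketch (assumption, not Algolia's actual on-disk format): one KV record
# per radix-tree node, keyed by the node's full prefix. Looking up a word
# costs one KV read per node on its path instead of a full prefix scan.

kv = {}  # stand-in for the disk-backed key/value store

def put_node(prefix, children, values=None):
    """Persist one node: its outgoing edge labels plus any stored values."""
    kv["node:" + prefix] = {"children": children, "values": values or []}

# Nodes for smile/smiled/smiles/smiling, sharing the "smil" branch.
put_node("", ["smil"])
put_node("smil", ["e", "ing"])
put_node("smile", ["d", "s"], values=["v-smile"])
put_node("smiled", [], values=["v-smiled"])
put_node("smiles", [], values=["v-smiles"])
put_node("smiling", [], values=["v-smiling"])

def lookup(word):
    """Walk edge labels down from the root, one KV read per node."""
    prefix = ""
    while True:
        node = kv.get("node:" + prefix)
        if node is None:
            return None
        if prefix == word:
            return node["values"]
        # Follow the single edge whose label extends the current prefix.
        for label in node["children"]:
            if word.startswith(prefix + label):
                prefix += label
                break
        else:
            return None

print(lookup("smiled"))  # ['v-smiled']
print(lookup("smirk"))   # None
```

This keeps the compressed-edge space savings (the shared "smil" edge is stored once) at the cost of one round-trip per edge on the path, which is the N-lookups trade-off discussed in the comments below.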

Xeoncross
  • Storing a list of pointers at each entry in the trie may be a solution. – Jason Sep 24 '19 at 20:29
  • If it helps, you might look at the prefixtrie files here: https://github.com/lbryio/lbrycrd/tree/master/src . It stores small values in RAM which can be used to look up the large counterparts in LevelDB. It also allows you to use a mmap allocator. – Brannon Oct 08 '19 at 15:42
  • @Jason if you mean pointers to another fork in the trie then that would require N lookups for every branch in the trie. I guess that could work, but it would be pretty slow. For `ask` (new query) `root -> a` (new query) `a -> s` (new query) `s -> k` I guess this is similar to graph database traversal. – Xeoncross Oct 08 '19 at 21:46

1 Answer


Why not hash the strings in a pre-processing step, store the mapping out-of-core and build the trie on the hashes instead? This should significantly reduce the memory load since only the hashes are left to consider.
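A minimal sketch of that pre-processing step (all names are illustrative, and a plain dict stands in for the out-of-core mapping): each string is replaced by a fixed-size hash, the hash-to-string mapping is kept on disk, and the trie is keyed by the hash digits.

```python
# Sketch of the suggested scheme: hash each string to a fixed size,
# keep the hash -> original-string map out-of-core, and build the
# trie over the hashes so only fixed-width keys stay in memory.
import hashlib

def h(s):
    """Fixed 8-byte hash; BLAKE2b here, but any stable hash would do."""
    return hashlib.blake2b(s.encode(), digest_size=8).hexdigest()

out_of_core = {}  # stand-in for an on-disk hash -> string map
trie = {}         # toy trie keyed by hash hex digits

def insert(word, value):
    key = h(word)
    out_of_core[key] = word
    node = trie
    for ch in key:
        node = node.setdefault(ch, {})
    node["$"] = value  # terminal marker holds the value

def get(word):
    node = trie
    for ch in h(word):
        node = node.get(ch)
        if node is None:
            return None
    return node.get("$")

insert("smile", "v-smile")
insert("smiling", "v-smiling")
print(get("smile"))  # v-smile
print(get("frown"))  # None
```

Note the trade-off: hashing discards the prefix structure, so a trie over hashes supports exact lookups but not smil*-style prefix scans.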

StarShine
  • When you say "hashes" are you talking about the 128-256 bit outputs of something like SHA-1/SHA-2, or do you mean something that reduces the terms down to a 32-64 bit number like CRC32 or Murmur? Because most words are only going to be about 64 bits anyway (8 characters). Not sure what the hashing gains you. – Xeoncross Oct 11 '19 at 20:13
  • Agreed, if the strings are dwords or qwords it's probably not worth the effort to hash. If the domain coverage is sparse or has repeated patterns of bits you could try compression, but otherwise an adapted algorithm might be a better bet. – StarShine Oct 15 '19 at 08:06