
I want to test, as fast as possible, whether a given string is a member of a set S of 1 billion strings:

  • each string is about 10 characters long on average, so the whole dataset is ~10 GB; the maximum length is 100

  • 99% of the strings use only a-z, A-Z, 0-9 plus a few special characters (.!-;\/); in rare cases there are other characters

  • if necessary, the data could be sorted:

    aaaaa
    ...
    helloworld
    hellu
    help
    ...
    zzzzz
    

    and we could maybe use a prefix tree / trie, but I'm not sure if it's the best solution here.

Question: Is there a way to work with an in-memory compressed set with Python, possibly with the standard library, with fast lookup?

The goal is to test 'hello' in S as fast as possible.

I know how to do this with 10 GB of RAM, but out of curiosity, I'd like to know if this is possible with, say, only 1 or 2 GB of RAM:

S = set()
with open('db.txt', 'r') as f:
    for line in f:
        S.add(line.rstrip('\n'))   # strip the newline, otherwise the lookup below never matches
print('hello' in S)
Basj
  • Depending on the nature of your data, maybe a radix tree? A specialized one like a HAT-trie. But you wouldn't be able to create that space-efficiently in Python, maybe with a C-extension – juanpa.arrivillaga Feb 09 '22 at 19:33
  • Sounds like a perfect application for a Bloom filter to eliminate most misses almost immediately. – Mark Ransom Feb 11 '22 at 21:07
  • Thanks @MarkRansom. Could you post a Python solution with such a Bloom filter? Would be interesting! (see the sketch after these comments) – Basj Feb 15 '22 at 18:00
  • I might give it a shot later, but I'm unlikely to do better than those who have already spent some time on it. A simple search for "python bloom filter" should turn up some good possibilities. – Mark Ransom Feb 15 '22 at 21:50
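
To make the Bloom filter suggestion from the comments concrete, here is a minimal standard-library sketch; the class name, sizing formula and hash scheme are illustrative choices, not taken from any particular package. Sized for one billion strings at a 1% false-positive rate it needs roughly 1.2 GB of bits; a negative answer is definitive, a positive answer still has to be confirmed against the real data.

# Minimal Bloom filter sketch (illustrative, standard library only).
# A hit may be a false positive and must be confirmed against the real data;
# a miss is definitive.
import hashlib
import math

class BloomFilter:
    def __init__(self, n_items, fp_rate=0.01):
        # m bits and k hash functions for the target false-positive rate
        self.m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, s):
        # Double hashing: derive k bit positions from one 128-bit digest
        d = hashlib.blake2b(s.encode('utf-8'), digest_size=16).digest()
        h1 = int.from_bytes(d[:8], 'little')
        h2 = int.from_bytes(d[8:], 'little') | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, s):
        for p in self._positions(s):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, s):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(s))

# Usage: populate once from db.txt, then test membership
# bf = BloomFilter(10**9, 0.01)          # ~1.2 GB of bits
# with open('db.txt') as f:
#     for line in f:
#         bf.add(line.rstrip('\n'))
# print('hello' in bf)                   # False means definitely absent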

2 Answers


You should be able to compress your strings and work with the compressed version. However, just how much memory you save will depend on how compressible your dataset is.

I would suggest a custom compression system: build a single compression dictionary based on your entire string corpus, optimized to minimize the total size of dictionary + compressed string storage.

I would use a hash table to look up your compressed strings, not a tree. Each hash bucket can just be a bunch of compressed data, scanned sequentially from the beginning. For buckets that happen to have a lot of long strings, provide an overflow pointer to the rest of the data; the in-table bucket size is a performance parameter -- or you can slide that size all the way to 0 for a fully indirect table.

The sequential scan sounds problematic from a performance point of view, but both the initial hash lookup and the sequential scan are fast on current hardware. The number of hash buckets is what (statistically) limits the sequential scan length; it is another performance parameter.


To implement this in Python, I'd do the hash indexing manually, and store the hash table as an array of bytestrings. I think this will effectively be the indirect table mentioned above, which might not be as fast as possible but it should be pretty good.

# Python-like pseudocode for building the table
from io import BytesIO

compression_dict = compressor.train(strings)           # shared dictionary built from the whole corpus
hash_table = [BytesIO() for _ in range(table_size)]    # one growable byte buffer per bucket
for s in strings:
    # append the compressed record to its bucket; lookups scan the bucket sequentially
    hash_table[hash(s) % table_size].write(compression_dict.compress(s))
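
Building on that pseudocode, here is a standard-library sketch of both the build and the lookup side. Everything concrete in it is an assumption: zlib with a preset dictionary stands in for the trained compressor, each record gets a 2-byte length prefix so a bucket can be scanned, and a probe is matched by compressing it with the same settings and comparing bytes (deterministic for a fixed zlib version, settings and dictionary).

# Sketch: hash table of compressed buckets (standard library only).
# zlib with a preset dictionary stands in for the trained compressor; each
# record is framed with a 2-byte length prefix so a bucket can be scanned.
# Note: the built-in hash() is randomized per process, so a table persisted
# to disk would need a stable hash function instead.
import zlib

ZDICT = b'hello world help tion ing the'   # placeholder; a real preset dictionary comes from the corpus
TABLE_SIZE = 1 << 20                        # number of buckets: a performance parameter

def compress(s):
    c = zlib.compressobj(zdict=ZDICT)
    return c.compress(s.encode('utf-8')) + c.flush()

def build(strings):
    table = [bytearray() for _ in range(TABLE_SIZE)]
    for s in strings:
        rec = compress(s)
        bucket = table[hash(s) % TABLE_SIZE]
        bucket += len(rec).to_bytes(2, 'little')   # length prefix
        bucket += rec
    return [bytes(b) for b in table]               # freeze buckets into compact bytes

def contains(table, s):
    probe = compress(s)                 # deterministic for fixed settings + dictionary
    bucket = table[hash(s) % TABLE_SIZE]
    i = 0
    while i < len(bucket):
        n = int.from_bytes(bucket[i:i + 2], 'little')
        if bucket[i + 2:i + 2 + n] == probe:
            return True
        i += 2 + n
    return False

# table = build(['helloworld', 'hellu', 'help'])
# print(contains(table, 'help'), contains(table, 'nope'))

Note that per-string zlib streams add several bytes of framing overhead, so very short strings may not shrink much; that is exactly why a purpose-built dictionary coder, or zstd's dictionary mode (see the sketch further down), is attractive here.
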
comingstorm
  • Thank you for your answer. Are there tools available in Python (possibly in the standard library) that can compress individual strings with a dictionary built from the whole text corpus? Something like: `d = build_dict(whole_string)`, `s = compress('hello', according_to_compression_dict=d)` (pseudo-code; see the sketch after these comments) – Basj Feb 10 '22 at 07:50
  • For your second paragraph, could you explain the method in a few lines of pseudo-code? (I will try to implement this) – Basj Feb 10 '22 at 07:50
  • A brief Google search turned up https://stackoverflow.com/questions/56189234/compression-of-short-strings (I will add some pseudocode per your request) – comingstorm Feb 10 '22 at 19:41
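
On the question of building a dictionary from the whole corpus: the standard library can use a preset dictionary (zlib's zdict) but has nothing to train one; the third-party zstandard package does. The sketch below uses its train_dictionary / ZstdCompressor / ZstdDecompressor calls; the sample size and dictionary size are illustrative values to tune, not figures from this thread.

# Illustrative sketch using the third-party `zstandard` package: train a shared
# dictionary on a sample of the corpus, then compress/decompress single strings.
# Sizes and the sampling strategy are assumptions.
import itertools
import zstandard as zstd

# Take a sample of lines from the corpus (training needs a fair amount of data).
with open('db.txt', 'r', encoding='utf-8') as f:
    samples = [line.rstrip('\n').encode('utf-8')
               for line in itertools.islice(f, 1_000_000)]

dict_data = zstd.train_dictionary(100_000, samples)    # ~100 KB shared dictionary

cctx = zstd.ZstdCompressor(dict_data=dict_data)
dctx = zstd.ZstdDecompressor(dict_data=dict_data)

blob = cctx.compress('hello'.encode('utf-8'))           # compress one short string
print(dctx.decompress(blob).decode('utf-8'))            # -> 'hello'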

No.

Unless there is some massive redundancy in the data you're not telling us about, you can expect to compress the 10 GB down to maybe 6 or 7 GB, even sorted. A data structure like the one you suggest would make it massively larger, not smaller.

As suggested in the other answer, you can make lookups fast with a hashing approach, but you cannot reduce the storage by a factor of five to ten.

Mark Adler
  • Thank you for your answer. Most strings are combinations of common English words; I noticed that ZIP compresses the whole text file from 10 GB to 1.5 GB, which is quite good. Is there a method / data structure that allows us to work and do the lookup with the compressed data? (see the sketch below) – Basj Feb 10 '22 at 07:54
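
Following up on the ZIP observation: one standard-library way to query compressed data directly is to keep the strings sorted, compress them in fixed-size blocks with zlib, and hold an in-memory index of each block's first string; a lookup then binary-searches the index and decompresses exactly one block. The block size and layout below are illustrative choices, not something from this thread.

# Sketch: sorted strings, zlib-compressed blocks, in-memory index of block starts.
# RAM holds only the compressed blocks plus a small index; each lookup
# decompresses a single block.
import bisect
import zlib

BLOCK = 4096   # strings per block (illustrative)

def build(sorted_strings):
    index, blocks = [], []
    for i in range(0, len(sorted_strings), BLOCK):
        chunk = sorted_strings[i:i + BLOCK]
        index.append(chunk[0])                                # first key of the block
        blocks.append(zlib.compress('\n'.join(chunk).encode('utf-8'), 9))
    return index, blocks

def contains(index, blocks, s):
    # Find the block whose first key is <= s, then search just that block.
    i = bisect.bisect_right(index, s) - 1
    if i < 0:
        return False
    chunk = zlib.decompress(blocks[i]).decode('utf-8').split('\n')
    j = bisect.bisect_left(chunk, s)
    return j < len(chunk) and chunk[j] == s

# index, blocks = build(sorted(line.rstrip('\n') for line in open('db.txt')))
# print(contains(index, blocks, 'hello'))

Larger blocks compress better (closer to the 1.5 GB ZIP figure) but cost more CPU per lookup; smaller blocks answer faster but compress less, so the block size is the knob that trades the 1-2 GB RAM budget against lookup speed.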