I'm looking for a set-like data structure in Python that allows fast lookups (O(1), as for a set), for 100 million short strings (or byte-strings) of length ~10.
With 10M strings, this already takes 750 MB of RAM on Python 3.7 or 3.10.2 (or 900 MB if we replace the byte-strings with str):
S = set(b"a%09i" % i for i in range(10_000_000)) # { b"a000000000", b"a000000001", ... }
whereas the "real data" here is only 10 bytes * 10M ≈ 100 MB. So there is a ~7.5x memory-consumption factor because of the set structure, pointers, buckets, etc. (for a study of this in the case of a list, see the answers to Memory usage of a list of millions of strings in Python).
When working with "short" strings, the pointers to the strings (8 bytes each on a 64-bit build) held in the internal structure are probably already responsible for close to a 2x factor on their own, and the per-object headers plus the bucket structure of the hash table account for the rest.
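For reference, here is roughly how the footprint can be inspected (tracemalloc only counts Python-level allocations, and the getsizeof figures in the comments are what a 64-bit CPython gives; they may differ on other builds):

import sys, tracemalloc

tracemalloc.start()
S = set(b"a%09i" % i for i in range(10_000_000))
print(tracemalloc.get_traced_memory()[0] / 1e6, "MB")  # same order of magnitude as the ~750 MB of RSS

print(sys.getsizeof(b"a000000001"))  # 43: ~33 bytes of bytes-object header + 10 bytes of payload
print(sys.getsizeof(S))              # the hash table alone, 16 bytes per slot (hash + key pointer),
                                     # not counting the 10M bytes objects it points to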
Are there any "short string optimization" techniques that would give a memory-efficient set of short byte-strings in Python (or any other structure allowing a fast lookup/membership test)?
Maybe a structure without pointers to the strings, but rather one storing the strings directly in the data structure when the string length is <= 16 characters, etc.?
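To make the idea concrete, here is a toy sketch of what I have in mind (my own simplified code, not an existing library): a fixed-capacity open-addressing hash table that keeps the 10-byte keys inline in one flat bytearray, so there are no per-string objects or pointers, at the price of no resizing and keys of a fixed width:

class FlatByteSet:
    """Toy inline-storage hash set: fixed capacity, fixed key width, no resizing;
    the all-zero key is reserved as the empty-slot marker."""
    def __init__(self, capacity, width=10):
        self.width = width
        self.capacity = capacity                    # must stay well above the number of keys
        self.table = bytearray(capacity * width)    # an all-zero slot means "empty"
        self.empty = bytes(width)

    def _probe(self, key):
        w, i = self.width, hash(key) % self.capacity
        while True:
            slot = bytes(self.table[i * w:(i + 1) * w])
            if slot == key or slot == self.empty:
                return i, slot
            i = (i + 1) % self.capacity             # linear probing

    def add(self, key):
        assert len(key) == self.width and key != self.empty
        i, slot = self._probe(key)
        if slot == self.empty:
            self.table[i * self.width:(i + 1) * self.width] = key

    def __contains__(self, key):
        return self._probe(key)[1] == key

S = FlatByteSet(capacity=20_000_000)            # ~200 MB of table for 10M keys at 50% load
for i in range(10_000_000):                     # slow in pure Python, just to illustrate the layout
    S.add(b"a%09i" % i)
print(b"a000012345" in S, b"b000000000" in S)   # True False

(This sketch trades flexibility for memory: roughly 20 bytes per key instead of ~75, but it is slow in pure Python and cannot resize, hence the question about existing techniques or libraries.)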
Or would using bisect or a sorted list help (a lookup in O(log n) would be acceptable), while keeping the memory usage small (i.e. much less than the 7.5x factor of a set)?
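For instance, would something along these lines be a reasonable direction: all keys concatenated and sorted in one ~100 MB bytes blob, with bisect doing a binary search over fixed-width records (Records, WIDTH and contains are just names for the sketch)?

import bisect

WIDTH = 10

class Records:
    """Read-only view of a flat bytes blob as a sorted sequence of fixed-width records."""
    def __init__(self, blob, width=WIDTH):
        self.blob, self.width = blob, width
    def __len__(self):
        return len(self.blob) // self.width
    def __getitem__(self, i):
        return self.blob[i * self.width:(i + 1) * self.width]

# final blob is ~100 MB, although the temporary sorted list still peaks much higher while building
blob = b"".join(sorted(b"a%09i" % i for i in range(10_000_000)))
records = Records(blob)

def contains(key):
    i = bisect.bisect_left(records, key)        # O(log n) membership test, no per-string objects kept
    return i < len(records) and records[i] == key

print(contains(b"a000012345"), contains(b"b000000000"))   # True False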