I'd like to build a set S of tens of millions of short alphanumeric strings (length ~10 on average, max 100). Goal: quickly look up whether a given string s is in S.

A set/hashtable seems attractive because it offers fast O(1) lookup.
However, I know that a set takes much more RAM than the raw data (because of the internal hash table and its buckets). In the following test, all 5-character strings aaaaa, aaaab, ..., zzzzz take 57 MB as raw content, but about 170 MB as a Python set (when pickled):
import string, pickle
from itertools import product
n = 5
L = [''.join(p) for p in product(string.ascii_lowercase, repeat=n)] # aaaaa, aaaab, aaaac, ..., zzzzz
L = set(L)
size1 = len(L) * n             # raw content: n bytes per string
size2 = len(pickle.dumps(L))   # serialized size of the set, as a rough proxy for its footprint
print("Data: %.2f MB" % (size1 / 1024**2))
print("Size of set: %.2f MB" % (size2 / 1024**2))
# Data: 56.65 MB
# Size of set: 169.99 MB
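
For what it's worth, a rough cross-check of the actual in-memory footprint (as opposed to the pickled size) can be done with sys.getsizeof; this is only an approximation, since it counts the set's internal table and the per-string object headers separately:

import sys, string
from itertools import product

n = 5
words = [''.join(p) for p in product(string.ascii_lowercase, repeat=n)]
s = set(words)

# sys.getsizeof(s) measures only the set's internal table (hash/pointer slots),
# not the string objects it points to
table_mb = sys.getsizeof(s) / 1024**2
# each Python str adds an object header on top of its n characters
strings_mb = sum(sys.getsizeof(w) for w in words) / 1024**2

print("Set table: %.2f MB" % table_mb)
print("String objects: %.2f MB" % strings_mb)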
Are there other data structures available in Python that allow fast lookup (maybe not as fast as a set, which is probably optimal) without the burden of a roughly 1:3 ratio between the raw content size in bytes and the data structure size in bytes?

TL;DR: I'm looking for something in between the raw bytes of the content (optimal in space, but very slow to search) and a set/hashtable (optimal for search speed, but memory-hungry).
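
To make that concrete, here is a minimal sketch of the kind of middle ground I have in mind (my own illustration, not an existing library): pack the strings, padded to a fixed width, into one sorted bytes blob and test membership by binary search. Storage stays close to the raw content (plus padding), and lookups are O(log n) slices instead of O(1) hashing:

class PackedSortedStrings:
    # Hypothetical helper, for illustration only: fixed-width records packed
    # into one contiguous sorted bytes object, searched with binary search.
    def __init__(self, strings, width):
        self.width = width
        records = sorted(s.encode('ascii').ljust(width, b'\0') for s in strings)
        self.count = len(records)
        self.blob = b''.join(records)  # ~width bytes per string, no per-object overhead

    def _record(self, i):
        return self.blob[i * self.width:(i + 1) * self.width]

    def __contains__(self, s):
        key = s.encode('ascii').ljust(self.width, b'\0')
        lo, hi = 0, self.count
        while lo < hi:                 # classic binary search over the blob
            mid = (lo + hi) // 2
            rec = self._record(mid)
            if rec < key:
                lo = mid + 1
            elif rec > key:
                hi = mid
            else:
                return True
        return False

With an average length of 10 but a max of 100, padding every record to 100 bytes wastes space, so in practice I'd probably bucket strings by length (one blob per length) or use some trie-like structure; the sketch is only meant to pin down the space/speed trade-off I'm after.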