
I'd like to build a set S of tens of millions of short alphanumeric strings (length ~ 10 on average, max 100). Goal: quickly look up whether a given string s is in S.

A set/hashtable seems interesting because it allows fast O(1) lookups. However, I know that a set takes much more RAM than the actual data (because of the internal hashtable/buckets). In the following test, I see that all 5-character strings aaaaa, aaaab, ..., zzzzz take 57 MB as raw content, but 170 MB as a Python set (when pickled):

import string, pickle
from itertools import product

n = 5
L = [''.join(p) for p in product(string.ascii_lowercase, repeat=n)]  # aaaaa, aaaab, aaaac, ..., zzzzz
L = set(L)
size1 = len(L) * n             # raw content: number of strings * characters per string
size2 = len(pickle.dumps(L))   # size of the pickled set
print("Data: %.2f MB" % (size1 / 1024**2))
print("Size of set: %.2f MB" % (size2 / 1024**2))
# Data: 56.65 MB
# Size of set: 169.99 MB

Are there other data structures (available in Python) that allow fast lookups (maybe not quite as fast as a set, which is probably optimal) without the burden of a roughly 1:3 ratio between the size of the actual content in bytes and the size of the data structure in bytes?

TL;DR: I'd like to see whether a solution exists somewhere between just the raw bytes of the content (optimal in space, but very slow to search) and a set/hashtable (optimal for search speed, but memory-hungry).
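To make the cheap end of that spectrum concrete, here is a minimal sketch of the raw-bytes approach (reusing L from the snippet above, with \n as separator so substring matches don't count as hits); it is close to the ideal size but far too slow to search:

blob = b"\n" + "\n".join(L).encode("ascii") + b"\n"  # ~ len(L) * (n+1) bytes

def contains_slow(s):
    # linear scan over the whole blob on every lookup: compact, but very slow
    return (b"\n" + s.encode("ascii") + b"\n") in blob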

Basj
  • https://en.wikipedia.org/wiki/Trie would be my first impulse here. You'd probably need a memory-efficient, non-object-oriented implementation, because otherwise each node object is probably larger than the string it stores. – timgeb Feb 10 '22 at 14:11
  • Yes, this is the problem @timgeb: I've already used a trie in the past, but here I fear the objects will take a lot of memory for themselves. – Basj Feb 10 '22 at 14:15
  • When I run your code, the inflation isn't nearly as drastic: goes from 56.65MB to 90.68MB. – Scott Hunter Feb 10 '22 at 14:20
  • @ScottHunter I'm on Python 3.7.6 Windows 64-bit, and you? Maybe it has been improved since 3.7. – Basj Feb 10 '22 at 14:32
  • Would a sorted list with O(log(n)) lookup via [bisection methods](https://docs.python.org/3/library/bisect.html) be fast enough? (rough sketch after the comments) – timgeb Feb 10 '22 at 14:33
  • I used version 3.8.10, so I guess it has. – Scott Hunter Feb 10 '22 at 14:36
  • @timgeb Strangely, `L = [...]` `M = sorted(L)`, then `len(pickle.dumps(M))` gives exactly the same size: `169.99 MB`. Is it a coincidence that a sorted list and a set have exactly the same size? – Basj Feb 10 '22 at 14:38
  • `size1 = len(L) * n` is not a fair indicator of the proportion taken up by the hashtable. Compare `sys.getsizeof(L)` to `sys.getsizeof('aaaaa') * len(L)` – Mad Physicist Feb 10 '22 at 14:38
  • @MadPhysicist The latter would be 611 MB; the idea is that I wanted to compare to the ideal situation. Example: all strings are stored in a .txt file (here ASCII or UTF-8 won't change anything; it will be 1 byte per char for these alphanumeric strings), so it would take `len(L) * (n+1)`. The +1 is for a separator, e.g. `\n`. – Basj Feb 10 '22 at 14:43
  • @Basj. But in the real world, the data structure of a string has some non-zero overhead: even in C it's one byte. – Mad Physicist Feb 10 '22 at 14:46
  • Python dicts use aggressive data sharing so it isn’t as straightforward to compute their size in memory (they also don’t size linearly with the size of the input data!). In a bit of good news for OP, modern Python dicts are also really memory efficient. — That said, for tens of millions of strings this might not be good enough, and other data structures (compressed or probabilistic indices …) might be indicated. – Konrad Rudolph Feb 10 '22 at 14:58
  • The size of a pickle is a very poor indication of the size of an object in memory. – Davis Herring Feb 10 '22 at 16:47
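For reference, a rough (unbenchmarked) sketch of the sorted-list idea from the comments, reusing L from the question: lookups become O(log n) via binary search, and there is no hashtable on top of the string objects themselves.

import bisect

words = sorted(L)  # plain sorted list instead of a set

def contains(s):
    i = bisect.bisect_left(words, s)
    return i < len(words) and words[i] == s

print(contains("hello"), contains("Hello"))  # True False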

0 Answers