
I'd like to build a set S of tens of millions of short alphanumeric strings (length ~ 10 on average, max 100). Goal: quickly look up whether a given string s is in S.

A set/hashtable seems interesting because it allows fast O(1) lookups. However, I know that a set takes much more RAM than the actual data (because of the internal hashtable/buckets). In the following test, I see that all 5-character strings aaaaa, aaaab, ..., zzzzz take 57 MB as raw content, but 170 MB as a Python set (when pickled):

import string, pickle
from itertools import product

n = 5
L = [''.join(p) for p in product(string.ascii_lowercase, repeat=n)]  # aaaaa, aaaab, aaaac, ..., zzzzz
L = set(L)
size1 = len(L) * n             # raw content: number of strings * characters per string
size2 = len(pickle.dumps(L))   # size of the pickled set
print("Data: %.2f MB" % (size1 / 1024**2))
print("Size of set: %.2f MB" % (size2 / 1024**2))
# Data: 56.65 MB
# Size of set: 169.99 MB

Are there other data structures (available in Python) that allow fast lookups (maybe not quite as fast as a set, which is probably optimal) without the burden of a roughly 1:3 ratio between the size of the actual content in bytes and the size of the data structure in bytes?

TL;DR: I'd like to see whether a solution exists somewhere between just the raw bytes of the content (optimal in space, but very slow to search) and a set/hashtable (optimal for search speed, but memory-hungry).
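To make the cheap end of that spectrum concrete, here is a minimal sketch of the raw-bytes approach (reusing L from the snippet above, with \n as separator so substring matches don't count as hits); it is close to the ideal size but far too slow to search:

blob = b"\n" + "\n".join(L).encode("ascii") + b"\n"  # ~ len(L) * (n+1) bytes

def contains_slow(s):
    # linear scan over the whole blob on every lookup: compact, but very slow
    return (b"\n" + s.encode("ascii") + b"\n") in blob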

Basj
  • https://en.wikipedia.org/wiki/Trie would be my first impulse here. You'd probably need a memory-efficient, non-object-oriented implementation, because otherwise each node object is probably larger than the string it stores. – timgeb Feb 10 '22 at 14:11
  • Yes, this is the problem @timgeb: I've already used a trie in the past, but here I fear the objects will take a lot of memory for themselves. – Basj Feb 10 '22 at 14:15
  • When I run your code, the inflation isn't nearly as drastic: goes from 56.65MB to 90.68MB. – Scott Hunter Feb 10 '22 at 14:20
  • @ScottHunter I'm on Python 3.7.6 Windows 64-bit, and you? Maybe it has been improved since 3.7. – Basj Feb 10 '22 at 14:32
  • Would a sorted list with O(log(n)) lookup via [bisection methods](https://docs.python.org/3/library/bisect.html) be fast enough? (rough sketch after the comments) – timgeb Feb 10 '22 at 14:33
  • I used version 3.8.10, so I guess it has. – Scott Hunter Feb 10 '22 at 14:36
  • @timgeb Strangely, `L = [...]` `M = sorted(L)`, then `len(pickle.dumps(M))` gives exactly the same size: `169.99 MB`. Is it a coincidence that a sorted list and a set have exactly the same size? – Basj Feb 10 '22 at 14:38
  • `size1 = len(L) * n` is not a fair indicator of the proportion taken up by the hashtable. Compare `sys.getsizeof(L)` to `sys.getsizeof('aaaaa') * len(L)` – Mad Physicist Feb 10 '22 at 14:38
  • @MadPhysicist The latter would be 611 MB; the idea is that I wanted to compare to the ideal situation. Example: all strings are stored in a .txt file (here ASCII or UTF-8 won't change anything; it will be 1 byte per char for these alphanumeric strings), so it would take `len(L) * (n+1)`. The +1 is for a separator, e.g. `\n`. – Basj Feb 10 '22 at 14:43
  • @Basj. But in the real world, the data structure of a string has some non-zero overhead: even in C it's one byte. – Mad Physicist Feb 10 '22 at 14:46
  • Python dicts use aggressive data sharing so it isn’t as straightforward to compute their size in memory (they also don’t size linearly with the size of the input data!). In a bit of good news for OP, modern Python dicts are also really memory efficient. — That said, for tens of millions of strings this might not be good enough, and other data structures (compressed or probabilistic indices …) might be indicated. – Konrad Rudolph Feb 10 '22 at 14:58
  • The size of a pickle is a very poor indication of the size of an object in memory. – Davis Herring Feb 10 '22 at 16:47
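For reference, a rough (unbenchmarked) sketch of the sorted-list idea from the comments, reusing L from the question: lookups become O(log n) via binary search, and there is no hashtable on top of the string objects themselves.

import bisect

words = sorted(L)  # plain sorted list instead of a set

def contains(s):
    i = bisect.bisect_left(words, s)
    return i < len(words) and words[i] == s

print(contains("hello"), contains("Hello"))  # True False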

0 Answers