
I'm looking for a set-like data structure in Python that allows fast lookup (O(1), as for sets) for 100 million short strings (or byte-strings) of length ~10.

With 10M strings, this already takes 750 MB of RAM on Python 3.7 or 3.10.2 (or 900 MB if we replace the byte-strings with strings):

S = set(b"a%09i" % i for i in range(10_000_000))  # { b"a000000000", b"a000000001", ... }

whereas the "real data" here is 10 bytes * 10M ≈ 100 MB. So there is a 7.5x memory consumption factor because of the set structure, pointers, buckets, etc. (for a study of this in the case of a list, see the answers to Memory usage of a list of millions of strings in Python).

When working with "short" strings, the pointers to the strings (probably 64 bits = 8 bytes each) in the internal structure are probably already responsible for a 2x factor, and the bucket structure of the hash table adds to that as well.
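A rough way to see where the memory goes (sys.getsizeof does not count allocator overhead, so this is only an estimate):

    import sys

    S = set(b"a%09i" % i for i in range(10_000_000))

    table_mb = sys.getsizeof(S) / 1e6                     # the set's internal hash table
    strings_mb = sum(sys.getsizeof(s) for s in S) / 1e6   # per-object size of each byte-string
    print(table_mb, "MB for the table,", strings_mb, "MB for the string objects")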

Are there any "short string optimization" techniques that would give a memory-efficient set of short byte-strings in Python? (or any other structure allowing a fast lookup/membership test)

Maybe one without pointers to the strings, but rather storing the strings directly in the data structure if the string length is <= 16 characters, etc.

Or would using bisect or a sorted list help (lookup in O(log n) might be OK), while keeping memory usage small? (smaller than the 7.5x factor of a set)

Basj
    It's worth trying out a database (e.g. `sqlite3`), just to see how they fare, and also things like key-value stores - this thread sounds interesting. https://stackoverflow.com/questions/47233562/key-value-store-in-python-for-possibly-100-gb-of-data-without-client-server – Tomalak Feb 22 '22 at 11:17

2 Answers


Here are the methods I have tested so far, thanks to the comments, which all seem to work.

Sorted list + bisection search (+ bloom filter)

  • Insert everything in a standard list L, in sorted order. This takes a lot less memory than a set.

  • (optional) Create a Bloom filter; here is a very small piece of code to do it (an illustrative sketch is also given right after this list).

  • (optional) First test membership with Bloom filter (fast).

  • Check that it really is a match (and not a false positive) with the fast in_sorted_list() from this answer, which uses bisect and is much faster than a standard lookup b"hello" in L (a sketch is given at the end of this section).
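For illustration, here is one possible minimal Bloom filter (only a sketch, not the code linked above; the sizing rule in the comment is the usual Bloom filter formula):

    import hashlib

    class BloomFilter:
        # Probabilistic membership test: no false negatives, some false positives
        def __init__(self, num_bits, num_hashes):
            self.num_bits = num_bits
            self.num_hashes = num_hashes              # at most 8 with a 32-byte digest
            self.bits = bytearray((num_bits + 7) // 8)

        def _positions(self, item):
            # Derive num_hashes bit positions from a single SHA-256 digest
            digest = hashlib.sha256(item).digest()
            for i in range(self.num_hashes):
                yield int.from_bytes(digest[4 * i: 4 * i + 4], "little") % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    # Sizing rule of thumb: num_bits ≈ -n*ln(p)/ln(2)², num_hashes ≈ (num_bits/n)*ln(2)
    bf = BloomFilter(num_bits=10 * 10_000_000, num_hashes=7)   # ~12.5 MB for 10M items
    bf.add(b"a000004567")
    print(b"a000004567" in bf)   # True
    print(b"b000000000" in bf)   # False (with high probability)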

If the bisection search is fast enough, we can even bypass the bloom filter (steps 2 and 3). It will be O(log n).

In my test with 100M strings, even without the Bloom filter, the lookup took 2 µs on average.
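A minimal sketch of the bisect-based membership test (the in_sorted_list name follows the linked answer; the data is the example from the question):

    from bisect import bisect_left

    def in_sorted_list(L, x):
        # O(log n) membership test on a sorted list
        i = bisect_left(L, x)
        return i != len(L) and L[i] == x

    # The strings are generated here in increasing order, so the list is already sorted
    L = [b"a%09i" % i for i in range(10_000_000)]

    print(in_sorted_list(L, b"a000004567"))   # True
    print(in_sorted_list(L, b"b000000000"))   # False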

Sqlite3

As suggested by @Tomalak's comment, inserting all the data into a Sqlite3 database works very well. Querying whether a string exists in the database took 50 µs on average on my 8 GB database, even without any index.

Adding an index made the DB grow to 11 GB, but then the queries were still done in ~50 µs on average, so no gain here.

Edit: as mentioned in a comment, using CREATE TABLE t(s TEXT PRIMARY KEY) WITHOUT ROWID; even made the DB smaller: 3.3 GB, and the queries are still done in ~50 µs on average. Sqlite3 is (as always) really amazing.
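A minimal sketch of this setup (the strings.db file name and the example data are just for illustration):

    import sqlite3

    con = sqlite3.connect("strings.db")
    con.execute("CREATE TABLE IF NOT EXISTS t(s TEXT PRIMARY KEY) WITHOUT ROWID")
    # Bulk insert; this takes a while for tens of millions of rows
    con.executemany("INSERT OR IGNORE INTO t VALUES (?)",
                    (("a%09i" % i,) for i in range(10_000_000)))
    con.commit()

    def in_db(s):
        # The PRIMARY KEY of a WITHOUT ROWID table acts as a clustered index
        return con.execute("SELECT 1 FROM t WHERE s = ?", (s,)).fetchone() is not None

    print(in_db("a000004567"))   # True
    print(in_db("b000000000"))   # False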

In this case, it's even possible to load it totally in RAM with the method from How to load existing db file to memory in Python sqlite3?, and then it's ~9 µs per query!
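A sketch of the in-memory loading step, assuming Python 3.7+ (which provides Connection.backup) and the strings.db / t table from the sketch above:

    import sqlite3

    disk = sqlite3.connect("strings.db")
    mem = sqlite3.connect(":memory:")
    disk.backup(mem)                      # copy the whole database into RAM (Python 3.7+)
    disk.close()

    row = mem.execute("SELECT 1 FROM t WHERE s = ?", ("a000004567",)).fetchone()
    print(row is not None)                # True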

Bisection in file with sorted lines

This works, with very fast queries (~35 µs per query), without loading the file into memory! See Bisection search in the sorted lines of an opened file (not loaded in memory).
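A sketch of the idea (the linked answer's implementation may differ in details); it assumes the file has non-empty, \n-terminated lines sorted in ascending byte order:

    import os

    def in_sorted_file(path, target):
        # Binary search over byte offsets; only a couple of lines are read per probe
        with open(path, "rb") as f:
            lo, hi = 0, os.path.getsize(path)
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid)
                if mid > 0:
                    f.readline()                  # skip the (possibly partial) current line
                line = f.readline().rstrip(b"\n")
                if not line:                      # past the last line: search the left half
                    hi = mid
                elif line == target:
                    return True
                elif line < target:
                    lo = mid + 1
                else:
                    hi = mid
            return False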

Dict with prefixes as keys and concatenation of suffixes as values

This is the solution described here: Set of 10-char strings in Python is 10 times bigger in RAM as expected.

The idea is: we have a dict D and, for a given word,

prefix, suffix = word[:4], word[4:]
D[prefix] += suffix + b' '   # with D = defaultdict(bytes), so a missing prefix starts as b''

With this method, the RAM used is even smaller than the actual data (I tested with 30M strings of average length 14, and it used 349 MB), the queries seem very fast (2 µs), but the initial creation time of the dict is a bit high.

I also tried using a list of suffixes as the dict values, but that consumes much more RAM.
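A sketch of the whole approach as described above (the member() helper name is illustrative; bytearray values keep the appends cheap):

    from collections import defaultdict

    D = defaultdict(bytearray)         # prefix -> concatenation of space-terminated suffixes

    for i in range(10_000_000):
        word = b"a%09i" % i
        prefix, suffix = word[:4], word[4:]
        D[prefix] += suffix + b" "     # in-place extend on the bytearray

    def member(word):
        prefix, suffix = word[:4], word[4:]
        # With fixed-length suffixes the trailing space cannot match across suffix
        # boundaries; with variable-length suffixes, split on b" " instead
        return suffix + b" " in D.get(prefix, b"")

    print(member(b"a000004567"))   # True
    print(member(b"b000000000"))   # False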

Basj
    Regarding the SQLite DB. You should see an improvement if you define the table as a clustered index (`CREATE TABLE strings (string TEXT PRIMARY KEY) WITHOUT ROWID;` - [docs](https://www.sqlite.org/withoutrowid.html)). This should not increase the DB size and (probably) be even faster. In fact it should result in a smaller DB size. – Tomalak Feb 22 '22 at 15:12

Maybe without pointers to strings, but rather storing the strings directly in the data structure if string length <= 16 characters, etc.

While it is not a set data structure but rather a list, I think pyarrow has quite an optimized way of storing a large number of small strings. There is a pandas integration as well, which should make it easy to try out: https://pythonspeed.com/articles/pandas-string-dtype-memory/
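A small sketch of how this could be tried (assuming pandas >= 1.3 with pyarrow installed; a smaller range than 100M is used to keep it quick):

    import pandas as pd

    s = pd.Series(["a%09i" % i for i in range(1_000_000)], dtype="string[pyarrow]")
    print(s.memory_usage(deep=True))    # much closer to the raw data size than a Python set

    # Note: membership here is a linear scan, not an O(1) hash lookup
    print((s == "a000004567").any())    # True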

user2505961
  • Interesting! But if it is a list, the test `b'hello' in L` will be far slower than a set, what do you think? – Basj Feb 22 '22 at 11:01
  • Well yes, that for sure. Depending on your application you might get away with using probabilistic data structures like a bloom-filter: https://python.land/bloom-filter – user2505961 Feb 22 '22 at 11:12
  • Actually you can use the bloom filter independently of your application. Use it to pre-filter your data and only run the more expensive but accurate membership check for the positives. This is necessary since the bloom filter can give you false positives – user2505961 Feb 22 '22 at 11:26
  • I'm currently trying this @user2505961, and the bloom filter works great for a pre-filtering. To do the final check, I currently do `b"hello" in L` but this is quite slow in my case (1 sec per lookup). Is there a way to work with a sorted list (O(log n) lookup) instead of a list in Python? – Basj Feb 22 '22 at 11:38
  • You can use the list as an open-addressed hash table: https://en.wikipedia.org/wiki/Open_addressing – Matt Timmermans Feb 22 '22 at 14:05