With standard sets/strings that's simply what you get, since there's significant overhead in Python. But you could use a different data structure. Here I propose one that not only doesn't take much more space than your raw data but actually takes less, and that is fairly fast and simple.
New Data Structure
In a previous comment you said "the only set operation I need is membership lookup "test" in s" and "I'm expecting < 50 microseconds per lookup for a 100M items set".
So here's an idea: Split each string into a prefix of length p and a suffix (of length 10-p). Then put your data into a dict suffixes mapping each prefix to one space-separated string of its suffixes. For example, if you used p = 4 and had only the two strings "abcdefghij" and "abcdklmnop", your dict would be {'abcd': 'efghij klmnop'}. And you look a string up with word[p:] in suffixes[word[:p]].
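A minimal sketch of that idea (the helper names build_suffix_dict and contains are mine, just for illustration):

from collections import defaultdict

def build_suffix_dict(words, p=4):
    # Group the words by their length-p prefix and store each group as one
    # big space-separated string of suffixes, so per-suffix overhead is tiny.
    groups = defaultdict(list)
    for word in words:
        groups[word[:p]].append(word[p:])
    return {prefix: ' '.join(group) for prefix, group in groups.items()}

def contains(suffixes, word, p=4):
    # Membership test. All suffixes have the same length and contain no
    # spaces, so a query can only match a whole stored suffix and the
    # substring test gives no false positives.
    return word[p:] in suffixes.get(word[:p], '')

suffixes = build_suffix_dict(["abcdefghij", "abcdklmnop"])
assert suffixes == {'abcd': 'efghij klmnop'}
assert contains(suffixes, "abcdefghij")
assert not contains(suffixes, "abcdefghix")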
Space
I recommend p = 4. Then you have 26^4 = 456976 prefixes, and with 500 million strings you have about 1094 suffixes per prefix.
With 10 million strings and p = 4:
25 MB for the dict itself
18 MB for the dict keys (i.e., the prefixes)
86 MB for the dict values (i.e., the suffixes)
130 MB total
With 20 million strings and p = 4:
25 MB for the dict itself
18 MB for the dict keys (i.e., the prefixes)
156 MB for the dict values (i.e., the suffixes)
200 MB total
As expected, the size of the dict itself as well as its keys=prefixes didn't change, but the size of the values=suffixes increased by 70 MB. As it should, since each added string contributes seven bytes (its six-letter suffix plus one space) and we added ten million strings.
=> For 500 million strings, expect 3560 MB, which is 7.12 bytes per string. So even less than the 10 bytes per string you expected.
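For what it's worth, the extrapolation is just this arithmetic (a sketch; the 25 MB and 18 MB are the measured figures from above, and the small per-string-object overhead of the 456976 values is ignored):

# Extrapolating to 500 million strings with p = 4: the dict and its keys
# stay the same size (all 26**4 = 456976 prefixes exist already), and each
# string adds 7 bytes of suffix data (6 letters + 1 space).
n = 500 * 10**6
fixed_mb = 25 + 18             # dict + keys, measured above
suffix_mb = n * 7 / 10**6      # 3500 MB of raw suffix data
print fixed_mb + suffix_mb     # 3543 MB, in the same ballpark as the 3560 MB above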
Time
I can't test 500 million words, so I used n = 739645 and p = 2, with random words of length 8. That gives me 739645/26^2 ≈ 1094 suffixes of length 6 per prefix, the same as the full test would have. Then I measured the time to look up a million random words. On repl.it (which uses 64-bit Python) it took 12 microseconds per lookup. On my PC (where I use 32-bit Python) it took 2.1 microseconds per lookup. So I estimate you'll need around 5 microseconds per lookup, about 10 times as fast as you required.
Optimizations
You could make it even smaller and faster by not space-separating the suffixes but using CamelCase instead. For the small example above you'd have {'abcd': 'EfghijKlmnop'}. Then you'd have 6.12 bytes per string, and lookup speed should improve by a factor of 7/6.
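A sketch of that CamelCase variant (again, the helper names are made up):

from collections import defaultdict

def build_camelcase_dict(words, p=4):
    # Capitalize the first letter of each suffix instead of adding a space.
    groups = defaultdict(list)
    for word in words:
        groups[word[:p]].append(word[p:].capitalize())
    return {prefix: ''.join(group) for prefix, group in groups.items()}

def contains_camelcase(suffixes, word, p=4):
    # A capitalized query can only match at a capital letter, i.e. at the
    # start of a stored suffix, so there are still no false positives.
    return word[p:].capitalize() in suffixes.get(word[:p], '')

camel = build_camelcase_dict(["abcdefghij", "abcdklmnop"])
assert camel == {'abcd': 'EfghijKlmnop'}
assert contains_camelcase(camel, "abcdklmnop")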
You could also take advantage of the small alphabet. With six letters you have 26^6 possibilities, which take only log_256(26^6) ≈ 3.53 bytes. So you could convert each six-letter suffix to just four bytes. CamelCase wouldn't work anymore, but with space-separation you'd still only store five bytes per suffix, so overall 5.12 bytes per string. Lookup speed on already-converted suffixes should improve by a factor of 7/5 compared to the original idea, though the conversion itself might now take a significant amount of time.
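A possible packing scheme for that, written (like the test code below) for Python 2, where the packed values are plain byte strings. The aligned-position check in the lookup is my addition: a packed value can itself contain the space byte, so a naive substring test could match across record boundaries.

import struct
from collections import defaultdict

def pack_suffix(suffix):
    # Interpret the six letters as a base-26 number; 26**6 < 2**32, so it
    # fits into four bytes.
    value = 0
    for c in suffix:
        value = value * 26 + (ord(c) - ord('a'))
    return struct.pack('>I', value)

def build_packed_dict(words, p=4):
    groups = defaultdict(list)
    for word in words:
        groups[word[:p]].append(pack_suffix(word[p:]))
    return {prefix: ' '.join(group) for prefix, group in groups.items()}

def contains_packed(suffixes, word, p=4):
    blob = suffixes.get(word[:p], '')
    packed = pack_suffix(word[p:])
    # Each record occupies 5 bytes (4 packed bytes + 1 space), so a genuine
    # hit starts at a multiple of 5; anything else is a cross-boundary match.
    i = blob.find(packed)
    while i != -1:
        if i % 5 == 0:
            return True
        i = blob.find(packed, i + 1)
    return False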
Hmm, actually, p = 5 might be better. It should take about the same amount of memory overall and be much faster, since instead of 1094 suffixes per prefix you'd only group 42, and searching these suffix group strings is what takes most of the time. For 500 million strings I think the total size would rise from 3560 MB to about 4180 MB. On my PC, lookups then average 0.32 microseconds (simulated with 739645 strings of length 8 and p = 3). And since five letters fit into three bytes, you could save 1000 MB on the keys and 1000 MB on the suffixes and end up with 2180 MB total, which would be 4.36 bytes per string. Though the conversion to three-byte strings might take longer than the actual lookups. You could create another table storing the conversions so you could look them up, which would be fast but cost significant extra memory again. Another idea: the keys=prefixes could be converted to ints from 0 to 26^5. That should save about 200 MB and might be fast.
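The prefix-to-int conversion from that last idea could look like this (a sketch; how much it actually saves depends on Python's dict and int internals):

def prefix_to_int(prefix):
    # Interpret the prefix as a base-26 number, e.g. a five-letter prefix
    # maps into the range 0 .. 26**5 - 1.
    key = 0
    for c in prefix:
        key = key * 26 + (ord(c) - ord('a'))
    return key

# The lookup then becomes:
#     word[p:] in suffixes.get(prefix_to_int(word[:p]), '')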
Test Code
The test code I used to produce the above outputs:
import random
from collections import defaultdict
from sys import getsizeof as sizeof
from timeit import timeit

# Settings: Number of strings and prefix length
n = 10*1000*1000
p = 4

def randstr():  # random string of length 10
    return ''.join(random.choice('abcdefghijklmnopqrstuvwxyz') for i in range(10))

# Build and store the test data (n random strings)
suffixes = defaultdict(str)
for i in xrange(n):
    word = randstr()
    prefix, suffix = word[:p], word[p:]
    suffixes[prefix] += suffix + ' '

# Show the sizes
dict_size = sizeof(suffixes)
keys_size = sum(map(sizeof, suffixes.keys()))
values_size = sum(map(sizeof, suffixes.values()))
print dict_size / 10**6, 'MB for the dict itself'
print keys_size / 10**6, 'MB for the dict keys (i.e., the prefixes)'
print values_size / 10**6, 'MB for the dict values (i.e., the suffixes)'
print (dict_size + keys_size + values_size) / 10**6, 'MB total'

# Speed test: a million lookups of random words (mostly misses)
words = [randstr() for _ in xrange(10**6)]
t = timeit(lambda: sum(word[p:] in suffixes[word[:p]] for word in words), number=1)
print '%.2f microseconds per lookup' % t  # seconds per 10**6 lookups = microseconds per lookup