With standard sets/strings that's simply what you get, since there's significant overhead in Python. But you could use a different data structure. Here I propose one that not only doesn't take much more space than your raw data but actually takes less, and that is fairly fast and simple.
New Data Structure
In a previous comment you said "the only set operation I need is membership lookup "test" in s" and "I'm expecting < 50 microseconds per lookup for a 100M items set".
So here's an idea: Split each string into a prefix of length p and a suffix (of length 10-p). Then put your data into a dict suffixes mapping each prefix to one space-separated string of its suffixes. For example, if you used p = 4 and had only the two strings "abcdefghij" and "abcdklmnop", your dict would be {'abcd': 'efghij klmnop'}. And you look a string up with word[p:] in suffixes[word[:p]].
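A minimal sketch of that idea (the helper names build_suffix_dict and contains are mine, just for illustration):

from collections import defaultdict

def build_suffix_dict(words, p=4):
    # Group the words by their length-p prefix and store each group as one
    # big space-separated string of suffixes, so per-suffix overhead is tiny.
    groups = defaultdict(list)
    for word in words:
        groups[word[:p]].append(word[p:])
    return {prefix: ' '.join(group) for prefix, group in groups.items()}

def contains(suffixes, word, p=4):
    # Membership test. All suffixes have the same length and contain no
    # spaces, so a query can only match a whole stored suffix and the
    # substring test gives no false positives.
    return word[p:] in suffixes.get(word[:p], '')

suffixes = build_suffix_dict(["abcdefghij", "abcdklmnop"])
assert suffixes == {'abcd': 'efghij klmnop'}
assert contains(suffixes, "abcdefghij")
assert not contains(suffixes, "abcdefghix")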
Space
I recommend p = 4. Then you have 26^4 = 456976 prefixes, and with 500 million strings you have about 1094 suffixes per prefix.
With 10 million strings and p = 4:
25 MB for the dict itself
18 MB for the dict keys (i.e., the prefixes)
86 MB for the dict values (i.e., the suffixes)
130 MB total
With 20 million strings and p = 4:
25 MB for the dict itself
18 MB for the dict keys (i.e., the prefixes)
156 MB for the dict values (i.e., the suffixes)
200 MB total
As expected, the size of the dict itself as well as its keys=prefixes didn't change, but the size of the values=suffixes increased by 70 MB. As it should, since each added string contributes seven bytes (its six-letter suffix plus one space) and we added ten million strings.
=> For 500 million strings, expect 3560 MB, which is 7.12 bytes per string. So even less than the 10 bytes per string you expected.
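For what it's worth, the extrapolation is just this arithmetic (a sketch; the 25 MB and 18 MB are the measured figures from above, and the small per-string-object overhead of the 456976 values is ignored):

# Extrapolating to 500 million strings with p = 4: the dict and its keys
# stay the same size (all 26**4 = 456976 prefixes exist already), and each
# string adds 7 bytes of suffix data (6 letters + 1 space).
n = 500 * 10**6
fixed_mb = 25 + 18             # dict + keys, measured above
suffix_mb = n * 7 / 10**6      # 3500 MB of raw suffix data
print fixed_mb + suffix_mb     # 3543 MB, in the same ballpark as the 3560 MB above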
Time
I can't test 500 million words, so I used n = 739645 and p = 2, with random words of length 8. That gives me 739645/26^2 ≈ 1094 suffixes of length 6 per prefix, the same as the full test would have. Then I measured the time to look up a million random words. On repl.it (which uses 64-bit Python) it took 12 microseconds per lookup. On my PC (where I use 32-bit Python) it took 2.1 microseconds per lookup. So I estimate you'll need around 5 microseconds per lookup, about 10 times as fast as you required.
Optimizations
You could make it even smaller and faster by not space-separating the suffixes but using CamelCase instead. For the small example above you'd have {'abcd': 'EfghijKlmnop'}. Then you'd have 6.12 bytes per string, and lookup speed should improve by a factor of 7/6.
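A sketch of that CamelCase variant (again, the helper names are made up):

from collections import defaultdict

def build_camelcase_dict(words, p=4):
    # Capitalize the first letter of each suffix instead of adding a space.
    groups = defaultdict(list)
    for word in words:
        groups[word[:p]].append(word[p:].capitalize())
    return {prefix: ''.join(group) for prefix, group in groups.items()}

def contains_camelcase(suffixes, word, p=4):
    # A capitalized query can only match at a capital letter, i.e. at the
    # start of a stored suffix, so there are still no false positives.
    return word[p:].capitalize() in suffixes.get(word[:p], '')

camel = build_camelcase_dict(["abcdefghij", "abcdklmnop"])
assert camel == {'abcd': 'EfghijKlmnop'}
assert contains_camelcase(camel, "abcdklmnop")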
You could also take advantage of the small alphabet. With six letters you have 26^6 possibilities, which take only log_256(26^6) ≈ 3.53 bytes. So you could convert each six-letter suffix to just four bytes. CamelCase wouldn't work anymore, but with space-separation you'd still only store five bytes per suffix, so overall 5.12 bytes per string. Lookup speed on already-converted suffixes should improve by a factor of 7/5 compared to the original idea, though the conversion itself might now take a significant amount of time.
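A possible packing scheme for that, written (like the test code below) for Python 2, where the packed values are plain byte strings. The aligned-position check in the lookup is my addition: a packed value can itself contain the space byte, so a naive substring test could match across record boundaries.

import struct
from collections import defaultdict

def pack_suffix(suffix):
    # Interpret the six letters as a base-26 number; 26**6 < 2**32, so it
    # fits into four bytes.
    value = 0
    for c in suffix:
        value = value * 26 + (ord(c) - ord('a'))
    return struct.pack('>I', value)

def build_packed_dict(words, p=4):
    groups = defaultdict(list)
    for word in words:
        groups[word[:p]].append(pack_suffix(word[p:]))
    return {prefix: ' '.join(group) for prefix, group in groups.items()}

def contains_packed(suffixes, word, p=4):
    blob = suffixes.get(word[:p], '')
    packed = pack_suffix(word[p:])
    # Each record occupies 5 bytes (4 packed bytes + 1 space), so a genuine
    # hit starts at a multiple of 5; anything else is a cross-boundary match.
    i = blob.find(packed)
    while i != -1:
        if i % 5 == 0:
            return True
        i = blob.find(packed, i + 1)
    return False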
Hmm, actually, p = 5 might be better. It should take about the same amount of memory overall and be much faster, since instead of 1094 suffixes per prefix you'd only group 42, and searching these suffix group strings is what takes most of the time. For 500 million strings I think the total size would rise from 3560 MB to about 4180 MB. On my PC, lookups then average 0.32 microseconds (simulated with 739645 strings of length 8 and p = 3). And since five letters fit into three bytes, you could save 1000 MB on the keys and 1000 MB on the suffixes and end up with 2180 MB total, which would be 4.36 bytes per string. Though the conversion to three-byte strings might take longer than the actual lookups. You could create another table storing the conversions so you could look them up, which would be fast but cost significant extra memory again. Another idea: the keys=prefixes could be converted to ints from 0 to 26^5. That should save about 200 MB and might be fast.
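The prefix-to-int conversion from that last idea could look like this (a sketch; how much it actually saves depends on Python's dict and int internals):

def prefix_to_int(prefix):
    # Interpret the prefix as a base-26 number, e.g. a five-letter prefix
    # maps into the range 0 .. 26**5 - 1.
    key = 0
    for c in prefix:
        key = key * 26 + (ord(c) - ord('a'))
    return key

# The lookup then becomes:
#     word[p:] in suffixes.get(prefix_to_int(word[:p]), '')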
Test Code
The test code I used to produce the above outputs:
import random
from collections import defaultdict
from sys import getsizeof as sizeof
from timeit import timeit

# Settings: Number of strings and prefix length
n = 10*1000*1000
p = 4

def randstr():  # random string of length 10
    return ''.join(random.choice('abcdefghijklmnopqrstuvwxyz') for i in range(10))

# Build and store the test data (n random strings)
suffixes = defaultdict(str)
for i in xrange(n):
    word = randstr()
    prefix, suffix = word[:p], word[p:]
    suffixes[prefix] += suffix + ' '

# Show the sizes
dict_size = sizeof(suffixes)
keys_size = sum(map(sizeof, suffixes.keys()))
values_size = sum(map(sizeof, suffixes.values()))
print dict_size / 10**6, 'MB for the dict itself'
print keys_size / 10**6, 'MB for the dict keys (i.e., the prefixes)'
print values_size / 10**6, 'MB for the dict values (i.e., the suffixes)'
print (dict_size + keys_size + values_size) / 10**6, 'MB total'

# Speed test: a million lookups of random words (mostly misses)
words = [randstr() for _ in xrange(10**6)]
t = timeit(lambda: sum(word[p:] in suffixes[word[:p]] for word in words), number=1)
print '%.2f microseconds per lookup' % t  # seconds per 10**6 lookups = microseconds per lookup