5

I am trying to reduce the memory-consumption of a python dict, which in my case serves as a word-->document_id "inverted index". Each word is hashed as an integer, which takes up 24 bytes.

I was wondering if I can convert each element within dict's values and each key within dict to a bitarray instead. I've noticed that the max value of any encountered int is less than 2^22, so I can maybe just allocate a bit-array of "size 22".

How can this be done? So far I've seen gmpy2 and bitarray libraries, as well as std::bitset in the C++ stdlib, which I can use with Cython. I've read from this post that bitarray is not as fast as gmpy. In gmpy, I am not sure how to set the size. Finally, I wonder if the memory-overhead of gmpy or bitarray objects in Python is worth it, when I can just use std::bitset, which probably uses the least memory of all.

Community
  • 1
  • 1
richizy
  • 2,002
  • 3
  • 21
  • 26

1 Answers1

3
>>> sys.getsizeof(1)
24

That's 24 bytes, just for an integer, on my machine. For Python objects, object overhead is going to swamp all other costs for such small values. Even if you could shave off 10 bits (which you can't; memory allocation doesn't work that way), it's peanuts compared to the refcount, the type pointer, and the dict entry itself (which isn't included in that 24 number), even before you get to the actual data.

bitarray isn't going to help; it's probably bigger than an int. I don't know about std::bitset, since I'm not sure what overhead Cython adds to it, but it's going to have overhead for things like the bit count. If anything, I'd expect a Cython int to work best, but it probably needs to be converted to a regular Python int to go in a dict.

user2357112
  • 260,549
  • 28
  • 431
  • 505
  • I can use `std::unordered_map` to store C-typed `unsigned int`. But you're saying it's just better to implement it without bitarrays..? – richizy Feb 19 '14 at 00:49
  • @richizy: Bitarrays won't help. I think an `unordered_map` may be your best shot. – user2357112 Feb 19 '14 at 00:50
  • okay.. I never realized just how insane how much memory references to Python primitives can take up – richizy Feb 19 '14 at 00:51