6

Apparently integers costs 24 bytes in Python. I can understand that it does so because of extra bells and whistles of representing unbounded number. However it looks like boolean data types also cost whooping 24 bytes even though it might ever represent only two values. Why?

Edit: I'm not asking for best way to store bools. I'm already aware of NumPy, BitArray etc from other answers. My question is why, not how. Just to be clear and focused about that I've removed 2nd part of the question.

Shital Shah
  • 63,284
  • 17
  • 238
  • 185
  • 3
    1. Because booleans are integer subclasses. 2. No, if you care about that Python may not be low-level enough for you (or you need to use something like `numpy` with it). – jonrsharpe Dec 29 '15 at 00:27
  • 1
    use ctypes I guess ... but really if you are worried about this python might not be the right choice of language for this particular project ... – Joran Beasley Dec 29 '15 at 00:28
  • "I can understand that it does so because of extra bells and whistles of representing unbounded number" - nope. It's stuff like the type pointer and the reference count. On Python 2, `long`, the actual bignum type, has even *bigger* instances. – user2357112 Dec 29 '15 at 00:29
  • I don't think "deal with it" is right answer. I assume Python designers are very smart people and they must have some reason to let boolean occupy massive 24 bytes in memory. This design not only wastes memory but also slows down operations with this type. There has to be good reason. – Shital Shah Dec 29 '15 at 00:32
  • 1
    `True` and `False` are singletons, so it's not wasting very *much* memory! – jonrsharpe Dec 29 '15 at 00:36
  • @jonrsharpe - I was suspecting it might be subclass of integer but still it doesn't make sense. Interpreter can always dynamically cast boolean to int when needed. There is no reason to waste so much memory right of the bat. – Shital Shah Dec 29 '15 at 00:36
  • 2
    Also relevant: http://stackoverflow.com/q/8169001/3001761, **which answers what you've changed your question to** from a historical perspective. – jonrsharpe Dec 29 '15 at 00:40

1 Answers1

5

A bool may be pretty huge for what it represents, but there are only two of them. A list full of Trues only contains 4- or 8-byte references to the one canonical True object.

If 8 bytes is still too big, and you really want to use Python for whatever it is you're doing, you could consider using an array type like that provided by the built-in array module or NumPy. These offer 1-byte-per-bool representations. If this is still too much, you could use a bitset, either manually with Python's built-in bignums or with something like BitVector from PyPI. These options are likely to slow your program way down. Some of them can offer speed improvements, but only if you take advantage of features that let you push work out of interpreted code and into C.

user2357112
  • 260,549
  • 28
  • 431
  • 505
  • 1
    I know the alternatives of using NumPy, BitVector etc. I'm trying to figure out the internal reasoning that Python designers used to let boolean occupy 24 bytes. Most of the things in Python is very well thought out but this... – Shital Shah Dec 29 '15 at 00:38
  • 2
    @ShitalShah: If your program is so pressed for space that you need to worry about the 48 bytes consumed by `True` and `False`, you're probably not working on a machine with an operating system, let alone a Python interpreter. – user2357112 Dec 29 '15 at 00:39
  • @ShitalShah again, if you're worried about this sort of thing, *stop using Python!* There are many lower-level languages if you want to do this kind of micro-optimisation; that's not a design goal for Python. – jonrsharpe Dec 29 '15 at 00:40
  • 5
    @johnsharpe - I'm not "worried" about this. The question is simply trying to understand pro and cons of language design decision. I've searched enough and I can't find anyone giving insights in to this design decision. This is not a duplicate question. – Shital Shah Dec 29 '15 at 00:44
  • 7
    @ShitalShah: As everyone has noted, `True` and `False` are singleton objects, so the incremental cost to store an additional boolean value is just the pointer to reference them (4 or 8 bytes). The advantage to the overhead is consistent internal representation of Python objects. They all share a common header, and that common header includes some fixed overhead; a pointer to the type object (4-8 bytes), a reference count (4-8 bytes) and in the case of `bool`, the actual value, which with structure padding adds another 4-8 bytes. The common structure means they can be handled uniformly in C. – ShadowRanger Dec 29 '15 at 02:07