
I have a large structure of primitive types within nested dict/list. The structure is quite complicated and doesn't really matter.

If I represent it in Python's built-in types (dict/list/float/int/str) it takes 1.1 GB, but if I store it in protobuf and load it into memory it is significantly smaller: ~250 MB total.

I'm wondering how this can be. Are Python's built-in types inefficient in comparison to some external library?

Edit: The structure is loaded from a JSON file, so there are no internal references between objects.

user972014
    PBs are a binary serialisation format, for transferring and storing data. Python objects need to not only store their data but also code for methods, internal structures etc to be useful. You're comparing apples and oranges. – snakecharmerb Aug 16 '20 at 18:15
  • It could help to know how much data the structure really has, i.e. how many bytes in primitive values and how many pointers to other structures? – jpa Aug 17 '20 at 05:56
  • It is loaded from JSON, so no pointers at all. The JSON on disk is around 145 MB – user972014 Aug 17 '20 at 07:30
  • @snakecharmerb I understand. But it's still a very significant difference that I can't explain. The methods themselves are obviously loaded only once. Is there a different object type with fewer "internal structures"? – user972014 Aug 17 '20 at 08:11

2 Answers


"Simple" python objects, such as int or float, need much more memory than their C-counterparts used by protobuf.

Let's take a list of Python integers as an example, compared to an array of integers such as array.array('i', ...).

The analysis for array.array is simple: discarding some overhead of the array.array object itself, only 4 bytes (the size of a C integer) are needed per element.

The situation is completely different for a list of integers:

  • the list holds not the integer objects themselves but pointers to them (8 additional bytes per element on a 64-bit build)
  • even a small non-zero integer needs at least 28 bytes (see import sys; sys.getsizeof(1) returns 28): 8 bytes for reference counting, 8 bytes to hold a pointer to the integer's function table, 8 bytes for the size of the integer value (Python's integers can be much bigger than 2^32), and at least 4 bytes to hold the integer value itself.
  • there is also an overhead of about 4.5 bytes per object for memory management.

This means there is a whopping cost of 40.5 bytes per Python integer, compared to the possible 4 bytes (or 8 bytes if we used long long int, i.e. 64-bit integers).
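A quick way to see the difference, assuming a 64-bit CPython build (the exact numbers below are approximations and vary slightly between versions):

    import sys
    from array import array

    n = 1_000_000
    lst = list(range(n))
    arr = array('i', range(n))

    # The list stores a pointer per element plus a full int object for each
    # value; the handful of cached small ints are shared with the interpreter,
    # so treat the result as an approximation.
    list_total = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
    arr_total = sys.getsizeof(arr)  # the 4-byte C ints are stored inline

    print(f"list : {list_total / 1e6:.1f} MB")   # roughly 36 MB
    print(f"array: {arr_total / 1e6:.1f} MB")    # roughly 4 MB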

The situation is similar for a list of Python floats compared to an array of doubles (i.e. array.array('d', ...)), which needs only about 8 bytes per element. But for a list we have:

  • the list holds not the float objects themselves but pointers to them (8 additional bytes per element on a 64-bit build)
  • a float object needs 24 bytes (see import sys; sys.getsizeof(1.0) returns 24): 8 bytes for reference counting, 8 bytes to hold a pointer to the float's function table, and 8 bytes to hold the double value itself.
  • because 24 is a multiple of 8, the overhead for memory management is "only" about 0.5 bytes.

That means 32.5 bytes per Python float object vs. 8 bytes for a C double.
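The same kind of back-of-the-envelope measurement for floats (again assuming a 64-bit CPython; exact numbers vary a little):

    import sys
    from array import array

    n = 100_000
    lst = [float(i) for i in range(n)]
    arr = array('d', lst)

    # Per-element cost: a pointer in the list plus a 24-byte float object,
    # vs. a plain 8-byte C double stored inline in the array.
    lst_per_elem = (sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)) / n
    arr_per_elem = sys.getsizeof(arr) / n

    print(f"list : {lst_per_elem:.1f} bytes/element")   # roughly 32-33
    print(f"array: {arr_per_elem:.1f} bytes/element")   # roughly 8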

Internally, protobuf uses the same kind of representation for the data as array.array and thus needs much less memory (about 4-5 times less, as you observe). numpy.array is another example of a data type that holds raw C values and thus needs much less memory than a list.


If one doesn't need to search in a dictionary, then storing the key-value pairs in a list needs less memory than a dictionary, because one doesn't have to maintain a structure for searching (which imposes some memory cost) - this is another thing that leads to the smaller memory footprint of the protobuf data.
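A rough way to see the cost of that search structure (sys.getsizeof only counts the container here, not the keys and values it points to; exact numbers depend on the CPython version):

    import sys

    d = {i: i for i in range(1000)}
    flat = [x for kv in d.items() for x in kv]   # [k0, v0, k1, v1, ...]

    # Container overhead only:
    print(sys.getsizeof(d))     # roughly 36 KB: hash table with spare capacity
    print(sys.getsizeof(flat))  # roughly 17 KB: just 2000 pointers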


To answer your other question: there is no built-in module that is to Python's dict what array.array is to Python's list, so I'll use this opportunity to shamelessly plug a library of mine: cykhash.

Sets and maps from cykhash need less than 25% of the memory of Python's dict/set but are about as fast.

ead

This is normal, and it's all about the space vs. time tradeoff. Memory layout depends on the way a particular data structure is implemented, which in turn depends on how it is going to be used.

A general-purpose dictionary is typically implemented as a hashtable. It has a fixed-size list of buckets that store key-value pairs. The number of items in a dictionary can be smaller than, equal to or bigger than the number of buckets. If smaller, space is wasted. If bigger, dictionary operations take a long time. A hashtable implementation usually starts with a small initial bucket list, then grows it as new items are added to keep performance decent. However, resizing also requires rehashing, which is computationally very expensive, so whenever you do it you want to leave some room for growth. General-purpose dictionaries are a trade-off between space and time because they don't "know" how many elements they are supposed to contain and because there is no perfect hash function. But in a good-enough case, a general-purpose hashtable will give you near-O(1) performance.
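A small sketch of that growth behaviour in CPython (the sizes printed depend on the interpreter version, but the stepwise jumps with spare room are the point):

    import sys

    d = {}
    last = sys.getsizeof(d)
    print(f"  0 items -> {last} bytes")
    for i in range(100):
        d[i] = i
        size = sys.getsizeof(d)
        if size != last:                 # a resize (and rehash) just happened
            print(f"{len(d):3d} items -> {size} bytes")
            last = size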

When data is serialized, it's a different story. Data in transit does not change: you are not doing lookups on it, and it is not subject to garbage collection, boundary alignment and so on. This means you can simply pack keys and values one after another for space efficiency. You need virtually no metadata and no control structures as long as the values can be reconstructed. On the downside, manipulating packed data is very slow because all operations take O(n) time.
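As a toy illustration of packing keys and values one after another (this uses a fixed 4-byte encoding for simplicity; protobuf's real wire format uses tags and varints, but the trade-off is the same):

    import struct
    import sys

    pairs = [(i, i * 2) for i in range(1000)]

    as_dict = dict(pairs)                                          # fast lookups, more memory
    packed = b"".join(struct.pack("<ii", k, v) for k, v in pairs)  # compact, no lookup structure

    print(sys.getsizeof(as_dict))   # container alone: tens of KB
    print(len(packed))              # 8000 bytes total

    # Getting a value back out of the packed form means scanning it in O(n):
    def lookup(blob, key):
        for off in range(0, len(blob), 8):
            k, v = struct.unpack_from("<ii", blob, off)
            if k == key:
                return v
        return None

    assert lookup(packed, 10) == 20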

For this reason, you will almost always want to:

  • convert data from a time-efficient into a space-efficient format before sending it
  • convert data from a space-efficient into a time-efficient format after receiving it.

If you are using nested dictionaries (or lists, which are in many ways similar), the differences will add up and become even more pronounced. When you know the number of items in advance and the data does not change much, you can probably get some improvement by preallocating the memory for it, such as dict.fromkeys(range(count)).

jurez