14

I'm storing a lot of complex data in tuples/lists, but would prefer to use small wrapper classes to make the data structures easier to understand, e.g.

class Person:
    def __init__(self, first, last):
        self.first = first
        self.last = last

p = Person('foo', 'bar')
print(p.last)
...

would be preferable over

p = ['foo', 'bar']
print(p[1])
...

however there seems to be a horrible memory overhead:

l = [Person('foo', 'bar') for i in range(10000000)]
# ipython now takes 1.7 GB RAM

and

del l
l = [('foo', 'bar') for i in range(10000000)]
# now just 118 MB RAM

Why? Is there an obvious alternative solution that I didn't think of?

Thanks!

(I know, in this example the 'wrapper' class looks silly. But when the data becomes more complex and nested, it is more useful)

seb314
    `collections.namedtuple` seems like it was made for this purpose, but it takes around `1.1 GB` for your example. Not much better. – randomir Jul 15 '17 at 22:28
  • 2
    Look into `__slots__` or move to Python 3 for [key-sharing dictionaries](https://www.python.org/dev/peps/pep-0412/). – Ashwini Chaudhary Jul 15 '17 at 22:39
  • 1
    In the case of tuples, I believe it just references the same tuple 10 million times. When you create an object, either a class instance or a new tuple, it uses a lot more memory – Garr Godfrey Jul 15 '17 at 22:42
  • 1
    As indicated in the answers, your tuple example only creates a single tuple object. You should create a test case where you create a lot of *different* tuples vs custom objects and see how the performance is. – BrenBarn Jul 15 '17 at 22:42
  • 1
    try randomizing the values, you should get a different result. – Garr Godfrey Jul 15 '17 at 22:42
  • Related: [Is `namedtuple` really as efficient in memory usage as tuples? My test says NO](https://stackoverflow.com/q/41003081/846892) – Ashwini Chaudhary Jul 15 '17 at 22:42

4 Answers

24

As others have said in their answers, you'll have to generate different objects for the comparison to make sense.

So, let's compare some approaches.

tuple

l = [(i, i) for i in range(10000000)]
# memory taken by Python3: 1.0 GB

class Person

class Person:
    def __init__(self, first, last):
        self.first = first
        self.last = last

l = [Person(i, i) for i in range(10000000)]
# memory: 2.0 GB

namedtuple (tuple + __slots__)

from collections import namedtuple
Person = namedtuple('Person', 'first last')

l = [Person(i, i) for i in range(10000000)]
# memory: 1.1 GB

namedtuple is basically a class that extends tuple and uses __slots__ for all named fields, but it adds field getters and some other helper methods (you can see the exact code generated by passing verbose=True; that parameter was removed in Python 3.7).
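A quick sketch of what that buys you: named and positional access both work on the same object, because the generated class really is a tuple subclass.

```python
from collections import namedtuple

# Same Person definition as in the answer above.
Person = namedtuple('Person', 'first last')
p = Person('foo', 'bar')

# Field access by name and by index both work, since Person subclasses tuple.
print(p.first, p[1])         # foo bar
print(isinstance(p, tuple))  # True
```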

class Person + __slots__

class Person:
    __slots__ = ['first', 'last']
    def __init__(self, first, last):
        self.first = first
        self.last = last

l = [Person(i, i) for i in range(10000000)]
# memory: 0.9 GB

This is a trimmed-down version of namedtuple above. A clear winner, even better than pure tuples.
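To see where the savings come from, the per-instance sizes can be compared with `sys.getsizeof` (the exact byte counts vary by Python version and platform, so this sketch only compares the two, rather than asserting absolute numbers):

```python
import sys

class PersonDict:
    def __init__(self, first, last):
        self.first = first
        self.last = last

class PersonSlots:
    __slots__ = ('first', 'last')
    def __init__(self, first, last):
        self.first = first
        self.last = last

d = PersonDict(1, 2)
s = PersonSlots(1, 2)

# A regular instance pays for the object header *plus* a per-instance
# __dict__; a slotted instance stores its two references inline.
dict_total = sys.getsizeof(d) + sys.getsizeof(d.__dict__)
slot_total = sys.getsizeof(s)
print(dict_total, slot_total)
```

On any 64-bit CPython the slotted total comes out well below the dict-based total, which is the effect the 0.9 GB vs 2.0 GB numbers above are showing at scale.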

randomir
  • Thanks for the nice overview! In case anyone wonders how 2×10M integers can take up 1000 MB of memory, this seems to be due to the containing list + references: `import numpy as np` `l = np.array([(i, i) for i in range(10000000)])` will only take 189 MB (after taking 1 GB for a short time during construction). This doesn't work with the class instances though (references?). – seb314 Jul 16 '17 at 12:06
  • Actually, `np.array([(i, i) for i in range(10000000)])` will create a homogeneous 2-D array, `10000000x2`, of `dtype('int64')`, meaning the size of such array is `~ 8 x N_elem` bytes, or in this case `~160 MB`. – randomir Jul 16 '17 at 14:24
6

Using __slots__ decreases the memory footprint quite a bit (from 1.7 GB to 625 MB in my test), since each instance no longer needs to hold a dict to store the attributes.

class Person:
    __slots__ = ['first', 'last']
    def __init__(self, first, last):
        self.first = first
        self.last = last

The drawback is that you can no longer add attributes to an instance after it is created; the class only provides memory for the attributes listed in the __slots__ attribute.
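The restriction is easy to demonstrate (a minimal sketch; `middle` is just a hypothetical extra attribute):

```python
class Person:
    __slots__ = ['first', 'last']
    def __init__(self, first, last):
        self.first = first
        self.last = last

p = Person('foo', 'bar')
p.first = 'baz'            # listed in __slots__: assignment works

try:
    p.middle = 'qux'       # not listed in __slots__
except AttributeError:
    print('cannot add attributes outside __slots__')
```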

Arnaud P
chepner
  • 1
    I've corrected what I thought was a 'typo' in your answer, please roll back with my apologies if it wasn't. – Arnaud P Dec 18 '17 at 14:31
  • 1
    No, the correction was valid. It's the instance of `Person` to which you can no longer add new attributes. You probably can't add attributes to `first` or `last`, either, but for entirely different reasons :) – chepner Dec 18 '17 at 15:35
2

There is yet another way to reduce the memory occupied by objects: turn off support for cyclic garbage collection, in addition to dropping __dict__ and __weakref__. This is implemented in the recordclass library:

$ pip install recordclass

>>> import sys
>>> from recordclass import dataobject, make_dataclass

Create the class:

class Person(dataobject):
    first: str
    last: str

or

>>> Person = make_dataclass('Person', 'first last')

The result (Python 3.9, 64-bit):

>>> print(sys.getsizeof(Person(100,100)))
32

For a __slots__-based class we have (Python 3.9, 64-bit):

class PersonSlots:
    __slots__ = ['first', 'last']
    def __init__(self, first, last):
        self.first = first
        self.last = last

>>> print(sys.getsizeof(PersonSlots(100, 100)))
48

As a result, more memory savings are possible.

For the dataobject-based class:

l = [Person(i, i) for i in range(10000000)]
# memory size: 409 MB

For the __slots__-based class:

l = [PersonSlots(i, i) for i in range(10000000)]
# memory size: 569 MB
intellimath
-1

In your second example, you only create one object, because tuples are constants.

>>> l = [('foo', 'bar') for i in range(10000000)]
>>> id(l[0])
4330463176
>>> id(l[1])
4330463176

Classes have the overhead that attributes are stored in a dictionary; that's why namedtuples need only about half the memory.
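The identity claim above can be checked directly. Note this is a CPython implementation detail (constant folding of the tuple literal), not a language guarantee:

```python
# The literal tuple is constant-folded by CPython, so every element of the
# first list references the same object; tuples built at run time are
# distinct objects with equal contents.
l_literal = [('foo', 'bar') for i in range(5)]
l_built = [tuple(['foo', 'bar']) for i in range(5)]

print(len({id(t) for t in l_literal}))   # 1 on CPython: one shared object
print(len({id(t) for t in l_built}))     # 5: distinct objects
```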

Daniel
  • While it's true that tuples are constants, that doesn't explain the difference here. `[tuple(['foo', 'bar']) for i in range(N)]` creates N constant (but distinct) tuple objects. – vaultah Jul 15 '17 at 22:57
  • I didn't downvote, but the reason is not simply that "tuples are constant". It's basically a CPython optimization that works on some kinds of tuple literals; for example, `(1, 2, 3/1)` won't result in the same ID in CPython 2, because `3/1` can't be constant-folded in CPython 2. – Ashwini Chaudhary Jul 15 '17 at 22:59