
I have a Python list which consists of 80,000 lists. Each of these inner lists more or less has this format:

["012345", "MYNAME" "Mon", "A", 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]

Could you tell me approximately how much memory this list of 80,000 lists would consume?

And is it common/OK to use and operate on lists that big in Python? Most of the operations I do extract data from this list with list comprehensions.

Actually, what I would like to learn is: is Python fast enough to extract data from lists that big using list comprehensions? I want my script to be fast.

alwbtc
  • Do you have an actual problem with your memory? Large lists should be fine, but it depends on what you are trying to do. As it stands, there is no real question here. – Martijn Pieters Jan 21 '13 at 21:16
  • Related / possible duplicate: [How do I determine the size of an object in Python?](http://stackoverflow.com/q/449560) – Martijn Pieters Jan 21 '13 at 21:17
  • What I would like to learn is: is Python fast enough to extract data from lists that big using list comprehensions? I want my script to be fast. – alwbtc Jan 21 '13 at 21:18
  • Added this to my question. – alwbtc Jan 21 '13 at 21:19
  • Define 'fast'. Define what operations you want to execute. Define what is acceptable. What have you tried? Have you run into problems that we can help you with? – Martijn Pieters Jan 21 '13 at 21:20
  • As I said, I extract new lists from this large list using list comprehensions, and write this extracted data in a text file in a specific format. – alwbtc Jan 21 '13 at 21:23
  • It's the extractions that matter here. You have 10-12 MB of data here, which is peanuts. Yet you can easily kill performance by filtering incorrectly. Do your own tests with `timeit`; we cannot tell you if things will be fast or not. – Martijn Pieters Jan 21 '13 at 21:25
  • It is nice to hear this needs little memory :) – alwbtc Jan 21 '13 at 21:42
  • Calculating the size of an arbitrary Python object is complicated. See the question [Is there a memory profiler for python2.7?](http://stackoverflow.com/questions/4416654/is-there-a-memory-profiler-for-python2-7) and more to the point, the Activestate recipe [Size of Python objects (revised)](http://code.activestate.com/recipes/546530-size-of-python-objects-revised/). – martineau Jan 21 '13 at 22:19
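
Following Martijn Pieters's `timeit` suggestion above, a minimal timing sketch for this kind of extraction might look like the following; the row layout and the filter condition are illustrative stand-ins, not taken from the question:

import timeit

setup = '''
row = ["012345", "MYNAME", "Mon", "A"] + [20] * 30
rows = [row[:] for _ in xrange(80000)]
'''
# Extract the first field of every row whose fourth column is "A".
stmt = '[r[0] for r in rows if r[3] == "A"]'
print timeit.timeit(stmt, setup=setup, number=10)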

5 Answers

In [39]: lis=["012345", "MYNAME" "Mon", "A", 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
     20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]

In [40]: k=[lis[:] for _ in xrange(80000)]

In [41]: k.__sizeof__()
Out[41]: 325664

In [42]: sys.getsizeof(k)  #after gc_head
Out[42]: 325676

As per the code in sysmodule.c, it looks like `sys.getsizeof` calls the `__sizeof__` method to get the size of an object:

   837     method = _PyObject_LookupSpecial(o, &PyId___sizeof__);
   838     if (method == NULL) {
   839         if (!PyErr_Occurred())
   840             PyErr_Format(PyExc_TypeError,
   841                          "Type %.100s doesn't define __sizeof__",
   842                          Py_TYPE(o)->tp_name);
   843     }
   844     else {
   845         res = PyObject_CallFunctionObjArgs(method, NULL);
   846         Py_DECREF(method);
   847     }

and then adds some gc overhead to it:

   860     /* add gc_head size */
   861     if (PyObject_IS_GC(o)) {
   862         PyObject *tmp = res;
   863         res = PyNumber_Add(tmp, gc_head_size);
   864         Py_DECREF(tmp);
   865     }
   866     return res;
   867 }
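
A quick way to see the two pieces together; the exact byte counts vary by platform and Python build, so none are hard-coded here:

import sys

lis = []
print lis.__sizeof__()                       # the object's own accounting
print sys.getsizeof(lis)                     # __sizeof__() plus the gc_head
print sys.getsizeof(lis) - lis.__sizeof__()  # the gc_head overhead alone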

We can also use the recursive `sizeof` recipe suggested in the docs to recursively calculate the size of each container:

In [17]: total_size(k)  #from recursive sizeof recipe
Out[17]: 13125767

In [18]: sum(y.__sizeof__() for x in k for y in x)
Out[18]: 34160000
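
The gap between the two figures is expected: `lis[:]` makes shallow copies, so all 80,000 inner lists share the same string and integer objects. The recursive recipe counts each distinct object once (it remembers the ids it has already seen), while the naive sum counts every reference:

print k[0] is k[1]        # False: the inner lists are distinct copies
print k[0][0] is k[1][0]  # True: but their elements are shared objects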
Ashwini Chaudhary
  • @alwbtc 1 MB = 2**20 bytes; 12800000./2**20=12.20703125. – Matteo Italia Jan 21 '13 at 21:20
  • My server has more than 96 GB memory, then it is OK I think. – alwbtc Jan 21 '13 at 21:24
  • -1: Note first that this depends on the operating system (and possibly the version of Python). On mine the same code gives a size of `328`. Secondly, it doesn't necessarily scale linearly. Try `sys.getsizeof([lis[:] for _ in xrange(80000)])`. – David Robinson Jan 21 '13 at 21:24
  • @alwbtc 12800000.0/(1024*1024) == 12.20703125 MB – Ashwini Chaudhary Jan 21 '13 at 21:24
  • A list of N lists would very likely not be the same size as N independent lists. – martineau Jan 21 '13 at 22:07
  • You want `sys.getsizeof(lis) + sum(map(sys.getsizeof, lis))` or something similar. The implementation given in the answer is broken. – Noctis Skytower Jan 21 '13 at 22:43
  • @DavidRobinson I guess `sum(lis.__sizeof__() for _ in xrange(80000))` is a better option here. I am not a `C` programmer, but looking at the [sysmodule.c](http://hg.python.org/cpython/file/6110ed94f86d/Python/sysmodule.c#l813) file, it seems like it calls the `__sizeof__` method to get the size of a particular object. – Ashwini Chaudhary Jan 21 '13 at 22:52
  • That's much better. BTW I didn't downvote your answer (although I was tempted before your update). – martineau Jan 21 '13 at 23:49

On my machine, using 32-bit Python 2.7.3, a list containing 80K copies of the exact list in your question takes about 10 MB. This was measured by comparing the memory footprints of two otherwise identical interpreters, one with the list and one without.
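
One way to reproduce that kind of footprint comparison from within a single process is the `resource` module; this is a sketch assuming a Unix-like system (`ru_maxrss` is reported in kilobytes on Linux, bytes on OS X):

import resource

def peak_rss():
    # Peak resident set size of this process so far.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
row = ["012345", "MYNAME", "Mon", "A"] + [20] * 30
l = [row[:] for _ in range(80000)]
print(peak_rss() - before)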

I have tried measuring the size with sys.getsizeof(), but that returned a clearly incorrect result:

>>> l=[["012345", "MYNAME" "Mon", "A", 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20] for i in range(80000)]
>>> sys.getsizeof(l)
325680
NPE
  • I don't think `getsizeof` recurses. The outer list-of-lists just stores pointers, and is thus not very large. You want to add the size of the contained lists too (80k * the size of the first). – Martijn Pieters Jan 21 '13 at 21:22
  • @MartijnPieters: That must be what happens. – NPE Jan 21 '13 at 21:22

sys.getsizeof(object, default) -> int

    Return the size of object in bytes.

Code

>>> import sys
>>> sys.getsizeof(["012345", "MYNAME" "Mon", "A", 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20])
160

It returns 160 bytes for your list. Multiplying that by 80,000 gives approximately 12.8 MB. (Measured on a 32-bit machine with Python 2.7.2; Python 3.2 gives the same result.)
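
Note that `sys.getsizeof` is shallow: it counts the list object itself (header plus element pointers) but not the strings and integers the list refers to, which is why the recursive approaches in other answers arrive at larger totals. A quick way to see the shallowness:

>>> import sys
>>> sys.getsizeof(["x" * 1000]) == sys.getsizeof(["x"])
True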

siddharthlatest

Taking the current (rev 13) code from the Size of Python objects (revised) recipe, placing it in a module called sizeof, and then applying it to your sample list gives the following (using 32-bit Python 2.7.3):

from sizeof import asizeof  # from http://code.activestate.com/recipes/546530

MB = 1024*1024
COPIES = 80000
lis=["012345", "MYNAME" "Mon", "A", 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
     20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]

lis_size = asizeof(lis)
print 'asizeof(lis): {} bytes'.format(lis_size)
list_of_lis_size = asizeof([lis[:] for _ in xrange(COPIES)])
print 'asizeof(list of {:,d} copies of lis): {:,d} bytes ({:.2f} MB)'.format(
                         COPIES, list_of_lis_size, list_of_lis_size/float(MB))
asizeof(lis): 272 bytes
asizeof(list of 80,000 copies of lis): 13,765,784 bytes (13.13 MB)
martineau

Notice the following interaction with the interpreter:

>>> import sys
>>> array = ['this', 'is', 'a', 'string', 'array']
>>> sys.getsizeof(array)
56
>>> list(map(sys.getsizeof, array))
[29, 27, 26, 31, 30]
>>> sys.getsizeof(array) + sum(map(sys.getsizeof, array))
199
>>> 

The answer in this specific case is to use `sys.getsizeof(array) + sum(map(sys.getsizeof, array))` to find the size of a list of strings. However, the following is a more complete implementation that takes into account containers, class instances, and the usage of `__slots__`.

import sys

def sizeof(obj):
    return _sizeof(obj, set())

def _sizeof(obj, memo):
    # Add this object's size just once.
    location = id(obj)
    if location in memo:
        return 0
    memo.add(location)
    total = sys.getsizeof(obj)
    # Look for any class instance data.
    try:
        obj = vars(obj)
    except TypeError:
        pass
    # Handle containers holding objects.
    if isinstance(obj, (tuple, list, frozenset, set)):
        for item in obj:
            total += _sizeof(item, memo)
    # Handle the two-sided nature of dicts.
    elif isinstance(obj, dict):
        for key, value in obj.items():
            total += _sizeof(key, memo) + _sizeof(value, memo)
    # Handle class instances using __slots__.
    elif hasattr(obj, '__slots__'):
        for key, value in ((name, getattr(obj, name))
            for name in obj.__slots__ if hasattr(obj, name)):
            total += _sizeof(key, memo) + _sizeof(value, memo)
    return total
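
A hypothetical run on data shaped like the question's might look like this; the exact figure varies by platform and Python version:

row = ["012345", "MYNAME", "Mon", "A"] + [20] * 30
data = [row[:] for _ in range(80000)]
print(sizeof(data))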

Edit:

After returning to this problem a while later, the following alternative was devised. Please note that it does not work well with infinite iterators. This code is best suited to static data structures that are ready for analysis.

import sys

sizeof = lambda obj: sum(map(sys.getsizeof, explore(obj, set())))

def explore(obj, memo):
    loc = id(obj)
    if loc not in memo:
        memo.add(loc)
        yield obj
        # Handle instances with slots.
        try:
            slots = obj.__slots__
        except AttributeError:
            pass
        else:
            for name in slots:
                try:
                    attr = getattr(obj, name)
                except AttributeError:
                    pass
                else:
                    yield from explore(attr, memo)
        # Handle instances with dict.
        try:
            attrs = obj.__dict__
        except AttributeError:
            pass
        else:
            yield from explore(attrs, memo)
        # Handle dicts or iterables.
        for name in 'keys', 'values', '__iter__':
            try:
                attr = getattr(obj, name)
            except AttributeError:
                pass
            else:
                for item in attr():
                    yield from explore(item, memo)
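
As a usage sketch, the generator version also walks dicts and nested containers, counting shared objects only once; the sample data here is made up:

sample = {"rows": [["012345", 20, 20]] * 3, "count": 3}
print(sizeof(sample))  # the inner list, though referenced 3 times, is counted once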
Noctis Skytower