
I generate a numpy array in Python using the simple code below. When I print the object's size in the console, I learn that the object uses 228 MB of memory. But when I look at what happens to my actual RAM, I get a very different result: in the System Monitor's resources tab I can see an increase of 1.3 GB in memory usage while generating this array. To make sure it is caused by Python, I also watched the process tab. Same thing there: the process "python3.5" grows to 1.3 GB of memory usage over the roughly 10 seconds the script needs to finish.

This means Python takes up almost six times as much memory as it should for this object. I would understand a certain memory overhead for managing the objects, but not a six-fold increase. I have not found an understandable explanation for why I can't use Python to, for example, read in files that are bigger than one sixth of my memory.

import sys
import numpy as np

scale = 30000000
# build the array by passing a list comprehension, i.e. a temporary Python list
vector1 = np.array([x for x in range(scale)])
# vector1 = np.array(list(range(scale)))  # same thing here
print(sys.getsizeof(vector1) / 1024 / 1024, 'MB')  # reports roughly 228 MB
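To rule out a misreading of the System Monitor, the same gap can also be checked from inside the script via the process's peak resident set size. This is a minimal, Linux-specific sketch meant to be run right after the snippet above (on Linux, `ru_maxrss` is reported in KiB; other platforms use different units):

import resource

# peak resident set size of this process so far, in KiB on Linux
peak_rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(peak_rss_kib / 1024, 'MB peak RSS')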

Thanks for any understandable explanation for this.

Edit: And for solutions to fix it.

P. Zeek
  • Can you provide the output / a screenshot of your memory-usage assessment? Measuring the memory consumption of a process is far from trivial most of the time; lots of people are known to get the interpretation badly wrong. – agg3l Oct 21 '16 at 22:02
  • As described, I used Ubuntu's System Monitor. I don't see how it could be interpreted wrongly if a process grows in a linear manner up to 1.3 GB. It's not a spike but an incremental growth. It's reproducible on other machines; I tried it before I posted. – P. Zeek Oct 21 '16 at 22:30
  • Further, if I add one more zero to the scale variable, the process should take 2.3 GB, which is easily available. However, it again exceeds that number by far, the system runs out of memory, the swap fills up, and all running applications become practically unresponsive. Pressing the power button until the machine turns off and then booting again seems to be the only way out of this. – P. Zeek Oct 21 '16 at 22:30
  • Virtual/Reserved/Committed/Shared memory entries are all there in system diagnostic tools. Not everyone uses the Ubuntu GUI and its bundled tools daily, you know... – agg3l Oct 21 '16 at 22:37
  • `numpy` seeks to reduce the overhead of Python objects, but when you do `[x for x in range(scale)]`, well, you created a big one, even if it's only needed for a short period of time. That memory is sitting in the process heap, available for future allocations, but it is there. – tdelaney Oct 21 '16 at 22:41
  • Here is an interesting discussion on creating numpy arrays: http://stackoverflow.com/questions/367565/how-do-i-build-a-numpy-array-from-a-generator – tdelaney Oct 21 '16 at 22:49

1 Answer


I believe you can fix this by using the np.arange function.

vector1 = np.arange(scale)

I could reproduce the same behaviour when building the numpy array by passing a list comprehension (i.e. a list) to the `np.array` constructor. At first it looked as if the list used as the argument was not being garbage-collected, and I could only speculate as to why. What actually happens is explained below.

tdelaney's comment:

The list is being deleted because its reference count goes to zero. Python returns the memory to the heap, where it can be used when creating new objects. The heap will not give the memory back to the system right away. That's why the process memory usage is still high.
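If the array really has to be built from a Python iterable rather than a plain range, `np.fromiter` is another way to sidestep the temporary list. A small sketch under that assumption (the `int64` dtype and the `count` argument are my choices, not anything from the question):

import numpy as np

scale = 30000000
vector1 = np.arange(scale)  # no intermediate Python list at all
# consumes the iterable directly; count lets numpy preallocate the result
vector2 = np.fromiter(range(scale), dtype=np.int64, count=scale)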

juanpa.arrivillaga
  • The list is being deleted because its reference count goes to zero. Python returns the memory to the heap, where it can be used when creating new objects. The heap will not give the memory back to the system right away. That's why the process memory usage is still high. – tdelaney Oct 21 '16 at 22:35
  • Well, there you go. – juanpa.arrivillaga Oct 21 '16 at 22:38
  • This is the answer though... use `numpy` as much as possible to avoid creating more expensive Python objects. Just remove that part about garbage collection and this nails it. – tdelaney Oct 21 '16 at 22:39
  • If the memory of the list is returned to the heap and kept reserved, then the memory consumption should be twice the size of the data, not six times as much, as I understand it. – P. Zeek Oct 21 '16 at 22:44
  • @P.Zeek No, a Python list takes *significantly* more memory than a numpy array. Python lists are essentially resizable arrays of object pointers. Each `int` in Python is an object itself, taking up about 24 bytes. Numpy arrays are essentially C arrays with an object-oriented Python wrapper. – juanpa.arrivillaga Oct 21 '16 at 22:47
  • In regard to juanpa.arrivillaga's original answer: the list argument is surely not five times the size of my array. Also, I don't actually need a vector like this. What I'm really trying to do is read a BytesIO object from memory into a numpy array. I just posted a simple example that demonstrates the problem I'm dealing with, instead of posting 200 lines of code. – P. Zeek Oct 21 '16 at 22:47
  • @P.Zeek Yes, a Python list is easily five or six times the size of an equivalent numpy array. – juanpa.arrivillaga Oct 21 '16 at 22:48
  • @P.Zeek If you are trying to read from a BytesIO, try looking at `numpy.fromfile`. – juanpa.arrivillaga Oct 21 '16 at 22:50
  • Python list: (24 bytes * 30000000) / 1024 / 1024 = 686 MB, plus the 228 MB array, gives 914 MB. Still not at 1.3 GB. – P. Zeek Oct 21 '16 at 22:53
  • Thanks for the tip, but I was using numpy.loadtxt, since it allows reading and unpacking the data all in one step. Unfortunately that is only feasible for small files, and it is very slow. pandas.read_csv promises better performance. – P. Zeek Oct 21 '16 at 22:56
  • @P.Zeek Again, you are not fully accounting for the memory usage of a Python list. Anyway, in Python 3 each individual `int` object, which the list only stores a pointer to, is closer to 28 bytes, and can be bigger for larger `int`s since in Python 3 `int`s are arbitrary-sized. – juanpa.arrivillaga Oct 21 '16 at 23:00
  • @P.Zeek So ((28 + 8) * 30000000) / 1024**2, assuming a 64-bit system (8-byte pointers), gives around 1030 MB just for the list. – juanpa.arrivillaga Oct 21 '16 at 23:06
  • I now doubt the explanation that, because of object pointer arrays, a Python list is easily six times bigger than the numpy array. mylist = [x for x in range(30000000)]; print(((sys.getsizeof(mylist)/1024)/1024.0), 'MB') results in 252 MB for the Python list, not 914 MB as I was willing to assume earlier. – P. Zeek Oct 21 '16 at 23:07
  • `sys.getsizeof(mylist)` only accounts for the size of the *POINTERS*. On top of that you need about 28-30 bytes, say 29, per element for the `int` objects themselves. – juanpa.arrivillaga Oct 21 '16 at 23:10
  • @P.Zeek In other words, try `sys.getsizeof(mylist) + sum(sys.getsizeof(x) for x in mylist)` – juanpa.arrivillaga Oct 21 '16 at 23:12
  • @ShadowRanger Yes, I didn't want to get into the small-int cache. But that is a good point. Anyway, I was just trying to explain this particular case. – juanpa.arrivillaga Oct 21 '16 at 23:16
  • The documentation of sys.getsizeof is somewhat misleading when it says 'Return the size of an object in bytes'. Since pointers are not an everyday topic in Python, I assumed the integer objects are actually stored inside the list object, not just pointers to them. But when I try mylist = ['aaa' for x in range(30000000)], sys.getsizeof(mylist) stays the same as before, while the sys.getsizeof sum over the elements increases. Now the math starts to add up; you're right. Thanks for explaining! – P. Zeek Oct 21 '16 at 23:30
  • @P.Zeek Essentially, almost everything you interact with in Python is an object reference. – juanpa.arrivillaga Oct 21 '16 at 23:39
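To make the accounting from the comments above concrete, here is a rough sketch. The scale is reduced so it runs quickly; note that because of CPython's small-int cache, integers below 257 are shared objects, so the per-element sum counts those few more than once:

import sys
import numpy as np

scale = 1000000  # smaller than in the question, same arithmetic
mylist = [x for x in range(scale)]
vector = np.arange(scale)

pointer_bytes = sys.getsizeof(mylist)                  # the list object and its pointer array only
element_bytes = sum(sys.getsizeof(x) for x in mylist)  # the int objects the pointers refer to
print('list total :', (pointer_bytes + element_bytes) / 1024**2, 'MB')
print('numpy array:', vector.nbytes / 1024**2, 'MB')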