
I'm trying to understand the memory and other overhead implications that using NumPy arrays of dtype object would have compared to lists of lists.

Does this change with dimensionality? E.g. 2D vs. 3D vs. N-D.

Some of the benefits I can think of when using NumPy arrays are that you get things like .shape and .T, and that you can cast them as matrices with np.matrix much faster.

Is there anything else?

Also, if anyone is interested, the object I'm using is:

import gmpy2 as gm
gm.mpfr( '0' )                                           # <-- this is the object
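
For concreteness, here is a rough sketch of the two layouts I'm comparing (assuming gmpy2 and numpy are both installed; the shape and values are just placeholders):

import gmpy2 as gm
import numpy as np

# list of lists of mpfr objects
lol = [[gm.mpfr('0') for _ in range(3)] for _ in range(3)]

# the same data as a numpy array of dtype object
arr = np.array(lol, dtype=object)

print(arr.shape)    # (3, 3) -- .shape comes for free
print(arr.T[0, 1])  # .T is a cheap view, even for dtype=object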

EDIT:

Just to clarify, I'm interested in the case where the NumPy array's dtype is object, not a native NumPy type.

EDIT 2:

Relevant follow-up with regard to speed:

What are the benefits / drawbacks of a list of lists compared to a numpy array of OBJECTS with regards to SPEED?

evan54
  • [This question](http://stackoverflow.com/q/993984/3923281) and its answers may be relevant... – Alex Riley Nov 05 '14 at 21:45
  • indeed. though one thing that worries me is the fact that numpy is faster for floating points but what happens when you have objects instead of floating points? Does a numpy array dumb down to a list of lists? – evan54 Nov 05 '14 at 21:47
  • This seems like too broad of a question. There are a number of specific questions buried in here, like comparing the memory usage of, say, a 10x10x10x10, 100x100, and 10000 array of int32 vs. the equivalent nested list of ints, but each one deserves its own question. – abarnert Nov 05 '14 at 21:55
  • Hm... ok so should I try to break it into a memory question as you said and a speed question? also, the int32 case is a native numpy type so I'm not sure if it's relevant to the question. The main point of the question is what happens when arrays are of dtype `object` – evan54 Nov 05 '14 at 21:57
  • @evan54: Yeah, making this just about memory, and just about `object`, makes it a good question. – abarnert Nov 05 '14 at 21:58
  • ok I'll modify it to only address memory issues and `object` issues only here. And I'll post a corresponding speed question – evan54 Nov 05 '14 at 21:59
  • posted both questions. I think this should be fine. Please edit if you deem appropriate – evan54 Nov 05 '14 at 22:04
  • Very good point - `object` will quite probably mess things up, since it cannot know how much memory to allocate for each of them. So it will have to store pointers to these objects, and that is effectively a list as implemented by python. The difference to a list of lists is that at least these references would probably be contiguous in memory, whereas each list in the list of lists can start wherever it wants. +1 for the question, keen to know the answer – eickenberg Nov 05 '14 at 22:05
  • @eickenberg thanks for that. do you think you can evolve your comment into an answer? – evan54 Nov 05 '14 at 22:06
  • Wouldn't venture that far, I'm only making guesses. Hoping somebody with in depth knowledge on this will show up and answer :) – eickenberg Nov 05 '14 at 22:07

1 Answer


I'm going to answer your primary question, and leave the others (performance of transpose, etc.) out. So:

I'm trying to understand the memory and other overhead implications that using NumPy arrays would have … Just to clarify, I'm interested in the case where the NumPy array type is object, not a float, double or int

A Python list is an array of pointers to Python objects wrapping whatever actual values you've stored in it, plus some extra slack to allow it to be expanded on the fly efficiently. Let's call that slack 20%, just for the sake of easy computation. For example, a list of 10000 32-bit integers takes up, say, 96000 bytes for the pointer array, plus around 240000 bytes for the Python integer objects, plus a small overhead for the list object itself, say 80 bytes.

A NumPy array is an array of whatever actual values you've stored in it. For example, an array of 10000 32-bit integers takes up 40000 bytes, plus a small overhead for the array itself, say 80 bytes. But when you're using dtype object, each "actual value" is just a pointer to a Python object, just as with a list.

So, the only real difference here is the slack: The array is going to use 320080 bytes, while the list is going to use 336080 bytes. Not a huge difference, but it can matter.
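
If you want to sanity-check this yourself, here's a rough sketch using sys.getsizeof (exact numbers depend on your Python and NumPy versions and your platform, so treat them as illustrative):

import sys
import numpy as np

n = 10000
lst = [i for i in range(n)]        # built by appending, so it typically carries slack
arr = np.array(lst, dtype=object)  # one pointer per element, no slack

# Both structures point at the same 10000 int objects; these calls measure
# only the containers, not the ints themselves.
print(sys.getsizeof(lst))  # pointer array with slack, plus the list header
print(arr.nbytes)          # 10000 * 8 = 80000 bytes of pointers on a 64-bit build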


Also, does one become faster than the other in 2D vs ND or with the size along a given dimension?

Yes, the nested list's overhead will grow faster… but not by a huge amount.

A multidimensional array in numpy is stored as one giant array (in either C or Fortran striding order), so whether the shape is (10000,), (100, 100), or (10, 10, 10, 10), it's the same size. (The overhead may get a few bytes bigger to store more information about the striding, but if we're talking about, say, 256 bytes vs. 80 out of 320K, who cares?)

A nested list, on the other hand, has more lists, with slack and overhead at every level. For example, a list of 10 lists of 10 lists of 10 lists of integers has 1+10+100+1000 arrays of 12 pointers, and 1+10+100+1000 list headers.

So, the array is still using 320080 bytes, or maybe 320256, but the list is using 435536.
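
A quick way to see that difference empirically (again, exact byte counts vary by platform and version, so this is only a sketch):

import sys
import numpy as np

def container_bytes(obj):
    # Sum getsizeof over every list in a nested structure; the leaf objects
    # (the actual values) are deliberately not counted.
    if isinstance(obj, list):
        return sys.getsizeof(obj) + sum(container_bytes(x) for x in obj)
    return 0

nested = [[[[0] * 10 for _ in range(10)] for _ in range(10)] for _ in range(10)]
arr = np.empty((10, 10, 10, 10), dtype=object)  # one flat buffer of pointers

print(container_bytes(nested))  # 1111 list headers plus their pointer arrays
print(arr.nbytes)               # 10000 * 8 = 80000, regardless of shape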


If you want to know more about how list is implemented… well, that depends on the implementation you're using. But in CPython, the C API pretty much guarantees that it's going to be storing a contiguous array of PyObject *, and the fact that appending is amortized constant time pretty much requires that it leave slack that grows proportionately. And you can see in the headers and the source that this is exactly what it does. (Also, keep in mind that the specific sizes you get from that source will generally depend on the platform you compile it on. Most importantly, because there are pointers all over the place, 64-bit platforms tend to have somewhere between 50-100% more overhead per object for most objects than 32-bit platforms.)
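
And if you just want to watch the slack appear without reading the C source, something like this works on CPython (the growth pattern differs between versions, so don't read too much into the exact sizes):

import sys

lst = []
last = sys.getsizeof(lst)
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:  # the size only jumps when CPython over-allocates a new block
        print(len(lst), size)
        last = size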

abarnert
  • This is very interesting and along the lines of what I thought may be the case, +1. I'd be interested in exact values for the slack size and overhead size in general for lists / object arrays. Any pointers to where I can read up on that? – eickenberg Nov 05 '14 at 22:11
  • @eickenberg: I've edited the answer. Since it's not documented anywhere, you need to pick a Python implementation and version that you care about and read the source. I've linked to the source for the trunk version of CPython. – abarnert Nov 05 '14 at 22:12