
In my previous question, I learned to resize a subclassed ndarray in place. Neat. Unfortunately, that no longer works when the array that I am trying to resize is the result of a computation:

import numpy as np

class Foo(np.ndarray):
    def __new__(cls,shape,dtype=np.float32,buffer=None,offset=0,
                strides=None,order=None):
        return np.ndarray.__new__(cls,shape,dtype,buffer,offset,strides,order)

    def __array_prepare__(self,output,context):
        print output.flags['OWNDATA'],"PREPARE",type(output)
        return np.ndarray.__array_prepare__(self,output,context)

    def __array_wrap__(self,output,context=None):
        print output.flags['OWNDATA'],"WRAP",type(output)
        return np.ndarray.__array_wrap__(self,output,context)

a = Foo((32,))
#resizing a is no problem
a.resize((24,),refcheck=False)

b = Foo((32,))
c = Foo((32,))

d = b+c
#Cannot resize `d`
d.resize((24,),refcheck=False)

The exact output (including traceback) is:

True PREPARE <type 'numpy.ndarray'>
False WRAP <class '__main__.Foo'>
Traceback (most recent call last):
  File "test.py", line 26, in <module>
    d.resize((24,),refcheck=False)
ValueError: cannot resize this array: it does not own its data

I think this is because numpy creates a new ndarray and passes it to __array_prepare__. At some point along the way, though, the "output" array seems to get view-cast to my Foo type, although the docs don't seem to be 100% clear/accurate on this point. In any event, after the view casting, the output no longer owns its data, making it impossible to resize in place (as far as I can tell).
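The ownership split can be seen directly. A minimal sketch (using a bare subclass `Bar` for brevity rather than the `Foo` above, and assuming numpy still wraps ufunc outputs as views of a base-class array):

```python
import numpy as np

# Minimal subclass, just enough to trigger numpy's wrapping machinery.
class Bar(np.ndarray):
    pass

b = np.zeros(8, dtype=np.float32).view(Bar)
c = np.zeros(8, dtype=np.float32).view(Bar)

d = b + c
print(type(d).__name__)          # Bar
print(d.flags['OWNDATA'])        # False: d is a view
print(d.base.flags['OWNDATA'])   # True: the plain-ndarray base owns the memory
```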

Is there any way, via some sort of numpy voodoo (__array_prepare__, __array__, etc.), to transfer ownership of the data to the instance of my subclass?

mgilson
  • On further searching, this user seems to be [asking the same thing](http://stackoverflow.com/questions/8708758/can-i-force-a-numpy-ndarray-to-take-ownership-of-its-memory), although from a different perspective... It's possible that there could be an answer that works in my situation which wouldn't work there ... – mgilson Mar 15 '13 at 03:38
  • It's certainly a view. The base `ndarray` is `d.base`, which owns the data. As a kludge you could resize `d.base` and then reassign `d = d.base.view(Foo)`. – Eryk Sun Mar 15 '13 at 04:19
  • eryksun -- sounds promising, except then I would possibly lose the other attributes that I've added to my field in the meantime. Post that as an answer anyway though. It'll give me something to experiment with tomorrow. I'm a bit too tired to attempt anything more than mindless commenting of my code at this point. – mgilson Mar 15 '13 at 04:30
  • It may help, but it doesn't answer the question. I could force it with ctypes by tweaking the flags and base, but that's ugly and fragile. – Eryk Sun Mar 15 '13 at 04:36
  • Yeah, that seemed similar to what another user posted via `cython` on the other answer. What happens if I resize `d.base` and don't reassign? Presumably that's a really bad idea? – mgilson Mar 15 '13 at 04:39
  • To resize you need to use `refcheck=False`. This invalidates the current `Foo` view, so you need a new one. – Eryk Sun Mar 15 '13 at 04:41
  • @eryksun -- That's what I figured. – mgilson Mar 15 '13 at 04:51
  • You might be able to avoid the need to resize if you use [numexpr](http://code.google.com/p/numexpr/). (See the section entitled ["Why It Works"](http://code.google.com/p/numexpr/#Why_It_Works) to see how memory is managed to avoid large temporary arrays.) – unutbu Mar 15 '13 at 20:59
  • @unutbu -- Yes. That one was definitely in the back of my mind. In fact, I'm already `eval` ing strings (with about a million safety precautions -- please don't yell at me). `numexpr` would seem to be a better way of handling that and it is on the roadmap in my head. – mgilson Mar 15 '13 at 21:01
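eryksun's kludge from the comments can be sketched as follows (hypothetical `Bar` subclass; fragile, because the resize invalidates the old view, and any attributes set on the old `Foo` instance are lost with it):

```python
import numpy as np

class Bar(np.ndarray):
    pass

b = np.arange(32, dtype=np.float32).view(Bar)
c = np.arange(32, dtype=np.float32).view(Bar)
d = b + c                 # a Bar view; d.base is the owning plain ndarray

base = d.base
base.resize((24,), refcheck=False)  # resize the owner; the old view `d` is now stale
d = base.view(Bar)                  # re-view the resized base as Bar
print(d.shape)                      # (24,)
```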

3 Answers


It is hardly a satisfactory answer, but it doesn't fit into a comment either... You can work around the data-ownership problem by using the ufunc's out parameter. A silly example:

>>> a = Foo((5,))
>>> b = Foo((5,))
>>> c = a + b # BAD
True PREPARE <type 'numpy.ndarray'>
False WRAP <class '__main__.Foo'>
>>> c.flags.owndata
False

>>> c = Foo((5,))
>>> c[:] = a + b # BETTER
True PREPARE <type 'numpy.ndarray'>
False WRAP <class '__main__.Foo'>
>>> c.flags.owndata
True

>>> np.add(a, b, out=c) # BEST
True PREPARE <class '__main__.Foo'>
True WRAP <class '__main__.Foo'>
Foo([  1.37754085e-38,   1.68450356e-20,   6.91042737e-37,
         1.74735556e-04,   1.48018885e+29], dtype=float32)
>>> c.flags.owndata
True

I think the output above is consistent with c[:] = a + b owning its data at the cost of copying the result into c from a temporary array. No such copy should happen when you use the out parameter.

Since you were already worried about intermediate storage in your mathematical expressions, it may not be such a bad thing to micro-manage how it is handled. That is, replacing

g = a + b + np.sqrt(d*d + e*e + f*f)

with

g = foo_like(d) # you'll need to write this function!
np.multiply(d, d, out=g)
g += e * e
g += f * f
np.sqrt(g, out=g)
g += b
g += a

may save you some intermediate memory, and it lets you own your data. It does throw the "readability counts" mantra out the window, but...
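A minimal sketch of the `foo_like` helper the snippet assumes (the name and behavior are my guess at what's needed: a freshly allocated, owning `Foo` with matching shape and dtype):

```python
import numpy as np

class Foo(np.ndarray):
    def __new__(cls, shape, dtype=np.float32, buffer=None, offset=0,
                strides=None, order=None):
        return np.ndarray.__new__(cls, shape, dtype, buffer, offset,
                                  strides, order)

def foo_like(arr):
    # Freshly allocated (uninitialized) Foo: owns its data, unlike a view.
    return Foo(arr.shape, dtype=arr.dtype)

d = Foo((5,))
g = foo_like(d)
print(g.flags['OWNDATA'])  # True
```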

Jaime
  • Yeah, I realized that you could use `out`. In my case, that's not the greatest if you want to do `a + b * sqrt(c) **4` or whatever. (writing that all out as function calls isn't fun). But, I had never thought about using a slice assignment to make sure that the LHS owned the data. That's a nice touch. – mgilson Mar 15 '13 at 20:58
  • @mgilson It is just a copy in hiding, though. – Jaime Mar 15 '13 at 21:51

At some point along the way though, it seems that the "output" array gets view-casted to my Foo type

Yes, ndarray.__array_prepare__ calls output.view, which returns an array which does not own its data.
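The effect of `view` on ownership is easy to demonstrate:

```python
import numpy as np

a = np.zeros(4)
v = a.view()

print(a.flags['OWNDATA'])  # True: `a` owns its memory
print(v.flags['OWNDATA'])  # False: `v` is a window onto `a`'s memory
print(v.base is a)         # True
```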

I experimented a bit and couldn't find an easy way around that.

While I agree this behavior is not ideal, at least in your use case, I would claim it is acceptable for d to not own its data. Numpy uses views extensively, and if you insist on avoiding views when working with numpy arrays, you're making your life very hard.

I would also claim that, based on my experience, resize should generally be avoided. It has a hacky feel and is hard to use correctly: you've already hit one of the two classic errors it raises ("cannot resize this array: it does not own its data"; the other is "cannot resize an array that has been referenced"). (Another problem is described in this question.) If you avoid resize, you should have no problem working with the view you're given.
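Both classic resize errors are easy to reproduce:

```python
import numpy as np

# Error 1: resizing a view, which does not own its data.
a = np.arange(10)
v = a[::2]
try:
    v.resize((3,))
except ValueError as e:
    print(e)  # cannot resize this array: it does not own its data

# Error 2: resizing an array that has other references (refcheck fails).
b = np.arange(10)
alias = b
try:
    b.resize((3,))
except ValueError:
    print("refused: the array has been referenced elsewhere")
```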

Since your decision to use resize comes from an answer to your other question, I'll post the rest of my answer there.

shx2
  • I fully agree, the mere fact that there is a `refcheck=False` is big time playing with fire. There seems to be a myth that copying is slow, but "slow" is relative and unless you carefully timed it to be a real difference and the code is very specific... don't... – seberg Mar 15 '13 at 12:09
  • @seberg -- For the time being, I am copying the data at the end of the operation -- But copying for every `ufunc` seems like a bad idea. – mgilson Mar 15 '13 at 12:46
  • @mgilson, and why do you need that in the first place, why do you have to shrink the data *after* the ufunc so much that it is worth to even consider resize? And if you really need only a tiny part, you can copy out that small part without a serious performance penalty. – seberg Mar 15 '13 at 12:55
  • @seberg -- I have something like `a + b + sqrt(d*d + e*e + f*f) ...` This operation already makes a bunch of temporary arrays, I'd like to not make more. Every copy I make puts me that much closer to hitting swap (which seems feasible for the large arrays I'm working with). Each field is ~512 Mb. I'm explicitly storing at least 6 which puts me at 3Gb memory before I've even tried to do a math operation with them. If I can resize the arrays in my datafile reader (which is my plan), I shouldn't have other views of the array floating around to be a big problem anyway. – mgilson Mar 15 '13 at 13:02
  • @seberg or others If you are interested in discussing my program design (I'd love to get input), I can be found hanging around the python chat room pretty frequently: http://chat.stackoverflow.com/rooms/6/python (Sorry -- this is the correct link now) – mgilson Mar 15 '13 at 13:05

How about:

def resize(arr, shape):
    # NB: np.require does not modify `arr` in place; it returns a
    # (possibly copied) array. As written, this line is a no-op unless
    # `arr` already owns its data -- otherwise its result must be
    # captured, as in `d = np.require(d, requirements=['OWNDATA'])` below.
    np.require(arr, requirements=['OWNDATA'])
    arr.resize(shape, refcheck=False)

It seems to succeed at resizing (and reducing memory consumption):

import array
import numpy as np
import time

class Foo(np.ndarray):
    def __new__(cls, shape, dtype=np.float32, buffer=None, offset=0,
                strides=None, order=None):
        return np.ndarray.__new__(cls, shape, dtype, buffer, offset, strides, order)

    def __array_prepare__(self, output, context):
        print(output.flags['OWNDATA'], "PREPARE", type(output))
        return np.ndarray.__array_prepare__(self, output, context)

    def __array_wrap__(self, output, context=None):
        print(output.flags['OWNDATA'], "WRAP", type(output))
        output = np.ndarray.__array_wrap__(self, output, context)
        return output

def free_memory():
    """
    Return free memory available, including buffer and cached memory
    """
    total = 0
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(field) for field in ('MemFree', 'Buffers', 'Cached')):
                field, amount, unit = line.split()
                amount = int(amount)
                if unit != 'kB':
                    raise ValueError(
                        'Unknown unit {u!r} in /proc/meminfo'.format(u=unit))
                total += amount
    return total


def gen_change_in_memory():
    """
    http://stackoverflow.com/a/14446011/190597 (unutbu)
    """
    f = free_memory()
    diff = 0
    while True:
        yield diff
        f2 = free_memory()
        diff = f - f2
        f = f2
change_in_memory = gen_change_in_memory().__next__  # use .next on Python 2

def resize(arr, shape):
    print(change_in_memory())
    # 0
    np.require(arr, requirements=['OWNDATA'])

    time.sleep(1)
    print(change_in_memory())
    # 200

    arr.resize(shape, refcheck=False)

N = 10000000
b = Foo((N,), buffer = array.array('f',range(N)))
c = Foo((N,), buffer = array.array('f',range(N)))

yields

print(change_in_memory())
# 0

d = b+c
d = np.require(d, requirements=['OWNDATA'])

print(change_in_memory())
# 39136

resize(d, (24,))   # Increases memory by 200 KiB
time.sleep(1)
print(change_in_memory())
# -39116
unutbu
  • The notes in the docs of `np.require` say: "The returned array will be guaranteed to have the listed requirements **by making a copy if needed.**" What do you think is happening, other than a copy, to make this work? I would tend to think that in the end this is the same as `d = d.copy()`, although it doesn't fully make sense either. – Jaime Mar 15 '13 at 14:16
  • @Jaime: In the situation above, where `d = b+c`, I don't understand why `d` does not own its data. If you understand why, I'd really appreciate an explanation. It appears at least in cases such as this, `np.require` succeeds without copying the data as evidenced by the increase in free memory. – unutbu Mar 15 '13 at 17:59
  • @unutbu, no it does copy the data, but it drops the reference to the original, so it also frees the same amount again. It does not own its data since it is based on a base class ndarray, that may be a flaw, but... – seberg Mar 15 '13 at 20:21
  • @seberg: Thanks for the info. I am puzzled however. I added some `print(change_in_memory())` statements around `np.require(...)` and it seems to show only 200 KiB of memory being consumed. If the array were being copied, wouldn't the memory consumption be much more? – unutbu Mar 15 '13 at 20:41
  • only if you keep a reference to the original `d` around. (which in turn keeps the reference to `d.base` which holds the data) – seberg Mar 15 '13 at 22:11
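seberg's point can be checked directly: `np.require` never modifies its argument in place; when the requirement isn't met it returns a new, owning copy, and the original memory is only freed once the old references are dropped. (This is also why the bare `np.require(arr, ...)` call inside the `resize` helper above has no effect on `arr` itself.)

```python
import numpy as np

a = np.arange(10)
v = a[:8]                                    # a view: does not own its data
w = np.require(v, requirements=['OWNDATA'])  # returns an owning copy

print(v.flags['OWNDATA'])   # False: the original view is unchanged
print(w.flags['OWNDATA'])   # True
print((w == v).all())       # True: same values, freshly owned memory
```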