
Consider the following code:

import numpy as np
a = np.zeros(50)
a[10:20:2] = 1
b = c = a[10:40:4]
print(b.flags)  # you'll see that b and c are neither C_CONTIGUOUS nor F_CONTIGUOUS

My question:

Is there a way (with only a reference to b) to make both b and c contiguous? It is completely fine if np.may_share_memory(b,a) returns False after this operation.

Things which come close but don't quite work: np.ascontiguousarray/np.asfortranarray, since they return a new array rather than fixing b in place.
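
For example, a minimal sketch of why that falls short: the contiguous copy that np.ascontiguousarray returns only rebinds the one name you assign it to, so c still views the big array and keeps it alive:

import numpy as np

a = np.zeros(50)
b = c = a[10:40:4]

b = np.ascontiguousarray(b)         # b is now a contiguous copy
print(b.flags['C_CONTIGUOUS'])      # True
print(c.flags['C_CONTIGUOUS'])      # False -- c is still the old strided view
print(np.may_share_memory(c, a))    # True  -- so a cannot be garbage collected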


My use case is that I have very large 3D fields stored in a subclass of a numpy.ndarray. In order to save memory, I would like to chop those fields down to the portion of the domain that I am actually interested in processing:

a = a[ix1:ix2,iy1:iy2,iz1:iz2]

Slicing for the subclass is somewhat more restricted than slicing of ndarray objects, but this should work and it will "do the right thing" -- the various custom meta-data attached to the subclass will be transformed/preserved as expected. Unfortunately, since this returns a view, numpy won't free the big array afterward, so I don't actually save any memory here.
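
To illustrate the problem, here is a toy-sized sketch (not my real data or sizes):

import numpy as np

big = np.zeros((100, 100, 100))     # stand-in for one large 3D field (~8 MB)
sub = big[10:20, 10:20, 10:20]      # the region I actually want to keep
del big                             # does NOT free the buffer:
print(sub.base.nbytes)              # 8000000 -- the view keeps the whole array alive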

To be completely clear, I'm looking to accomplish 2 things:

  • preserve the metadata on my class instance. Slicing will work, but I'm not sure about other forms of copying.
  • make it so that the original array is free to be garbage collected
mgilson
  • What do you mean by in-place? `b`'s underlying data is non-contiguous. To make it contiguous, you'd need to allocate new memory. That wouldn't be in-place. – unutbu Mar 15 '13 at 00:57
  • @unutbu -- I understand that. See my update on the use case. Maybe that will make it more clear. Ultimately, if you have a class instance `a` with attribute `b`, you can change `a` "in-place" (`a.b = 3`) even though of course `a.b` now points to an entirely different memory location – mgilson Mar 15 '13 at 00:59
  • Interesting question! I actually have a very similar use case, but I never tried to optimize the memory usage as much as I should have. – Joe Kington Mar 15 '13 at 01:25
  • @JoeKington -- I started downloading some of the datafiles I want to work with and I realized they're ~5 GB each. That's when I decided I should probably start being a little more conservative :) -- I'm actually trying to look at how `np.copy` is implemented to see if it makes any effort to preserve the meta-data. There's a bunch of stuff with `__array_finalize__` and family which might be useful too, but I had previously ruled them out for other reasons (though I'm starting to reconsider) – mgilson Mar 15 '13 at 01:27
  • What exactly is metadata? – unutbu Mar 15 '13 at 01:36
  • extra attributes. One example is that my subclass holds an instance of the grid of the computational domain, so slicing the array also slices the grid appropriately. The fields also hold a `fieldname` attribute, which is what makes messing with `__array_finalize__` a little messy (e.g. `bx*bx` should not result in an array with a fieldname of `bx`; it should be `bx**2`). Right now I'm thinking about giving my class instances a `_dict` which gets written to whenever `__setattr__` is called. That way, I can write my own copy method to transfer the meta-data to the new object. – mgilson Mar 15 '13 at 01:39
  • 1
  • @JoeKington -- It turns out that you can work around this for fields which own their own data using `np.resize`. See answer by `unutbu` – mgilson Mar 15 '13 at 03:01

3 Answers


According to Alex Martelli:

"The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates."

However, the following appears to free at least some of the memory. (Warning: my way of measuring free memory reads /proc/meminfo, so it is Linux-specific.)

import time
import numpy as np

def free_memory():
    """
    Return free memory available, including buffer and cached memory
    """
    total = 0
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(field) for field in ('MemFree', 'Buffers', 'Cached')):
                field, amount, unit = line.split()
                amount = int(amount)
                if unit != 'kB':
                    raise ValueError(
                        'Unknown unit {u!r} in /proc/meminfo'.format(u=unit))
                total += amount
    return total

def gen_change_in_memory():
    """
    https://stackoverflow.com/a/14446011/190597 (unutbu)
    """
    f = free_memory()
    diff = 0
    while True:
        yield diff
        f2 = free_memory()
        diff = f - f2
        f = f2
change_in_memory = gen_change_in_memory().__next__  # on Python 2: .next

Before allocating the large array:

print(change_in_memory())
# 0

a = np.zeros(500000)
a[10:20:2] = 1
b = c = a[10:40:4]

After allocating the large array:

print(change_in_memory())
# 3844 # KiB

a[:len(b)] = b                  # copy the strided slice to the front of a's buffer
b = a[:len(b)]                  # re-bind b to a contiguous view of that front
a.resize(len(b), refcheck=0)    # shrink a in place, skipping the reference check
time.sleep(1)

Free memory increases after resizing:

print(change_in_memory())
# -3708 # KiB
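
(A quick sanity check, not part of the original answer: np.ndarray.resize may move the buffer when it reallocates, so re-take the slice from a rather than trusting the old view held by b:)

b = a[:len(b)]                   # re-bind b to the shrunken array
print(b.flags['C_CONTIGUOUS'])   # True
print(a.size, a.nbytes)          # 8 64 -- only the slice's worth of data remains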
unutbu
  • I don't really care if the memory gets freed back to the OS ... I just want to make sure that it's free for python/numpy to use for other fields which I can read from the file. – mgilson Mar 15 '13 at 01:41
  • But, `resize` looks like it just might be the thing I'm looking for ... I might be able to work with this one ... – mgilson Mar 15 '13 at 01:44
  • WooHoo!!! This works :) -- It's still failing for fields which are computed from other fields, complaining that the field doesn't own its own data, blah, blah, blah. But that can be the topic of another question. – mgilson Mar 15 '13 at 03:00

You can do this in Cython:

In [1]:
%load_ext cythonmagic

In [2]:
%%cython
cimport numpy as np

np.import_array()

def to_c_contiguous(np.ndarray a):
    cdef np.ndarray new
    cdef int dim, i
    new = a.copy()                        # make a C-contiguous copy of a's data
    dim = np.PyArray_NDIM(new)
    for i in range(dim):                  # overwrite a's strides with the copy's strides
        np.PyArray_STRIDES(a)[i] = np.PyArray_STRIDES(new)[i]
    a.data = new.data                     # point a at the copy's buffer
    np.PyArray_UpdateFlags(a, np.NPY_C_CONTIGUOUS)
    np.set_array_base(a, new)             # keep the copy alive as a's base

In [8]:
import sys
import numpy as np
a = np.random.rand(10, 10, 10)
b = c = a[::2, 1::3, 2::4]
d = a[::2, 1::3, 2::4]
print(sys.getrefcount(a))
to_c_contiguous(b)
print(sys.getrefcount(a))
print(np.all(b == d))

The output is:

4
3
True

to_c_contiguous(a) creates a C-contiguous copy of a and installs that copy as the base of a.

After the call to to_c_contiguous(b), the refcount of the original a is decreased, and when that refcount reaches 0, its memory will be freed.
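
(A quick check, not part of the original answer: because b and c name the same ndarray object, fixing b fixes c as well, and once the remaining view d and the name a are dropped the original buffer can be collected:)

print(b is c)                    # True -- same object, so c is contiguous too
print(c.flags['C_CONTIGUOUS'])   # True
del a, d                         # drops the last references to the original array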

HYRY

I would claim the correct way to accomplish the two things you listed is to np.copy the slices you create.

Of course, in order for that to work correctly, you would need to define an appropriate __array_finalize__. You weren't very clear about why you decided to avoid it in the first place, but my feeling is that you should define it. (How did you solve the bx**2 problem without using __array_finalize__?)
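
For instance, here is a minimal sketch of such a subclass; the Field class, the grid attribute, and the values are invented for illustration, not taken from the question:

import numpy as np

class Field(np.ndarray):
    """Toy ndarray subclass carrying a `grid` attribute through slices and copies."""
    def __new__(cls, input_array, grid=None):
        obj = np.asarray(input_array).view(cls)
        obj.grid = grid
        return obj

    def __array_finalize__(self, obj):
        # Called on explicit construction, view casting, and new-from-template
        # (slicing, .copy(), arithmetic); propagate the metadata when the source has it.
        if obj is not None:
            self.grid = getattr(obj, 'grid', None)

f = Field(np.arange(20.0), grid='my-grid')
g = f[5:15].copy()    # a contiguous copy of the slice that owns its data
print(type(g).__name__, g.grid, g.flags['OWNDATA'])   # Field my-grid True
# Note: np.copy(f) defaults to subok=False and would drop the subclass;
# use the .copy() method or np.copy(f, subok=True) to keep it.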

shx2
  • Maybe I should ask a SO question about that one as well ... I suppose I don't fully understand the call sequence/API between `__array_wrap__` (which I define) and `__array_finalize__` and what they receive as input arguments in each case. It seems to me that one or the other should be called, not both... – mgilson Mar 15 '13 at 12:52
  • As far as making a copy is concerned, if I have other references to the array hanging around, then the memory won't be free for python to reuse -- I'd just be making my memory problems worse! If I do the operation in-place, then I mess up other objects which have "views" into this array, but other references to my array will see the resize too. In my mind, this second alternative is preferable to the first. – mgilson Mar 15 '13 at 12:55
  • So could you have other objects with views into this array? If you do, **don't** use refcheck=0; you are risking a seg-fault. – shx2 Mar 15 '13 at 13:07