I am struggling to work with large numpy arrays. Here is the scenario: I am working with 300MB - 950MB images and using GDAL to read them as numpy arrays. Reading in the array uses exactly as much memory as one would expect, i.e. about 250MB for a 250MB image, and so on.
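Roughly, the reading step looks like this (a simplified sketch; the file name is a placeholder and I read one band at a time):

import numpy
from osgeo import gdal

dataset = gdal.Open('large_image.tif')   # placeholder name; real files are 300MB - 950MB
band = dataset.GetRasterBand(1)
input_array = band.ReadAsArray()         # 2D ndarray, roughly the size of the image on disk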
My problem occurs when I use numpy to get the mean, min, max, or standard deviation. In main() I open the image and read the array (type ndarray). I then call the following function, to get the standard deviation, on a 2D array:
def get_array_std(input_array):
    array_standard_deviation = numpy.std(input_array, copy=False)
    return array_standard_deviation
This call is where I consistently hit memory errors (on a machine with 6GB of RAM). From the documentation it looks like numpy returns an ndarray with the same shape and dtype as my input, which would double the in-memory size.
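One way I could check how much extra memory the call itself uses is to watch the process's peak RSS around it (a rough sketch using the standard-library resource module, which I believe reports ru_maxrss in kilobytes on Linux; the dummy array here just stands in for a real image):

import resource
import numpy

def peak_rss_mb():
    # Peak resident set size so far; ru_maxrss is in kilobytes on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

input_array = numpy.ones((10000, 10000), dtype=numpy.float32)   # ~380MB stand-in
print(peak_rss_mb())
array_standard_deviation = numpy.std(input_array)
print(peak_rss_mb())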
Using:
print type(array_standard_deviation)
Returns:
numpy.float64
Additionally, using:
print array_standard_deviation
Returns a single float standard deviation, as one would expect. Is numpy reading the array in again to perform this calculation? Would I be better off iterating through the array and manually performing the calculation(s)? How about working with a flattened array?
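For reference, this is roughly what I mean by iterating through the array manually: read the band in blocks of rows and accumulate running sums (the block size is arbitrary, and I have not checked whether the sum-of-squares formula is numerically stable enough for my data):

import numpy
from osgeo import gdal

def blockwise_stats(path, block_rows=256):
    # Only block_rows * XSize values are held in memory at any one time
    dataset = gdal.Open(path)
    band = dataset.GetRasterBand(1)
    xsize, ysize = band.XSize, band.YSize
    count, total, total_sq = 0, 0.0, 0.0
    minimum, maximum = None, None
    for yoff in range(0, ysize, block_rows):
        rows = min(block_rows, ysize - yoff)
        block = band.ReadAsArray(0, yoff, xsize, rows).astype(numpy.float64)
        count += block.size
        total += block.sum()
        total_sq += numpy.square(block).sum()
        minimum = block.min() if minimum is None else min(minimum, block.min())
        maximum = block.max() if maximum is None else max(maximum, block.max())
    mean = total / count
    std = numpy.sqrt(total_sq / count - mean * mean)
    return minimum, maximum, mean, std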
I have tried placing each statistic call (numpy.amin(), numpy.amax(), numpy.std(), numpy.mean()) into its own function so that the large array would go out of scope, but no luck there. I have also tried casting the return value to another type, but no joy.
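By separate functions I mean nothing more elaborate than this, mirroring get_array_std() above:

def get_array_min(input_array):
    return numpy.amin(input_array)

def get_array_max(input_array):
    return numpy.amax(input_array)

def get_array_mean(input_array):
    return numpy.mean(input_array)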