
I need to handle some large numpy arrays in my project. Loading one such array from disk consumes over half of my computer's memory.

After the array is loaded, I take several slices of it (selecting almost half of the array), and then I get an error telling me that there is insufficient memory.

A small experiment suggests that I get the error because slicing a numpy array creates a copy:

import numpy as np

tmp = np.linspace(1, 100, 100)
inds = list(range(100))
tmp_slice = tmp[inds]

assert id(tmp) == id(tmp_slice)

This raises an AssertionError.

Is there a way to make a slice of a numpy array refer only to the memory of the original array, so that the data entries are not copied?

meTchaikovsky
  • Your `id` test just compares two different python objects. It does not compare their element storage. You may need to read more about basic numpy array layout. – hpaulj Sep 17 '19 at 11:47
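To illustrate the point in the comment above, here is a minimal sketch (the variable names `view` and `picked` are mine, not from the question). It shows that `id` differs even when no data is copied, and that `np.shares_memory` is one way to check whether two arrays use the same element storage.

```python
import numpy as np

tmp = np.linspace(1, 100, 100)

view = tmp[:50]                  # basic slice
picked = tmp[list(range(50))]    # indexing with a list

# id() compares the Python wrapper objects, which are distinct in both cases.
print(id(tmp) == id(view))       # False
print(id(tmp) == id(picked))     # False

# np.shares_memory looks at the underlying data buffers instead.
print(np.shares_memory(tmp, view))    # True: no data copied
print(np.shares_memory(tmp, picked))  # False: data copied
```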

2 Answers


In Python, slice is a well-defined class with start, stop, and step values. It is used when we index a list with alist[1:10:2], which makes a new list with copies of the pointers from the original. In numpy, slices are used in basic indexing, e.g. arr[:3, -3:]. This creates a view of the original. The view shares the data buffer, but has its own shape and strides.
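As a small sketch of what "shares the data buffer, but has its own shape and strides" looks like in practice (the 4x5 float64 array is just an arbitrary example):

```python
import numpy as np

arr = np.zeros((4, 5))       # float64, C-contiguous
view = arr[:3, -3:]          # basic indexing with two slices

print(view.shape)                   # (3, 3)
print(arr.strides, view.strides)    # same byte steps, different shape
print(np.shares_memory(arr, view))  # True

# Writing through the view changes the original buffer.
view[0, 0] = 7.0
print(arr[0, 2])                    # 7.0

# A stepped slice is still a view; only the strides change.
stepped = arr[::2]
print(stepped.strides)              # row stride is doubled
```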

But when we index arrays with lists, arrays or boolean arrays (mask), it has to make a copy, an array with its own data buffer. The selection of elements is too complex or irregular to express in terms of the shape and strides attributes.
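A quick way to see this for the three kinds of index mentioned above (a sketch with a small example array; the names are illustrative):

```python
import numpy as np

arr = np.arange(10.0)

by_list = arr[[0, 2, 4]]               # list index
by_array = arr[np.array([0, 2, 4])]    # integer array index
mask = arr % 2 == 0
by_mask = arr[mask]                    # boolean mask

# None of the results share storage with arr.
for result in (by_list, by_array, by_mask):
    print(np.shares_memory(arr, result))   # False each time

# Changing the copy leaves the original untouched.
by_mask[0] = -1.0
print(arr[0])                          # 0.0
```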

In some cases the index array is small (compared to the original) and the copy is also small. But if we are permuting the whole array, then the index array and the copy will both be as large as the original.
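The sizes can be checked with `nbytes`. This is only a sketch: the array length and the use of `np.random.default_rng` to build a permutation are my own choices for illustration.

```python
import numpy as np

arr = np.zeros(1_000_000)            # float64, 8 bytes per element

# Small index array -> small copy.
few = np.array([0, 10, 20])
small_copy = arr[few]
print(small_copy.nbytes)             # 24

# Permuting the whole array -> index array and copy are both full size.
perm = np.random.default_rng(0).permutation(arr.size)
shuffled = arr[perm]
print(perm.size, shuffled.nbytes, arr.nbytes)
```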

hpaulj

Reading through this, this, and this, I think your problem is the use of advanced indexing. To reiterate one of the answers, the numpy docs clearly state that

Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).

So instead of doing:

inds = list(range(100))
tmp_slice = tmp[inds]

you should rather use:

tmp_slice = tmp[:100]

This will result in a view rather than a copy. You can notice the difference by trying:

tmp[0] = 5

In the first case tmp_slice[0] will still be 1.0, but in the second it will be 5.0.
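Putting both cases side by side (a sketch; `copied` and `viewed` are just names I picked to keep the two results apart):

```python
import numpy as np

tmp = np.linspace(1, 100, 100)

copied = tmp[list(range(100))]   # first case: list index
viewed = tmp[:100]               # second case: basic slice

tmp[0] = 5

print(copied[0])   # 1.0, the copy did not see the change
print(viewed[0])   # 5.0, the view reads from tmp's buffer
```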

gstukelj
  • Thank you for your answer! The `inds` in the example is only for demonstrating, in the real project, I have to use advanced slicing because the mask is something like `[True, False, True, False]` in which the selected `inds` are not consecutive. Is there another way? – meTchaikovsky Sep 17 '19 at 08:32
  • I think that as soon as you use masking this is advanced indexing and hence necessitates copying data. Basically accessing the memory non-contiguously will inevitably result in a copy. – gstukelj Sep 17 '19 at 08:33
  • Maybe try with pandas? https://stackoverflow.com/questions/33103988/pandas-best-way-to-subset-a-dataframe-inplace-using-a-mask – gstukelj Sep 17 '19 at 08:37
  • No, I can't unless I refactor the code, since all the masks are generated before slicing, so I can't do anything to the original array until all masks are applied. – meTchaikovsky Sep 17 '19 at 09:01
  • A boolean mask is just as big as the original array. – hpaulj Sep 17 '19 at 11:41
  • @hpaulj that's a good point -- maybe try with np.where or a simple for loop instead of having masks generated beforehand? – gstukelj Sep 17 '19 at 12:46
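One possible reading of the loop suggestion is to apply the mask chunk by chunk, so that only a small selection is copied at any time. The sketch below is not from the thread: `masked_sum` is a hypothetical helper, and it assumes a 1-D array and a downstream computation (here a sum) that can be done piecewise. Note that indices obtained from `np.where` would still trigger advanced indexing and hence a copy.

```python
import numpy as np

def masked_sum(arr, mask, chunk=1_000_000):
    """Sum arr[mask] without materialising the full selection at once.

    Hypothetical helper; assumes arr and mask are 1-D and the same length.
    """
    total = 0.0
    for start in range(0, arr.size, chunk):
        stop = start + chunk
        # arr[start:stop] and mask[start:stop] are views (basic slicing);
        # only the per-chunk selection is copied.
        total += arr[start:stop][mask[start:stop]].sum()
    return total

arr = np.arange(10.0)
mask = arr % 2 == 0
print(masked_sum(arr, mask, chunk=4))   # 20.0
```

The peak extra memory is then bounded by the chunk size rather than by the number of selected elements.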