10

I have a large dataset in a matrix x and I need to analyze some of its submatrices.

I am using the following code to select the submatrix:

>>> import numpy as np
>>> x = np.random.normal(0,1,(20,2))
>>> x
array([[-1.03266826,  0.04646684],
       [ 0.05898304,  0.31834926],
       [-0.1916809 , -0.97929025],
       [-0.48837085, -0.62295003],
       [-0.50731017,  0.50305894],
       [ 0.06457385, -0.10670002],
       [-0.72573604,  1.10026385],
       [-0.90893845,  0.99827162],
       [ 0.20714399, -0.56965615],
       [ 0.8041371 ,  0.21910274],
       [-0.65882317,  0.2657183 ],
       [-1.1214074 , -0.39886425],
       [ 0.0784783 , -0.21630006],
       [-0.91802557, -0.20178683],
       [ 0.88268539, -0.66470235],
       [-0.03652459,  1.49798484],
       [ 1.76329838, -0.26554555],
       [-0.97546845, -2.41823586],
       [ 0.32335103, -1.35091711],
       [-0.12981597,  0.27591674]])
>>> index = x[:,1] > 0
>>> index
array([ True,  True, False, False,  True, False,  True,  True, False,
        True,  True, False, False, False, False,  True, False, False,
       False,  True], dtype=bool)
>>> x1 = x[index, :] #x1 is a copy of the submatrix
>>> x1
array([[-1.03266826,  0.04646684],
       [ 0.05898304,  0.31834926],
       [-0.50731017,  0.50305894],
       [-0.72573604,  1.10026385],
       [-0.90893845,  0.99827162],
       [ 0.8041371 ,  0.21910274],
       [-0.65882317,  0.2657183 ],
       [-0.03652459,  1.49798484],
       [-0.12981597,  0.27591674]])
>>> x1[0,0] = 1000
>>> x1
array([[  1.00000000e+03,   4.64668400e-02],
       [  5.89830401e-02,   3.18349259e-01],
       [ -5.07310170e-01,   5.03058935e-01],
       [ -7.25736045e-01,   1.10026385e+00],
       [ -9.08938455e-01,   9.98271624e-01],
       [  8.04137104e-01,   2.19102741e-01],
       [ -6.58823174e-01,   2.65718300e-01],
       [ -3.65245877e-02,   1.49798484e+00],
       [ -1.29815968e-01,   2.75916735e-01]])
>>> x
array([[-1.03266826,  0.04646684],
       [ 0.05898304,  0.31834926],
       [-0.1916809 , -0.97929025],
       [-0.48837085, -0.62295003],
       [-0.50731017,  0.50305894],
       [ 0.06457385, -0.10670002],
       [-0.72573604,  1.10026385],
       [-0.90893845,  0.99827162],
       [ 0.20714399, -0.56965615],
       [ 0.8041371 ,  0.21910274],
       [-0.65882317,  0.2657183 ],
       [-1.1214074 , -0.39886425],
       [ 0.0784783 , -0.21630006],
       [-0.91802557, -0.20178683],
       [ 0.88268539, -0.66470235],
       [-0.03652459,  1.49798484],
       [ 1.76329838, -0.26554555],
       [-0.97546845, -2.41823586],
       [ 0.32335103, -1.35091711],
       [-0.12981597,  0.27591674]])
>>> 

but I would like x1 to be only a pointer or something similar. Copying the data every time I need a submatrix is too expensive for me. How can I do that?

EDIT: Apparently there is no solution with NumPy arrays. Are pandas data frames better from this point of view?

itzy
Donbeo
  • You would like `x1` to be a [*view*](http://docs.scipy.org/doc/numpy-1.6.0/glossary.html#term-view) of `x`, but this is not possible with [*advanced indexing*](http://docs.scipy.org/doc/numpy-1.6.0/reference/arrays.indexing.html#advanced-indexing). The numpy manual is pretty clear about that in the [section about indexing](http://docs.scipy.org/doc/numpy-1.6.0/reference/arrays.indexing.html#advanced-indexing). – Stefano M May 14 '15 at 13:33
  • In a comment you state that you are recursively passing the data to a function. Why not leave the data alone in a global and pass the index on the stack? – Stefano M May 14 '15 at 13:55
  • I think that this will still require advanced indexing, which is equivalent to copying the data. – Donbeo May 14 '15 at 15:05
  • My impression is that this problem can be solved by storing the data in a pandas DataFrame, but I am not sure about that, and in any case I do not know how. – Donbeo May 14 '15 at 15:06

3 Answers

3

Since index is an array of type bool, you are doing advanced indexing. And the docs say: "Advanced indexing always returns a copy of the data."

This makes a lot of sense. With normal indexing, a view only needs to know the start, stop and step. Advanced indexing can pick arbitrary elements from the original array, with no such simple rule. A view would therefore have to carry extra metadata recording where each referenced index points, and that metadata might use more memory than a copy.
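To make the difference concrete, here is a minimal sketch that compares a basic slice with a boolean selection. The small array and the names `view` and `selected` are made up for illustration; `np.shares_memory` is used to check whether two arrays overlap in memory.

```python
import numpy as np

# Small illustrative array; rows are [-4,-3], [-2,-1], [0,1], [2,3], [4,5]
x = np.arange(10, dtype=float).reshape(5, 2) - 4.0

view = x[1:4]               # basic slicing: start/stop/step is enough, so a view
selected = x[x[:, 1] > 0]   # boolean (advanced) indexing: a new array

shares_view = np.shares_memory(x, view)          # the slice reuses x's buffer
shares_selected = np.shares_memory(x, selected)  # the selection does not

view[0, 0] = 1000.0       # writes through to x[1, 0]
selected[0, 0] = -1000.0  # only changes the copy; x[2, 0] is untouched
```

After running this, `x[1, 0]` holds the value written through the slice, while the write to `selected` leaves `x` unchanged.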

Mike Müller
  • what if I use `index = np.argwhere(x[:,1]>0).ravel()` ? This is not boolean any more – Donbeo May 14 '15 at 13:39
  • it will still be advanced indexing, unfortunately – paddyg May 14 '15 at 13:42
  • `ndarray`'s data structure is ideal for fast linear algebra operations on them. This means that you need a constant stride from one element to the next (and this is possible with basic slicing). With advanced indexing you have a non constant stride from one element to the next one. – Stefano M May 14 '15 at 13:50
3

The information for your array x is summarized in the .__array_interface__ property:

In [433]: x.__array_interface__
Out[433]: 
{'descr': [('', '<f8')],
 'strides': None,
 'data': (171396104, False),
 'typestr': '<f8',
 'version': 3,
 'shape': (20, 2)}

It has the array shape, strides (default here), and pointer to the data buffer. A view can point to the same data buffer (possibly further along), and have its own shape and strides.
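As a rough illustration of "further along, with its own shape and strides", the sketch below compares the buffer address and strides of two basic slices. It assumes a C-contiguous float64 array; the variable names are hypothetical.

```python
import numpy as np

x = np.zeros((20, 2))   # float64, C-contiguous: strides are (16, 8) bytes

tail = x[3:8]           # same buffer, starting three rows further along
every_other = x[::2]    # same buffer, but a doubled row stride

# Address of the first element of each array, taken from the interface dict
base_addr = x.__array_interface__['data'][0]
tail_addr = tail.__array_interface__['data'][0]
offset_bytes = tail_addr - base_addr

tail_shape = tail.shape
step_strides = every_other.strides
```

Here `offset_bytes` equals three row strides of `x`, and `every_other` only differs from `x` in its shape and strides, which is why no data needs to be copied.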

But indexing with your boolean array can't be summarized in those few numbers. Either the result has to carry the index array all the way through, or it has to copy the selected items from the x data buffer. numpy chooses to copy. You have the choice of when to apply the index: now, or further down the calling stack.
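One way to apply the index "further down the calling stack" is to pass the full array together with an array of row numbers, and narrow the row numbers at each step. This is only a sketch: `refine` and `column_mean` are hypothetical helpers, and each call still makes a temporary copy of the selected entries of one column, just not of the whole submatrix.

```python
import numpy as np

def refine(x, rows, col, threshold):
    """Return the subset of `rows` whose entry in column `col` exceeds `threshold`."""
    # x[rows, col] is a small temporary copy of one column's selected entries
    return rows[x[rows, col] > threshold]

def column_mean(x, rows, col):
    """Mean of column `col` over the selected rows, indexing only at the point of use."""
    return x[rows, col].mean()

# Rows are [-4,-3], [-2,-1], [0,1], [2,3], [4,5]
x = np.arange(10, dtype=float).reshape(5, 2) - 4.0

rows = np.arange(x.shape[0])     # start with every row
rows = refine(x, rows, 1, 0.0)   # keep rows where column 1 > 0
rows = refine(x, rows, 0, 1.0)   # of those, keep rows where column 0 > 1
mean_col0 = column_mean(x, rows, 0)
```

The array `x` itself is never duplicated; only the (usually much smaller) integer index array is passed from call to call.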

hpaulj
1

If you can manage with a traditional slice such as

x1 = x[3:8]

then it will be just a pointer (a view onto x), not a copy.

Have you looked at using masked arrays? You might be able to do exactly what you want.

x = np.array([[0.12, 0.23],
              [1.23, 3.32],
               ...
              [0.75, 1.23]])

data = np.array([[False, False],
                 [True, True],
                ...
                 [True, True]])

x1 = np.ma.array(x, mask=data)
## x1 can be worked on and only includes elements of x where data==False
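A runnable variant of the idea above, as a sketch with made-up data: the mask is built from the question's condition on column 1, and `np.shares_memory` checks that the masked array reuses the buffer of `x` (which relies on `copy=False` being the default of `np.ma.array`). Note that the boolean mask is itself a new array with the same shape as `x`, so this is not free.

```python
import numpy as np

# Rows are [-4,-3], [-2,-1], [0,1], [2,3], [4,5]
x = np.arange(10, dtype=float).reshape(5, 2) - 4.0

# True means "hide this element": mask every row where column 1 is not positive
row_mask = x[:, 1] <= 0
mask = np.repeat(row_mask[:, None], x.shape[1], axis=1)

x1 = np.ma.array(x, mask=mask)   # data is referenced, not copied

shares_data = np.shares_memory(x1.data, x)
col_means = x1.mean(axis=0)      # computed over the unmasked rows only
```

Reductions such as `mean` skip the masked entries, so `col_means` matches what the boolean selection `x[x[:, 1] > 0]` would give.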
paddyg
  • I am recursively passing submatrices of the data to a function that does not modify them. This is why I would like to pass only a view or something similar. – Donbeo May 14 '15 at 13:41
  • The point is that `x[index, :]` *is* a copy. Even if you do not store a reference to it, it is already eating up your memory. – Stefano M May 14 '15 at 13:42
  • yes, and I suspect masking will use a bit of copying and maintenance of indexes etc. – paddyg May 14 '15 at 14:09