2

Background: The data I'm using is being extracted from a netCDF4 object, which creates a numpy masked array at initialization, but does not appear to support the numpy reshape() method, making it only possible to reshape after all the data has been copied = way too slow.

Question: How can I sub-sample a 1-D array, that is basically a flattened 2-D array, without reshaping it?

import numpy

a1 = np.array([[1,2,3,4],
               [11,22,33,44],
               [111,222,333,444],
               [1111,2222,3333,4444],
               [11111,22222,33333,44444]])

a2 = np.ravel(a1)

rows, cols = a1.shape

row1 = 1
row2 = 3

col1 = 1
col2 = 3

I would like to use a fast slicing method that doesn't require reshaping the 1-D array to a 2-D array.

Desired Output:

np.ravel(a1[row1:row2, col1:col2])

>> array([ 22,  33, 222, 333])

I got as far as getting the start and ending positions, but this just selects ALL data between these points (i.e. extra columns).

idx_start = (row1 * cols) + col1
idx_end   = (row2 * cols) + col2

Update: I just tried Jaime's brilliant answer, but it appears that netCDF4 won't allow for 2-D indices.

z = dataset.variables["z"][idx]
  File "netCDF4.pyx", line 2613, in netCDF4.Variable.__getitem__ (netCDF4.c:29583)
  File "/usr/local/lib/python2.7/dist-packages/netCDF4_utils.py", line 141, in _StartCountStride
    raise IndexError("Index cannot be multidimensional.")
IndexError: Index cannot be multidimensional.
Community
  • 1
  • 1
ryanjdillon
  • 17,658
  • 9
  • 85
  • 110

3 Answers3

1

You can get what you want with a combination of np.ogrid and np.ravel_multi_index:

>>> a1
array([    1,     2,     3,     4,    11,    22,    33,    44,   111,
         222,   333,   444,  1111,  2222,  3333,  4444, 11111, 22222,
       33333, 44444])
>>> idx = np.ravel_multi_index((np.ogrid[1:3,1:3]), (5, 4))
>>> a1[idx]
array([[ 22,  33],
       [222, 333]])

You could of course ravel this array to get a 1D return if that's what you are after. Notice also that this is a copy of your original data, not a view.


EDIT You can keep the same general approach, replacing np.ogrid with np.mgrid and reshaping it to get a flat return:

>>> idx = np.ravel_multi_index((np.mgrid[1:3,1:3].reshape(2, -1)), (5, 4))
>>> a1[idx]
array([ 22,  33, 222, 333])
Jaime
  • 65,696
  • 17
  • 124
  • 159
  • 1-D return doesn't matter as much as taking the slice from its original 1-D dimension. Checking this out now. Looks perfect. Thanks! – ryanjdillon Apr 26 '13 at 15:51
  • Should the `np.ravel_multi_index` `dims` of `(4,4)` not be `(5,4)` here? – ryanjdillon Apr 26 '13 at 15:59
  • @shootingstars Yes, my bad, I counted wrong, have edited the answer. – Jaime Apr 26 '13 at 16:05
  • Thanks. After trying this (awesome solution), It appears `netCDF4` doesn't like 2-D indices. Any suggestions? I've added the error to my question. – ryanjdillon Apr 26 '13 at 16:08
  • Just tried that and got a `Killed`. It may be something I did, but I'll have to recheck in the morning. Thanks for the help Jaime! – ryanjdillon Apr 26 '13 at 17:22
  • It looks like this method is too resource intensive given the size of the dataset that I am extracting. I've verified I'm doing things correctly, and it is still giving me a `killed` process. – ryanjdillon Apr 29 '13 at 11:20
0

I came up with this, and though it doesn't copy ALL of the data, it is still copying data that I don't want into memory. This can probably be improved and I hope there is a better solution out there.

zi = 0 
# Create zero array with the appropriate length for the data subset
z = np.zeros((col2 - col1) * (row2 - row1))
# Process number of rows for which data is being extracted
for i in range(row2 - row1):
    # Pull row, then desired elements of that row into buffer
    tmp = ((dataset.variables["z"][(i*cols):((i*cols)+cols)])[col1:col2])
    # Add each item in buffer sequentially to data array
    for j in tmp:
        z[zi] = j 
        # Keep a count of what index position the next data point goes to
        zi += 1
ryanjdillon
  • 17,658
  • 9
  • 85
  • 110
0

Here a lean proposition

a1 = np.array([[1,2,3,4],
               [11,22,33,44],
               [111,222,333,444],
               [1111,2222,3333,4444],
               [11111,22222,33333,44444]])

row1 = 1; row2 = 3; ix = slice(row1,row2)
col1 = 1; col2 = 3; iy = slice(col1,col2)
n = (row2-row1)*(col2-col1)

print(a1[ix,iy]);    print()
print(a1[ix,iy].reshape(1,n))
.
[[ 22  33]
 [222 333]]

[[ 22  33 222 333]]

reshape in Python is not expensive, and slice is fast.

pyano
  • 1,885
  • 10
  • 28