
I have a 2D NumPy array, and I would like to select ranges of different sizes from it, depending on the column index. Here is an example with the input array a = np.reshape(np.array(range(15)), (5, 3)):

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]]

Then, the list b = [4, 3, 1] determines the range size for each column slice, so that we would get the arrays

[0 3 6 9]
[1 4 7]
[2]

which we can concatenate and flatten to get the final desired output

[0 3 6 9 1 4 7 2]

Currently, to perform this task, I am using the following code

slices = []
for i in range(a.shape[1]):
    # take the first b[i] rows of column i
    slices.append(a[:b[i], i])

c = np.concatenate(slices)

and, if possible, I would like to convert it to a more Pythonic, vectorized form.

Bonus: The same question, but now considering that b determines row slices instead of column slices.


1 Answer


We can use broadcasting to generate an appropriate mask, and then boolean indexing with that mask does the job -

In [150]: a
Out[150]: 
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [151]: b
Out[151]: [4, 3, 1]

In [152]: mask = np.arange(len(a))[:,None] < b

In [153]: a.T[mask.T]
Out[153]: array([0, 3, 6, 9, 1, 4, 7, 2])
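
To see why the transposes are needed: np.arange(len(a))[:,None] has shape (5, 1) and broadcasts against b (length 3) into a (5, 3) boolean mask in which column j keeps its first b[j] rows. Boolean indexing flattens in row-major (C) order, so we index a.T with mask.T to walk column by column. Inspecting the intermediate mask from the same session:

In [154]: mask
Out[154]: 
array([[ True,  True,  True],
       [ True,  True, False],
       [ True,  True, False],
       [ True, False, False],
       [False, False, False]])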

Another way to mask would be -

In [156]: a.T[np.greater.outer(b, np.arange(len(a)))]
Out[156]: array([0, 3, 6, 9, 1, 4, 7, 2])
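
The two masks are the same array built transposed: np.greater.outer(b, np.arange(len(a)))[j, i] evaluates b[j] > i, which is exactly mask.T[j, i] from the first approach. A quick sanity check:

In [157]: np.array_equal(np.greater.outer(b, np.arange(len(a))), mask.T)
Out[157]: True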

Bonus: Slice per row

If we are instead required to slice per row based on chunk sizes, we would need to modify a few things -

In [51]: a
Out[51]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

# slice lengths per row
In [52]: b
Out[52]: [4, 3, 1]

# Usual loop based solution :
In [53]: np.concatenate([a[i,:b_i] for i,b_i in enumerate(b)])
Out[53]: array([ 0,  1,  2,  3,  5,  6,  7, 10])

# Vectorized mask based solution :
In [54]: a[np.greater.outer(b, np.arange(a.shape[1]))]
Out[54]: array([ 0,  1,  2,  3,  5,  6,  7, 10])
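
For completeness, here are both cases wrapped into reusable functions — a minimal sketch; the names take_per_column and take_per_row are my own, not from the answer:

import numpy as np

def take_per_column(a, b):
    # keep the first b[j] rows of each column j, then concatenate
    mask = np.arange(a.shape[0])[:, None] < np.asarray(b)
    # boolean indexing flattens in row-major (C) order, so transpose
    # both the array and the mask to walk column by column
    return a.T[mask.T]

def take_per_row(a, b):
    # keep the first b[i] elements of each row i, then concatenate
    mask = np.greater.outer(b, np.arange(a.shape[1]))
    return a[mask]

a = np.arange(15).reshape(5, 3)
print(take_per_column(a, [4, 3, 1]))              # [0 3 6 9 1 4 7 2]
print(take_per_row(a.reshape(3, 5), [4, 3, 1]))   # [ 0  1  2  3  5  6  7 10]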
  • This solution was very clever, much obliged! Also, this was very enlightening, as I did not know this broadcasting concept, and it seems very useful. As a side note, I just verified that for the second approach, the outer method uses an internal product, so it seems it would be a little slower, is that right? For massive datasets, do you think the difference in speed would be significant? – xicocaio Aug 12 '20 at 17:34
  • @xicocaio Second one avoids the transpose, but transpose won't copy. So, in the end, I would think that these two should be comparable. – Divakar Aug 12 '20 at 17:57
  • Ok, but isn't the `outer` product of the two vectors expensive for very big arrays? – xicocaio Aug 12 '20 at 18:05
• [152] is an 'outer' comparison too. The total number of comparisons is the same. The difference is more a matter of syntax than actual 'work'. – hpaulj Aug 12 '20 at 18:25
• @hpaulj Ok, so I broke down and investigated the first approach, and I think I understand what you are saying. The idea is that `< b` will implicitly broadcast the arrays and execute the comparison elementwise, right? – xicocaio Aug 12 '20 at 18:54
  • 1
    first approach: `3.13 µs ± 20.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)` second approach: `3.72 µs ± 421 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)` Not the greatest test cause I used the `5x3` array. But the std. dev. suggests the first approach would scale better. – Prox Oct 29 '22 at 22:52
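
As a follow-up to the timing discussion, one way to re-run the comparison at scale — a sketch, with an arbitrary 4000×3000 shape and random slice lengths of my choosing:

import numpy as np
from timeit import timeit

a = np.arange(4000 * 3000).reshape(4000, 3000)
b = np.random.default_rng(0).integers(0, len(a), size=a.shape[1])

# first approach: broadcasted comparison, then transpose and index
t1 = timeit(lambda: a.T[(np.arange(len(a))[:, None] < b).T], number=10)
# second approach: np.greater.outer builds the transposed mask directly
t2 = timeit(lambda: a.T[np.greater.outer(b, np.arange(len(a)))], number=10)
print(t1, t2)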