8

Suppose I have this numpy array:

[[1, 2, 3, 4],
 [5, 6, 7, 8],
 [9, 10, 11, 12],
 [13, 14, 15, 16]]

My goal is to select two random elements from each row and create a new numpy array that might look something like:

[[2, 4],
 [5, 8],
 [9, 10],
 [15, 16]]

I can easily do this using a for loop. However, is there a way that I can use broadcasting, say, with np.random.choice, to avoid having to loop through each row?

user4793385

2 Answers

10

Approach #1

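Based on this trick (argsorting an array of random values gives each row an independent random permutation of its column indices), here's a vectorized way -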

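import numpy as np

n = 2  # number of elements to select per row
# a is the 2D input array; keep the first n columns of a per-row random permutation
idx = np.random.rand(*a.shape).argsort(1)[:, :n]
out = np.take_along_axis(a, idx, axis=1)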

Sample run -

In [251]: a
Out[251]: 
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]])

In [252]: idx = np.random.rand(*a.shape).argsort(1)[:,:2]

In [253]: np.take_along_axis(a, idx, axis=1)
Out[253]: 
array([[ 2,  1],
       [ 6,  7],
       [ 9, 11],
       [16, 15]])
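
As a variant of the same idea, and not part of the timings in this answer, here is a minimal sketch that assumes NumPy 1.20 or later, where `Generator.permuted` is available. It shuffles each row independently and keeps the first `n` columns, so it swaps the argsort trick for a direct per-row shuffle. The names `rng` and the 4x4 example array are chosen here for illustration.

```python
import numpy as np

a = np.arange(1, 17).reshape(4, 4)
n = 2  # number of elements to select per row
rng = np.random.default_rng()

# permuted returns a shuffled copy; with axis=1 every row is shuffled
# independently of the others, so the first n columns are n distinct picks
out = rng.permuted(a, axis=1)[:, :n]
print(out)
```

Since this shuffles the values themselves rather than building an index array, it may be convenient when the selected column positions are not needed afterwards.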

Approach #2

Another approach, based on masks, that selects exactly two elements per row -

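def select_two_per_row(a):
    m, n = a.shape
    mask = np.zeros((m, n), dtype=bool)
    R = np.arange(m)

    # First pick: one random column per row
    idx1 = np.random.randint(0, n, m)
    mask[R, idx1] = True

    # Second pick: one of the n-1 remaining columns per row, as flat offsets
    mask2 = np.zeros(m * (n - 1), dtype=bool)
    idx2 = np.random.randint(0, n - 1, m) + np.arange(m) * (n - 1)
    mask2[idx2] = True
    mask[~mask] = mask2  # fill only the still-unselected positions
    out = a[mask].reshape(-1, 2)
    return out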
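
To make the mask step easier to follow, here is a small trace of the same logic with fixed picks instead of random ones. The values in `idx1` and `pick2` and the 3x4 array are assumptions chosen only so the result is reproducible.

```python
import numpy as np

a = np.arange(1, 13).reshape(3, 4)
m, n = a.shape

# Fixed picks instead of random ones, purely so the trace is reproducible
idx1 = np.array([0, 2, 3])   # first pick: a column for each row
pick2 = np.array([0, 2, 1])  # second pick: position among the n-1 remaining columns

mask = np.zeros((m, n), dtype=bool)
mask[np.arange(m), idx1] = True

# ~mask has exactly n-1 True entries per row, visited in row-major order, so a
# flat array of length m*(n-1) lines up with "the remaining columns of each row"
mask2 = np.zeros(m * (n - 1), dtype=bool)
mask2[pick2 + np.arange(m) * (n - 1)] = True
mask[~mask] = mask2

print(mask.astype(int))
# [[1 1 0 0]
#  [0 0 1 1]
#  [0 1 0 1]]
print(a[mask].reshape(-1, 2))
# [[ 1  2]
#  [ 7  8]
#  [10 12]]
```

Because boolean indexing reads the array in row-major order, the two picks in each output row always appear in column order with this approach.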

Approach #3

Another approach, again using integer indexing, that selects exactly two elements per row -

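def select_two_per_row_v2(a):
    m, n = a.shape
    idx1 = np.random.randint(0, n, m)  # first column, 0..n-1
    idx2 = np.random.randint(1, n, m)  # nonzero offset, 1..n-1
    # idx1 - idx2 may be negative; negative indices wrap around, so the
    # second column always differs from the first
    out = np.take_along_axis(a, np.c_[idx1, idx1 - idx2], axis=1)
    return out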
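
The indexing step can be traced with fixed values in place of the random draws. The particular `idx1` and `idx2` values are assumptions picked for illustration, as is the 4x4 example array.

```python
import numpy as np

a = np.arange(1, 17).reshape(4, 4)

# Fixed values standing in for the two randint draws
idx1 = np.array([0, 1, 2, 3])  # first column per row
idx2 = np.array([1, 3, 2, 1])  # offset per row, never 0

cols = np.c_[idx1, idx1 - idx2]
print(cols)
# [[ 0 -1]
#  [ 1 -2]
#  [ 2  0]
#  [ 3  2]]

# Negative column indices count from the end of the row
out = np.take_along_axis(a, cols, axis=1)
print(out)
# [[ 1  4]
#  [ 6  7]
#  [11  9]
#  [16 15]]
```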

Timings -

In [209]: a = np.random.rand(100000,10)

# App1 with argsort
In [210]: %%timeit
     ...: idx = np.random.rand(*a.shape).argsort(1)[:,:2]
     ...: out = np.take_along_axis(a, idx, axis=1)
23.2 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# App1 with argpartition
In [221]: %%timeit
     ...: idx = np.random.rand(*a.shape).argpartition(axis=1,kth=1)[:,:2]
     ...: out = np.take_along_axis(a, idx, axis=1)
18.3 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [214]: %timeit select_two_per_row(a)
9.89 ms ± 37.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [215]: %timeit select_two_per_row_v2(a)
5.78 ms ± 9.19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Divakar
  • I think in approach 3 you can save on a modulo with `(idx2 - idx1)` or so. – Mad Physicist Sep 29 '20 at 20:59
  • @MadPhysicist Yup, good point. Got some improvement there. Thanks! – Divakar Sep 29 '20 at 21:07
  • @Divakar is the distribution of `idx1 - idx2` in approach 3 uniform? I suspect it might follow a different distribution than uniform random. – Ehsan Sep 30 '20 at 01:59
  • @Ehsan Well idx1 has uniform probability of selecting any one of the n elements per row. Then idx2 has the same among the remaining n-1 elements per row. So, I think it is good. What makes you suspect otherwise? – Divakar Sep 30 '20 at 05:35
  • @Divakar When `idx1` and `idx2` are both uniform, their difference `idx1-idx2` is not uniform anymore. It would be the convolution of two uniforms, which looks like a triangular distribution, hence the selection is non-uniform, if I understand correctly. – Ehsan Sep 30 '20 at 06:18
  • @Ehsan You would probably know more about distributions. My idea is a simplistic one: make sure all combinations are covered per row with equal probability. And I see it's covered. Here's some data - https://textuploader.com/1pl44 – Divakar Sep 30 '20 at 12:35
  • @Divakar I see, the negative indices overlap with the positive ones, and the two halves of the triangle overlap to make a final uniform distribution. Very good and interesting point. Thank you. Although the selection per row will not be sorted, I guess. Not sure if that is a requirement of the OP. – Ehsan Oct 01 '20 at 01:42
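
The uniformity question raised in these comments can also be checked by exact enumeration rather than sampling. The sketch below assumes `n = 4` columns and lists every equally likely `(idx1, idx2)` draw from approach 3, then counts which ordered pair of columns each draw selects.

```python
from collections import Counter

n = 4  # number of columns, chosen for illustration
counts = Counter()

for idx1 in range(n):          # np.random.randint(0, n) is uniform over these
    for idx2 in range(1, n):   # np.random.randint(1, n) is uniform over these
        second = (idx1 - idx2) % n  # column a possibly negative index resolves to
        counts[(idx1, second)] += 1

# Each of the n*(n-1) ordered pairs of distinct columns is hit exactly once,
# so equally likely draws give equally likely pairs
print(len(counts), set(counts.values()))
# 12 {1}
```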
1

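You can use numpy's apply_along_axis (it still calls the function once per row internally, so it is a convenience rather than a truly vectorized solution):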

import numpy as np
x = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12],  [13, 14, 15, 16]])
print(np.apply_along_axis(np.random.choice, axis=1, arr=x, size=2))

Output:

[[ 4  1]
 [ 5  6]
 [10 12]
 [14 16]]
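
One thing to watch for: `np.random.choice` samples with replacement by default, so the call above can return the same element twice for a row. If the two picks must be distinct, a minimal sketch is to forward `replace=False`, since `apply_along_axis` passes extra keyword arguments on to the function. Using a `Generator` from `np.random.default_rng()` here is a choice made for this example, not something the original answer uses.

```python
import numpy as np

x = np.arange(1, 17).reshape(4, 4)
rng = np.random.default_rng()

# size and replace are forwarded to rng.choice for every row
out = np.apply_along_axis(rng.choice, 1, x, size=2, replace=False)
print(out)
```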
Ajay Verma