4

I have a numpy array of data where I need to keep only n highest values, and zero everything else.

My current solution:

import numpy as np
np.random.seed(30)

# keep only the n highest values
n = 3

# Simple 2x5 data field for this example, real life application will be exteremely large
data = np.random.random((2,5))
#[[ 0.64414354  0.38074849  0.66304791  0.16365073  0.96260781]
# [ 0.34666184  0.99175099  0.2350579   0.58569427  0.4066901 ]]


# find indices of the n highest values per row
idx = np.argsort(data)[:,-n:]
#[[0 2 4]
# [4 3 1]]


# put those values back in a blank array
data_ = np.zeros(data.shape) # blank slate
for i in xrange(data.shape[0]):
    data_[i,idx[i]] = data[i,idx[i]]

# Each row contains only the 3 highest values per row or the original data
#[[ 0.64414354  0.          0.66304791  0.          0.96260781]
# [ 0.          0.99175099  0.          0.58569427  0.4066901 ]]

In the code above, data_ has the n highest values and everything else is zeroed out. This works out nicely even if data.shape[1] is smaller than n. But the only issue is the for loop, which is slow because my actual use case is on very very large arrays.

Is it possible to get rid of the for loop?

Fnord
  • 5,365
  • 4
  • 31
  • 48

1 Answers1

7

You could act on the result of np.argsort -- np.argsort twice, the first to get the index order and the second to get the ranks -- in a vectorized fashion, and then use either np.where or simply multiplication to zero everything else:

In [116]: np.argsort(data)
Out[116]: 
array([[3, 1, 0, 2, 4],
       [2, 0, 4, 3, 1]])

In [117]: np.argsort(np.argsort(data))  # these are the ranks
Out[117]: 
array([[2, 1, 3, 0, 4],
       [1, 4, 0, 3, 2]])

In [118]: np.argsort(np.argsort(data)) >= data.shape[1] - 3
Out[118]: 
array([[ True, False,  True, False,  True],
       [False,  True, False,  True,  True]], dtype=bool)

In [119]: data * (np.argsort(np.argsort(data)) >= data.shape[1] - 3)
Out[119]: 
array([[ 0.64414354,  0.        ,  0.66304791,  0.        ,  0.96260781],
       [ 0.        ,  0.99175099,  0.        ,  0.58569427,  0.4066901 ]])

In [120]: np.where(np.argsort(np.argsort(data)) >= data.shape[1]-3, data, 0)
Out[120]: 
array([[ 0.64414354,  0.        ,  0.66304791,  0.        ,  0.96260781],
       [ 0.        ,  0.99175099,  0.        ,  0.58569427,  0.4066901 ]])
DSM
  • 342,061
  • 65
  • 592
  • 494
  • I edited my solution for clarity. Yours doesn't yield n highest values per row using the data above. Try to use your solution with the same data to see the difference. – Fnord Nov 04 '17 at 04:51
  • @Fnord: oops, forgot an extra `argsort`. It takes another argsort to get it to behave like rankdata (I'm used to just using .rank('dense') on a Series or DataFrame.) – DSM Nov 04 '17 at 05:09