
Let's say I have a simple 1D NumPy array:

x = np.random.rand(1000)

And I retrieve the sorted indices:

idx = np.argsort(x)

However, I need to move a list of indices to the front of idx. So, let's say indices = [10, 20, 30, 40, 50] need to always be the first 5 and then the rest will follow from idx (minus the indices found in indices)
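For a small concrete illustration of the desired behaviour (values chosen just for this example):

```python
import numpy as np

# Tiny illustration of the desired reordering
x = np.array([0.3, 0.1, 0.4, 0.2])
idx = np.argsort(x)         # [1, 3, 0, 2]
indices = np.array([0, 2])  # these must come first

# Desired output: indices first, then the rest of idx in sorted order
# -> [0, 2, 1, 3]
```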

A naive way to accomplish this would be:

indices = np.array([10, 20, 30, 40, 50])
out = np.empty(idx.shape[0], dtype=np.int64)
out[:indices.shape[0]] = indices

n = indices.shape[0]
for i in range(idx.shape[0]):
    if idx[i] not in indices:
        out[n] = idx[i] 
        n += 1

Is there a way to do this efficiently and, possibly, in-place?

slaw

2 Answers


You can build a mask indicating which elements of idx are contained in indices with np.in1d, and just concatenate both indexing arrays:

m = np.in1d(idx, indices)
out = np.r_[indices, idx[~m]]
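A quick sanity check on the original sample sizes (assertions are illustrative, not part of the answer):

```python
import numpy as np

x = np.random.rand(1000)
idx = np.argsort(x)
indices = np.array([10, 20, 30, 40, 50])

m = np.in1d(idx, indices)
out = np.r_[indices, idx[~m]]

# The forced indices lead, and out is still a permutation of all indices
assert np.array_equal(out[:5], indices)
assert np.array_equal(np.sort(out), np.arange(1000))
```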
yatu

Approach #1

One way would be with np.isin masking -

mask = np.isin(idx, indices, invert=True)
out = np.r_[indices, idx[mask]]

Approach #2 : Skipping the first argsort

Another way is to make the values at those given indices the minimum, thus forcing them to the start when argsorting. We don't need idx for this method, as we argsort within the solution anyway -

def argsort_constrained(x, indices):
    xc = x.copy()
    xc[indices] = x.min()-np.arange(len(indices),0,-1)
    return xc.argsort()
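The timings below call the two masking approaches wrapped as functions. Their exact definitions aren't shown in the answer; plausible versions (names and signatures assumed from the benchmark calls) would be:

```python
import numpy as np

def in1d_masking(x, idx, indices):
    # Masking approach with np.in1d (from the other answer)
    m = np.in1d(idx, indices)
    return np.r_[indices, idx[~m]]

def isin_masking(x, idx, indices):
    # Approach #1: np.isin with invert=True
    mask = np.isin(idx, indices, invert=True)
    return np.r_[indices, idx[mask]]
```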

Benchmarking - Closer look

Let's study how skipping the computation of the starting argsort idx helps us with the second approach.

We will start off with the given sample :

In [206]: x = np.random.rand(1000)

In [207]: indices = np.array([10, 20, 30, 40, 50])

In [208]: %timeit argsort_constrained(x, indices)
38.6 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [209]: idx = np.argsort(x)

In [211]: %timeit np.argsort(x)
27.7 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [212]: %timeit in1d_masking(x, idx, indices)
     ...: %timeit isin_masking(x, idx, indices)
44.4 µs ± 421 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
50.7 µs ± 303 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Note that if you use np.concatenate in place of np.r_ with these small datasets, you could do better.
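For instance, the `np.r_` call in Approach #1 can be swapped directly for `np.concatenate`, which skips `np.r_`'s string/slice parsing overhead on small inputs:

```python
import numpy as np

x = np.random.rand(1000)
idx = np.argsort(x)
indices = np.array([10, 20, 30, 40, 50])

mask = np.isin(idx, indices, invert=True)
# Same result as np.r_[indices, idx[mask]], with less dispatch overhead
out = np.concatenate((indices, idx[mask]))

assert np.array_equal(out[:5], indices)
```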

So, argsort_constrained has a total runtime cost of around 38.6 µs, whereas the two masking approaches incur the ~27.7 µs argsort on top of their individual timing numbers.

Let's scale up everything by 10x and do the same experiments :

In [213]: x = np.random.rand(10000)

In [214]: indices = np.sort(np.random.choice(len(x), 50, replace=False))

In [215]: %timeit argsort_constrained(x, indices)
740 µs ± 3.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [216]: idx = np.argsort(x)

In [217]: %timeit np.argsort(x)
731 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [218]: %timeit in1d_masking(x, idx, indices)
     ...: %timeit isin_masking(x, idx, indices)
1.07 ms ± 47.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.02 ms ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Again, the total runtime costs of the masking approaches are higher than that of argsort_constrained, and this trend should continue as we scale up further.

Divakar
  • I apologize in advance but I made a mistake. My input array is actually 2D and the output should actually be the sorted values from the array `x` and not the indices. I've updated my question with the details but let me know if it is preferable to post a brand new separate question. Sorry for the noise! – slaw May 21 '20 at 01:41
  • Yes probably post a new question, and leave this one as is, since both answers solve the question as was originally posed @slaw – yatu May 21 '20 at 11:09
  • @yatu Thank you. Done! https://stackoverflow.com/questions/61936423/rearrange-2d-numpy-array-efficiently – slaw May 21 '20 at 14:09