4

Let's say I have a 2D NumPy array:

x = np.random.rand(100, 100000)

And I retrieve the column-wise sorted indices (i.e., each column is sorted independently from the others and the indices are returned):

idx = np.argsort(x, axis=0) 

Then, for each column, I need the values from indices = [10, 20, 30, 40, 50] to be first the first 5 rows (of that column) and then followed by the rest of the sorted values (not the indices!).

A naive approach might be:

indices = np.array([10, 20, 30, 40, 50])
out = np.empty(x.shape, dtype=int64)

for col in range(x.shape[1]):
    # For each column, fill the first few rows with `indices`
    out[:indices.shape[0], col] = x[indices, col]  # Note that we want the values, not the indices

    # Then fill the rest of the rows in this column with the remaining sorted values excluding `indices`
    n = indices.shape[0]
    for row in range(indices.shape[0], x.shape[0]):
        if idx[row, col] not in indices:
            out[n, col] = x[row, col]  # Again, note that we want the value, not the index
            n += 1
slaw
  • 6,591
  • 16
  • 56
  • 109

4 Answers4

1

Approach #1

Here's one based on previous post that doesn't need idx -

xc = x.copy()
xc[indices] = (xc.min()-np.arange(len(indices),0,-1))[:,None]
out = np.take_along_axis(x,xc.argsort(0),axis=0)

Approach #2

Another with np.isin masking that uses idx -

mask = np.isin(idx, indices)
p2 = np.take_along_axis(x,idx.T[~mask.T].reshape(x.shape[1],-1).T,axis=0)
out = np.vstack((x[indices],p2))

Approach #2- Alternative If you are continously editing into out to change everything except those indices, an array-assignment might be the one for you -

n = len(indices)
out[:n] = x[indices]

mask = np.isin(idx, indices)
lower = np.take_along_axis(x,idx.T[~mask.T].reshape(x.shape[1],-1).T,axis=0)
out[n:] = lower
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Let's say I had to perform this multiple times (rather than once) but the size of `out` is the same in all of my iterations. Would it be "better" or more efficient to create an `out` array once with the appropriate size and then copy `x[indices]` and `p2` into it? This way, I can avoid the costly memory creation? – slaw May 21 '20 at 14:33
  • @slaw Think I am getting what you mean. See if `Approach #2- Alternative` works for you. Will edit `Approach #1` accordingly. – Divakar May 21 '20 at 14:38
  • Yes, I think that would work and avoids the `np.vstack`! I guess, technically, we could just do `out[n:] = np.take_along_axis(...)` and probably reuse the `mask` across the iterations as well. I'm going to read up on `take_long_axis` just so I understand what is going on there. – slaw May 21 '20 at 14:43
0

This should help you get started by eliminating the inner most loop and if condition. To start off, you can pass in x[:, col] as the input parameter x.

def custom_ordering(x, idx, indices):
    # First get only the desired indices at the top
    out = x[indices, :]

    # delete `indices` from `idx` so `idx` doesn't have the values in `indices`
    idx2 = np.delete(idx, indices)

    # select `idx2` rows and concatenate
    out = np.concatenate((out, x[idx2, :]), axis=0)

    return out
varagrawal
  • 2,909
  • 3
  • 26
  • 36
0

Here is my solution to the problem:

rem_indices = [_ for _ in range(x.shape[0]) if _ not in indices]    # get all remaining indices
xs = np.take_along_axis(x, idx, axis = 0)                                        # the sorted array
out = np.empty(x.shape)

out[:indices.size, :] = xs[indices, :]                                                  # insert specific values at the beginning
out[indices.size:, :] = xs[rem_indices, :]                                         # insert the remaining values after the previous

Tell me if I understood your problem correctly.

amzon-ex
  • 1,645
  • 1
  • 6
  • 28
0

I do this with a smaller array and fewer indices such that I can easily sanity check the results, but it should translate to your use case. I think this solution is decently efficient as everything is done in place.

import numpy as np
x = np.random.randint(10, size=(12,3)) 
indices = np.array([5,7,9])

# Swap top 3 rows with the rows 5,7,9 and vice versa
x[:len(indices)], x[indices] = x[indices], x[:len(indices)].copy()
# Sort the wanted portion of array
x[len(indices):].sort(axis=0) 

Here is with the output:

>>> import numpy as np
>>> x = np.random.randint(10, size=(10,3))
>>> indices = np.array([5,7,9])
>>> x
array([[7, 1, 8],
       [7, 4, 6],
       [6, 5, 2],
       [6, 8, 4],
       [2, 0, 2],
       [3, 0, 4],  # 5th row
       [4, 7, 4],
       [3, 1, 1],  # 7th row
       [3, 5, 3],
       [0, 5, 9]]) # 9th row

>>> # We want top of array to be
>>> x[indices]
array([[3, 0, 4],
       [3, 1, 1],
       [0, 5, 9]])

>>> # Swap top 3 rows with the rows 5,7,9 and vice versa
>>> x[:len(indices)], x[indices] = x[indices], x[:len(indices)].copy()
>>> # Assert that rows have been swapped correctly
>>> x
array([[3, 0, 4],  #
       [3, 1, 1],  # Top of array looks like above
       [0, 5, 9],  #
       [6, 8, 4],
       [2, 0, 2],
       [7, 1, 8],  # Previous top row
       [4, 7, 4],
       [7, 4, 6],  # Previous second row
       [3, 5, 3],
       [6, 5, 2]]) # Previous third row

>>> # Sort the wanted portion of array
>>> x[len(indices):].sort(axis=0)
>>> x
array([[3, 0, 4], #
       [3, 1, 1], # Top is the same, below is sorted
       [0, 5, 9], #
       [2, 0, 2],
       [3, 1, 2],
       [4, 4, 3],
       [6, 5, 4],
       [6, 5, 4],
       [7, 7, 6],
       [7, 8, 8]])

EDIT: This version here should handle if any elements in indices is less than len(indices)

import numpy as np
x = np.random.randint(10, size=(12,3)) 
indices = np.array([1,2,4])

tmp = x[indices]

# Here I just assume that there aren't any values less or equal to -1. If you use 
# float, you can use -np.inf, but there is no such equivalent for ints (which I 
# use in my example).
x[indices] = -1

# The -1 will create dummy rows that will get sorted to be on top of the array,
# which can switch with tmp later
x.sort(axis=0) 
x[indices] = tmp
Naphat Amundsen
  • 1,519
  • 1
  • 6
  • 17
  • 1
    Ohhh, interesting! I really like that this can be done in place AND without extra steps that require no mental gymnastics. A simple swap before sorting is quite elegant – slaw May 21 '20 at 20:07
  • Little nitpick: If you need to guarantee a stable sort this will not work. – Paul Panzer May 21 '20 at 23:18
  • @PaulPanzer When you say “stable sort” are you referring to cases in which there is a tie? I thought that only mattered in the case of argsort? – slaw May 21 '20 at 23:25
  • @slaw good point. If you ultimately are only interested in values and values that compare equal are truly indiscriminable then "stable" doesn't mean much. – Paul Panzer May 22 '20 at 00:04
  • Hmmm, unfortunately, I'm getting a race condition or a FIFO situation in the swapping line. I think it would be safer to have the swap happen through an intermediate array – slaw May 22 '20 at 01:00
  • So, for an `x = np.array([[0, 0], [1, 1]])`, `x[0], x[1] = x[1], x[0]` would produce `[[1, 1], [1, 1]]`. But you could do `tmp[:] = x[0]`, `x[0] = x[1]`, and then `x[1] = tmp`. – slaw May 22 '20 at 01:14
  • An explanation to your problem is that numpy's basic indexing does not really copy data, so your method of explicitly using `tmp` will also fail (it is also equivalent to doing the one liner, its just that the one liner does it behind the scenes). To make it work you have to do `tmp = x[0].copy(), etc...`. You can find more information about that in this answer: https://stackoverflow.com/questions/4857927/swapping-columns-in-a-numpy-array/4857981#4857981 – Naphat Amundsen May 22 '20 at 07:23
  • You could use advanced indexing in your example above, you could do `x[0], x[1] = x[[1]], x[[0]]`, which is essentially `x[0], x[1] = x[1].copy(), x[0].copy()`. With that said, I think my swapping line would be more safe if if it was like this: `x[:len(indices)], x[indices] = x[indices], x[:len(indices)].copy()`. `x[indices]` is using advanced indexing so it implicitly copies, but `x[:len(indices)]` is actually basic indexing which does not copy. I'll edit it in my answer – Naphat Amundsen May 22 '20 at 07:39
  • After double checking the output, yes it is necessary the have the last copy, that is my one liner should be `x[:len(indices)], x[indices] = x[indices], x[:len(indices)].copy()`. I'll update my answer with new values as well! Good that you sanity checked your stuff! – Naphat Amundsen May 22 '20 at 07:45
  • Actually, my `tmp` approach (I pre-define `tmp` upfront for re-use) would not fail because `tmp[:] = x[0]` is not a view, it is copying values over into an index range (I am currently using this). `tmp = x[0]` (without `[:]`) would be a view and could be altered along the way. I think the use of a `tmp` array along with `tmp[:]` is the correct way. – slaw May 22 '20 at 20:55
  • There is one other issue and that is if the indices are within `len(indices)`. For example, if `indices = np.array([1, 2, 4])`, then the results are wrong.So, we have to keep track of this (whether `indices < len(indices)` and accommodate this) AND respect the order of `indices` so that they appear in that order at the top of `x` – slaw May 22 '20 at 21:27
  • That certainly makes the problem more hairy to deal with. Tbh I was hoping that such a case wouldn't happen. I think the general swap and sort method wont work then, well at least the swapping gotta be "smarter" (as you say). – Naphat Amundsen May 22 '20 at 21:39
  • Another thing you can do is `tmp = x[indices]`, then you do `x[indices] = -np.inf` (or any value less than your lowest bound), then `x.sort(axis=0)`. `x[:len(indices)]` should then only contain `-np,inf`, which you can simply assign `tmp` to, that is `x[indices] = tmp`. But the code wouldn't really be as "elegant" as I'd like though. – Naphat Amundsen May 22 '20 at 21:46
  • I added a little code section in the bottom of my answer for clarity – Naphat Amundsen May 22 '20 at 21:54