
I have one-hot encoded data of arbitrary shape stored in a 3-dimensional array (ndim = 3), e.g.:

import numpy as np

arr = np.array([ # Axis 0
    [ # Axis 1
        [0, 1, 0], # Axis 2
        [1, 0, 0],
    ],
    [
        [0, 0, 1],
        [0, 1, 0],
    ],
])

What I want is to shuffle values for a known fraction of sub-arrays along axis=2.

If this fraction is 0.25, then the result could be:

arr = np.array([
    [
        [1, 0, 0], # Shuffling happened here
        [1, 0, 0],
    ],
    [
        [0, 0, 1],
        [0, 1, 0],
    ],
])

I know how to do that using iterative methods like:

for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        if np.random.choice([0, 1, 2, 3]) == 0:
            np.random.shuffle(arr[i][j])

But this is extremely inefficient.

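Edit: as suggested in the comments, the random selection of a known fraction should follow a uniform law, i.e. exactly that fraction of sub-arrays is selected rather than each sub-array being selected independently with that probability.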
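To make the distinction concrete, here is a minimal sketch of the two selection schemes. The array sizes, the seed, and the variable names are assumptions chosen for illustration only; they are not part of the original problem.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
n_rows, n_cols, fraction = 200, 50, 0.25
total = n_rows * n_cols

# Independent coin flips: each sub-array is selected with probability `fraction`,
# so the number of selected sub-arrays varies from run to run (binomial).
bernoulli_mask = rng.random((n_rows, n_cols)) < fraction

# Exact fraction: draw int(total * fraction) distinct flat positions.
n_selected = int(total * fraction)
flat = rng.choice(total, size=n_selected, replace=False)
exact_mask = np.zeros(total, dtype=bool)
exact_mask[flat] = True
exact_mask = exact_mask.reshape(n_rows, n_cols)

print(bernoulli_mask.sum(), exact_mask.sum())
```

The second count is always `int(total * fraction)`, while the first only averages to roughly that value.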

desertnaut
Synthase
  • Does this answer your question? [Shuffling NumPy array along a given axis](https://stackoverflow.com/questions/5040797/shuffling-numpy-array-along-a-given-axis) – yann ziselman Oct 24 '21 at 11:04
  • @yannziselman please read my question and the post you mention. It is pretty clear why both are not related. – Synthase Oct 24 '21 at 11:25
  • A note for the question and the answers: selecting sub-arrays by calling `np.random.randint(0, 4)==0` `N*M` times is clearly *not statistically equivalent* to selecting a known 0.25 fraction of sub-arrays. The first method follows a *binomial law* while the second follows a *uniform law*. Using `np.random.random((N, M, 1)) > 0.25` follows a binomial law, while using `np.random.choice` with a fraction of 25% of the array follows a uniform law. I think you should clarify this in your question so the answers can match your requirements. – Jérôme Richard Oct 24 '21 at 11:25
  • @JérômeRichard thanks for helping to clarify, I edited my post. – Synthase Oct 24 '21 at 11:30
  • Be careful: repeating N experiments that each follow a uniform law gives a binomial law overall. So I think calling `np.random.choice` outside the loop, as @DaniMesejo did, matches your requirements. Anyway, thank you for clarifying this. – Jérôme Richard Oct 24 '21 at 11:33

2 Answers


Your iterative method is fine and, in terms of the number of logical operations involved, probably the best solution. The only way to do better, to my knowledge, is to take advantage of NumPy's vectorisation speedup. The following code is an example:

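def permute_last_maybe(x):
    N, M, K = x.shape
    # Move the last axis to the front, because np.random.permutation
    # only shuffles along the first axis.
    y = np.transpose(x, [2, 0, 1])
    # Note: the same permutation is applied to every sub-array.
    y = np.random.permutation(y)
    y = np.transpose(y, [1, 2, 0])
    # Keep the original sub-array where the mask is true (probability 0.75).
    mask = (np.random.random((N, M, 1)) > 0.25) * np.ones([N, M, K])
    return np.where(mask, x, y)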

The `%timeit` magic shows about 300 µs instead of 4.2 ms for an array of shape (40, 40, 30). Note that this code does NOT use the new random Generators from NumPy (I tried, but the overhead of creating an instance of the class was significant).

I should probably mention also that this function does not mutate the given array x but returns a copy of it.
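One caveat: `np.random.permutation` shuffles only along the first axis, so the function above applies the same permutation to every sub-array. If an independent permutation per sub-array is wanted, a possible variant is sketched below. It assumes NumPy 1.20 or later for `Generator.permuted`; the function name and the `fraction` and `rng` parameters are hypothetical, and the selection is still an independent coin flip per sub-array rather than an exact fraction.

```python
import numpy as np

def shuffle_fraction_independent(x, fraction=0.25, rng=None):
    """Return a copy of x where roughly `fraction` of the last-axis
    sub-arrays are shuffled, each with its own permutation."""
    rng = np.random.default_rng() if rng is None else rng
    # Unlike permutation/shuffle, permuted shuffles every 1-D slice
    # along the given axis independently.
    y = rng.permuted(x, axis=-1)
    # True where the shuffled version is used (probability `fraction`).
    mask = rng.random(x.shape[:2] + (1,)) < fraction
    return np.where(mask, y, x)
```

Passing an existing `Generator` through `rng` avoids creating a new instance on every call.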

Gianluca Micchi
  • I checked your solution, the one from @DaniMesejo, and my iterative method on an array with dimensions more realistic for my problem (i.e. (100000, 21, 21)), and you both achieve the task in ~0.6 s versus ~13 s for the iterative method. I may prefer the other answer because I am close to exhausting my RAM, which may happen inside this function when two arrays co-exist in memory. Thank you, however! – Synthase Oct 24 '21 at 11:52

One approach:

import numpy as np

np.random.seed(42)

fraction = 0.25
total = arr.shape[0] * arr.shape[1]

# pick arrays to be shuffled
indices = np.random.choice(np.arange(total), size=int(total * fraction), replace=False)

# convert each flat index to the corresponding multi-index
multi_indices = np.unravel_index(indices, arr.shape[:2])

# gather the selected sub-arrays (advanced indexing returns a copy, not a view)
selected = arr[multi_indices]

# shuffle each selected sub-array by applying argsort to random values of the same shape
shuffled = np.take_along_axis(selected, np.argsort(np.random.random(selected.shape), axis=1), axis=1)

# set the array to the new values
arr[multi_indices] = shuffled
print(arr)

Output (of a single run)

[[[0 1 0]
  [0 0 1]]

 [[0 0 1]
  [0 1 0]]]
Dani Mesejo
  • @Synthase Actually, I noticed that none of us has used the fact that your vectors are one-hot encoded. Creating new vectors should be equivalent to shuffling the existing one and is much faster. I tried to do so defining `shuffled = get_one_hot(np.random.randint(0, arr.shape[2], len(indices)), arr.shape[2])`, where `get_one_hot` is defined in this answer https://stackoverflow.com/a/42874726/5048010 , and removing the unnecessary lines in Dani's answer. The result was 4 times faster, not entirely sure what is the difference in memory use though. – Gianluca Micchi Oct 24 '21 at 12:16
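The idea in the last comment can be sketched as follows. This is not the `get_one_hot` helper from the linked answer; it is a hypothetical in-place variant that writes the new one-hot vectors directly into the selected positions, and the function name and parameters are assumptions. It relies on every sub-array being one-hot, in which case drawing a new uniformly random class is equivalent to shuffling the vector.

```python
import numpy as np

def resample_one_hot_fraction(arr, fraction=0.25, rng=None):
    """In place: replace exactly int(N*M*fraction) one-hot sub-arrays
    of a (N, M, K) array with freshly drawn one-hot vectors."""
    rng = np.random.default_rng() if rng is None else rng
    n_rows, n_cols, n_classes = arr.shape
    total = n_rows * n_cols
    # Pick distinct sub-arrays, as flat indices over the first two axes.
    flat = rng.choice(total, size=int(total * fraction), replace=False)
    rows, cols = np.unravel_index(flat, (n_rows, n_cols))
    # Draw a new class for each selected sub-array.
    new_classes = rng.integers(0, n_classes, size=flat.size)
    # Clear the selected sub-arrays, then set a single 1 in each.
    arr[rows, cols] = 0
    arr[rows, cols, new_classes] = 1
    return arr
```

Because the array is modified in place, no second array of the full shape is created; only the index arrays for the selected sub-arrays are allocated.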