numpy shuffle a fraction of sub-arrays

Question

I have one-hot encoded data of undefined shape within an array of ndim = 3, e.g.:

import numpy as np

arr = np.array([ # Axis 0
    [ # Axis 1
        [0, 1, 0], # Axis 2
        [1, 0, 0],
    ],
    [
        [0, 0, 1],
        [0, 1, 0],
    ],
])

What I want is to shuffle values for a known fraction of sub-arrays along axis=2.

If this fraction is 0.25, then the result could be:

arr = np.array([
    [
        [1, 0, 0], # Shuffling happened here
        [1, 0, 0],
    ],
    [
        [0, 0, 1],
        [0, 1, 0],
    ],
])

I know how to do that using iterative methods like:

for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        if np.random.choice([0, 1, 2, 3]) == 0:
            np.random.shuffle(arr[i][j])

But this is extremely inefficient.

Edit: as suggested in the comments, the random selection of a known fraction should follow a uniform law.

Does this answer your question? [Shuffling NumPy array along a given axis](https://stackoverflow.com/questions/5040797/shuffling-numpy-array-along-a-given-axis) — yann ziselman, Oct 24 '21 at 11:04
@yannziselman please read my question and the post you mention. It is pretty clear why both are not related. — Synthase, Oct 24 '21 at 11:25
A note for the question and the answers: selecting sub-arrays by calling `np.random.randint(0, 4)==0` `N*M` times is clearly *not statistically equivalent* to selecting a known 0.25 fraction of sub-arrays. The first method follows a *binomial law* while the second follows a *uniform law*. Using `np.random.random((N, M, 1)) > 0.25` follows a binomial law, while using `np.random.choice` with a fraction of 25% of the array follows a uniform law. I think you should clarify this in your question so the answers can match your requirements. — Jérôme Richard, Oct 24 '21 at 11:25
@JérômeRichard thanks for helping to clarify, I edited my post. — Synthase, Oct 24 '21 at 11:30
Be careful: repeating N experiments, each following a uniform law, gives a binomial law overall. So I think calling `np.random.choice` outside the loop, as @DaniMesejo did, matches your requirements. Anyway, thank you for clarifying this. — Jérôme Richard, Oct 24 '21 at 11:33
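
To make the distinction in the comments above concrete, here is a minimal sketch (the names `bernoulli_mask` and `exact_mask` and the sizes are illustrative assumptions, not from the thread). Independent coin flips select a random number of sub-arrays, while sampling without replacement selects exactly the requested count:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 40, 40            # illustrative sizes
fraction = 0.25
total = N * M

# Independent coin flips: the number of selected sub-arrays is itself
# random (binomially distributed around total * fraction).
bernoulli_mask = rng.random((N, M)) < fraction

# Sampling without replacement: exactly int(total * fraction) sub-arrays
# are selected, each subset of that size being equally likely.
flat = rng.choice(total, size=int(total * fraction), replace=False)
exact_mask = np.zeros(total, dtype=bool)
exact_mask[flat] = True
exact_mask = exact_mask.reshape(N, M)

print(bernoulli_mask.sum(), exact_mask.sum())
```

The second count is always `int(total * fraction)`; the first varies from run to run.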

Gianluca Micchi · Answer 1 · 2021-10-24T11:12:36.357

Your iterative method is great and definitely the best solution in terms of the number of logical operations involved. The only way to do better, to my knowledge, is to take advantage of numpy's vectorisation speedup. The following code is an example:

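def permute_last_maybe(x):
    N, M, K = x.shape
    # move the last axis to the front so that it is the one being permuted
    y = np.transpose(x, [2, 0, 1])
    # note: this permutes axis 0 only, so all sub-arrays share one permutation
    y = np.random.permutation(y)
    # restore the original (N, M, K) layout
    y = np.transpose(y, [1, 2, 0])
    # keep x where the mask is non-zero (~75%), take the shuffled y elsewhere
    mask = (np.random.random((N, M, 1)) > 0.25) * np.ones([N, M, K])
    return np.where(mask, x, y)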

`%timeit` shows about 300 µs instead of 4.2 ms for an array of shape (40, 40, 30). Note that this code does NOT use the new random Generators from numpy (I tried, but the overhead of creating an instance of the class was significant).

I should probably mention also that this function does not mutate the given array x but returns a copy of it.
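
One caveat, as far as I can tell: `np.random.permutation` reorders the first axis only, so in the function above every selected sub-array receives the same permutation. If independent shuffles are needed, a possible variant is sketched below. It assumes NumPy >= 1.20 for `Generator.permuted`; the function name `permute_last_independently` and its parameters are hypothetical. The generator is passed in so that it is created only once.

```python
import numpy as np

def permute_last_independently(x, fraction=0.25, rng=None):
    """Return a copy of x with roughly `fraction` of the sub-arrays shuffled."""
    rng = np.random.default_rng() if rng is None else rng
    N, M, K = x.shape
    # permuted() returns a copy in which each slice along the last axis
    # is shuffled independently of the others.
    shuffled = rng.permuted(x, axis=-1)
    # Coin-flip mask: each sub-array is selected with probability `fraction`.
    mask = rng.random((N, M, 1)) < fraction
    return np.where(mask, shuffled, x)
```

Like the original function, this does not mutate `x`, and the selection is a per-sub-array coin flip rather than an exact fraction.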

I checked your solution, the one of @DaniMesejo and my iterative method on an array with dimensions more realistic w.r.t. my problem (i.e. (100000, 21, 21)), and you both achieve the task in ~0.6 sec. versus ~13 sec. for the iterative method. I may prefer the other answer because I am close to overloading my RAM, which may happen inside this function when 2 arrays co-exist in memory. Thank you, however! — Synthase, Oct 24 '21 at 11:52

score 1 · Accepted Answer · answered Oct 24 '21 at 11:05

One approach:

import numpy as np

np.random.seed(42)

fraction = 0.25
total = arr.shape[0] * arr.shape[1]

# pick arrays to be shuffled
indices = np.random.choice(np.arange(total), size=int(total * fraction), replace=False)

# convert each flat index to the corresponding multi-index
multi_indices = np.unravel_index(indices, arr.shape[:2])

# gather the selected sub-arrays (advanced indexing returns a copy, not a view)
selected = arr[multi_indices]

# shuffle each selected sub-array by applying argsort on random values of the same shape
shuffled = np.take_along_axis(selected, np.argsort(np.random.random(selected.shape), axis=1), axis=1)

# set the array to the new values
arr[multi_indices] = shuffled
print(arr)

Output (of a single run)

[[[0 1 0]
  [0 0 1]]

 [[0 0 1]
  [0 1 0]]]
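
The argsort step may be easier to see in isolation. The sketch below uses a small stand-in array (`selected` here is made up for illustration, not the question's data): sorting independent random values row by row yields one permutation of the column indices per row, and `np.take_along_axis` applies it.

```python
import numpy as np

rng = np.random.default_rng(1)
selected = np.arange(12).reshape(4, 3)   # stand-in for the selected sub-arrays

# Each row of `order` is an independent permutation of 0..K-1.
order = np.argsort(rng.random(selected.shape), axis=1)
shuffled = np.take_along_axis(selected, order, axis=1)

# Every row keeps the same elements; only their order changes.
assert (np.sort(shuffled, axis=1) == selected).all()
```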

@Synthase Actually, I noticed that none of us has used the fact that your vectors are one-hot encoded. Creating new vectors should be equivalent to shuffling the existing ones and is much faster. I tried this by defining `shuffled = get_one_hot(np.random.randint(0, arr.shape[2], len(indices)), arr.shape[2])`, where `get_one_hot` is defined in this answer https://stackoverflow.com/a/42874726/5048010 , and removing the unnecessary lines in Dani's answer. The result was 4 times faster, though I am not entirely sure what the difference in memory use is. — Gianluca Micchi, Oct 24 '21 at 12:16
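
A rough sketch of the idea in the comment above, written without the linked `get_one_hot` helper (which is not reproduced here). The function name `resample_one_hot` and its parameters are hypothetical. It assumes every sub-array is strictly one-hot, in which case a uniformly random shuffle is the same as placing the 1 at a uniformly random position:

```python
import numpy as np

def resample_one_hot(arr, fraction=0.25, rng=None):
    """In place: redraw the hot position of an exact fraction of sub-arrays."""
    rng = np.random.default_rng() if rng is None else rng
    N, M, K = arr.shape
    total = N * M
    n_selected = int(total * fraction)
    # Pick exactly n_selected distinct sub-arrays.
    flat = rng.choice(total, size=n_selected, replace=False)
    rows, cols = np.unravel_index(flat, (N, M))
    # Draw a new hot position for each selected sub-array.
    hot = rng.integers(0, K, size=n_selected)
    arr[rows, cols] = 0        # clear the selected sub-arrays
    arr[rows, cols, hot] = 1   # set the new hot entry
    return arr
```

Because it writes directly into `arr`, this sketch only allocates the index arrays, which may help when memory is tight.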
