338

I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond -- i.e. shuffle them in unison with respect to their leading indices.

This code works, and illustrates my goals:

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b

For example:

>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
       [1, 1],
       [3, 3]]), array([2, 1, 3]))

However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays -- I'd rather shuffle them in-place, since they'll be quite large.

Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.

One other thought I had was this:

def shuffle_in_unison_scary(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)

This works...but it's a little scary, as I see little guarantee it'll continue to work -- it doesn't look like the sort of thing that's guaranteed to survive across numpy versions, for example.

kmario23
Josh Bleecher Snyder
    Six years later, I'm amused and surprised by how popular this question proved to be. And in a bit of delightful coincidence, for Go 1.10 I [contributed math/rand.Shuffle to the standard library](https://golang.org/cl/51891). The design of the API makes it trivial to shuffle two arrays in unison, and doing so is even included as an example in the docs. – Josh Bleecher Snyder Dec 02 '17 at 01:53
    This is a different programming language however. – Audrius Meškauskas Mar 15 '21 at 08:45

18 Answers

475

You can use NumPy's array indexing:

def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = numpy.random.permutation(len(a))
    return a[p], b[p]

This will result in the creation of separate unison-shuffled arrays.
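
For example (a minimal usage sketch reusing the arrays from the question; the exact permutation will differ from run to run):

>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> unison_shuffled_copies(a, b)
(array([[3, 3],
       [1, 1],
       [2, 2]]), array([3, 1, 2]))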

Íhor Mé
mtrw
    This *does* create copies, as it uses advanced indexing. But of course it is faster than the original. – Sven Marnach Jan 05 '11 at 10:24
    @mtrw: The mere fact that the original arrays are untouched does not outrule that the returned arrays are views of the same data. But they are indeed not, since NumPy views are not flexible enough to support permuted views (this wouldn't be desirable either). – Sven Marnach Jan 05 '11 at 15:01
  • I tried this function with the time module and it is not faster than the previous one. Did I do something wrong? – Dat Chu Jan 05 '11 at 15:14
    @Sven - I really have to learn about views. @Dat Chu - I just tried `>>> t = timeit.Timer(stmt = "(a,b)", setup = "import numpy as np; a,b = np.arange(4), np.arange(4*20).reshape((4,20))")>>> t.timeit()` and got 38 seconds for the OP's version, and 27.5 seconds for mine, for 1 million calls each. – mtrw Jan 05 '11 at 16:01
    I really like the simplicity and readability of this, and advanced indexing continues to surprise and amaze me; for that this answer readily gets +1. Oddly enough, though, on my (large) datasets, it is slower than my original function: my original takes ~1.8s for 10 iterations, and this takes ~2.7s. Both numbers are quite consistent. The dataset I used to test has `a.shape` is `(31925, 405)` and `b.shape` is `(31925,)`. – Josh Bleecher Snyder Jan 05 '11 at 17:46
  • @Josh - yep, you're right. That's really odd. I tried a square array (NxN) and a linear array (N), and your original function is faster starting around N = 75. I'm perplexed. But, as Sven pointed out, your second idea of resetting the state of the RNG is probably the easiest way to go anyway. – mtrw Jan 05 '11 at 18:10
    Maybe, the slowness has to do with the fact that you're not doing things in-place, but are instead creating new arrays. Or with some slowness related to how CPython parses array-indexes. – Íhor Mé Oct 20 '16 at 17:08
  • Thank you @mtrw for your service to Python. The gods favor you. – legel Dec 24 '20 at 13:32
214
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=0)

To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html

James
    This solution creates [copies](https://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html) (_"The original arrays are not impacted"_), whereas the author's "scary" solution doesn't. – bartolo-otrit Mar 14 '20 at 09:52
    You can choose any style as you like – James Mar 16 '20 at 06:20
81

Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.

If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.

Example: Let's assume the arrays a and b look like this:

a = numpy.array([[[  0.,   1.,   2.],
                  [  3.,   4.,   5.]],

                 [[  6.,   7.,   8.],
                  [  9.,  10.,  11.]],

                 [[ 12.,  13.,  14.],
                  [ 15.,  16.,  17.]]])

b = numpy.array([[ 0.,  1.],
                 [ 2.,  3.],
                 [ 4.,  5.]])

We can now construct a single array containing all the data:

c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[  0.,   1.,   2.,   3.,   4.,   5.,   0.,   1.],
#        [  6.,   7.,   8.,   9.,  10.,  11.,   2.,   3.],
#        [ 12.,  13.,  14.,  15.,  16.,  17.,   4.,   5.]])

Now we create views simulating the original a and b:

a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)

The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).

In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.

This solution could be adapted to the case that a and b have different dtypes.
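
For instance, one way to handle different dtypes (a sketch of my own using a structured array, not something the answer spells out) would be:

c = numpy.empty(len(a), dtype=[('a', a.dtype, a.shape[1:]),
                               ('b', b.dtype, b.shape[1:])])
c['a'], c['b'] = a, b       # pack both arrays into one buffer
numpy.random.shuffle(c)     # shuffles both fields together along axis 0
a2, b2 = c['a'], c['b']     # views into the shuffled buffer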

Sven Marnach
    Re: the scary solution: I just worry that arrays of different shapes could (conceivably) yield different numbers of calls to the rng, which would cause divergence. However, I think you are right that the current behavior is perhaps unlikely to change, and a very simple doctest does make confirming correct behavior very easy... – Josh Bleecher Snyder Jan 05 '11 at 17:49
  • I like your suggested approach, and could definitely arrange to have a and b start life as a unified c array. However, a and b will need to be contiguous shortly after shuffling (for efficient transfer to a GPU), so I think that, in my particular case, I'd end up making copies of a and b anyway. :( – Josh Bleecher Snyder Jan 05 '11 at 17:51
    @Josh: Note that `numpy.random.shuffle()` operates on arbitrary mutable sequences, such as Python lists or NumPy arrays. The array shape does not matter, only the length of the sequence. This is *very* unlikely to change in my opinion. – Sven Marnach Jan 05 '11 at 19:11
  • I didn't know that. That makes me much more comfortable with it. Thank you. – Josh Bleecher Snyder Jan 05 '11 at 19:17
  • @SvenMarnach : I posted an answer below. Can you comment on whether you think it makes sense/ is a good way to do it? – ajfbiw.s Feb 10 '16 at 17:43
  • Is there a possibility that numpy will be updated to automatically change the RNG state whenever a random function is called? – Abhimanyu Pallavi Sudhir Jun 23 '20 at 11:56
  • @AbhimanyuPallaviSudhir I'm not quite sure what you are referring to. The RNG state does advance whenever you call a function using random bits – otherwise you'd get the same bits with every call. – Sven Marnach Jun 23 '20 at 13:42
  • @SvenMarnach Does it advance before or after calling such a random function? I'm guessing after. I'm saying what if they change that to before? (Presumably the next state is not a deterministic function of the current state, i.e. it depends on the current time, or something like that -- correct?) – Abhimanyu Pallavi Sudhir Jun 23 '20 at 16:34
  • @AbhimanyuPallaviSudhir The state is changed _during_ the function call, as part of generating a random number. It's neither before nor after. And the new state _is_ a deterministic, pure function of the old state, which is why good PRNGs have lots of entropy in their state. Usually you _seed_ the state with some non-deterministic entropy, but after that all further steps are deterministic. – Sven Marnach Jun 23 '20 at 20:04
46

Very simple solution:

randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]

The two arrays x and y are now both randomly shuffled in the same way.

connor
28

In 2015, James wrote a helpful sklearn solution. But he added a random_state variable, which is not needed. In the code below, NumPy's global random state is used automatically.

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y)
Massinissa
Daniel
25
from numpy.random import permutation
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data #numpy array
y = iris.target #numpy array

# Data is currently unshuffled; we should shuffle 
# each X[i] with its corresponding y[i]
perm = permutation(len(X))
X = X[perm]
y = y[perm]
benjaminjsanders
21

Shuffle any number of arrays together, in-place, using only NumPy.

import numpy as np


def shuffle_arrays(arrays, set_seed=-1):
    """Shuffles arrays in-place, in the same order, along axis=0

    Parameters:
    -----------
    arrays : List of NumPy arrays.
    set_seed : Seed value if int >= 0, else seed is random.
    """
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed

    for arr in arrays:
        rstate = np.random.RandomState(seed)
        rstate.shuffle(arr)

And can be used like this

a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])

shuffle_arrays([a, b, c])

A few things to note:

  • The assert ensures that all input arrays have the same length along their first dimension.
  • Arrays are shuffled in-place along their first dimension - nothing is returned.
  • The random seed stays within the positive int32 range.
  • If a repeatable shuffle is needed, seed value can be set.

After the shuffle, the data can be split using np.split or referenced using slices - depending on the application.
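
For instance (a sketch assuming an 80/20 split, which is not part of the answer itself):

shuffle_arrays([a, b, c])
split = int(0.8 * len(a))              # example split point, not from the answer
a_train, a_val = np.split(a, [split])  # split into two pieces
b_train, b_val = b[:split], b[split:]  # or just use slices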

Isaac B
    beautiful solution, this worked perfect for me. Even with arrays of 3+ axis – wprins Nov 01 '18 at 19:41
    This is the correct answer. There is no reason to use the global np.random when you can pass around random state objects. – Erotemic Feb 04 '20 at 13:34
  • One `RandomState` could be used outside of the loop. See Adam Snaider's [answer](https://stackoverflow.com/a/47584567/704244) – bartolo-otrit Mar 14 '20 at 11:04
    @bartolo-otrit, the choice that has to be made in the `for` loop is whether to reassign or reseed random state. With the number of arrays being passed into a shuffling function expected to be small, I wouldn't expect a performance difference between the two. But yes, rstate could be assigned outside the loop and reseeded inside the loop on each iteration. – Isaac B Mar 15 '20 at 18:35
13

You can make an index array like this:

s = np.arange(0, len(a), 1)

then shuffle it:

np.random.shuffle(s)

Now use this s to index your arrays. The same shuffled indices return vectors shuffled in the same way.

x_data = x_data[s]
x_label = x_label[s]
  • Really, this is the best solution, and should be the accepted one! It even works for many (more than 2) arrays at the same time. The idea is simple: just shuffle the index list [0, 1, 2, ..., n-1] , and then reindex the arrays' rows with the shuffled indexes. Nice! – Basj Nov 16 '18 at 15:24
7

There is a well-known function that can handle this:

from sklearn.model_selection import train_test_split
X, _, Y, _ = train_test_split(X,Y, test_size=0.0)

Just setting test_size to 0 will avoid splitting and give you shuffled data. Though it is usually used to split train and test data, it does shuffle them too.
From the documentation:

Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.

sziraqui
7

This seems like a very simple solution:

import numpy as np

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    c = np.arange(len(a))
    np.random.shuffle(c)
    return a[c], b[c]

a =  np.asarray([[1, 1], [2, 2], [3, 3]])
b =  np.asarray([11, 22, 33])

shuffle_in_unison(a,b)
Out[94]: 
(array([[3, 3],
        [2, 2],
        [1, 1]]),
 array([33, 22, 11]))
andrea m.
6

One way to shuffle connected lists in-place is to use a seed (it could be random) and numpy.random.shuffle to do the shuffling.

# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
   np.random.seed(seed)
   np.random.shuffle(a)
   np.random.seed(seed)
   np.random.shuffle(b)

That's it. This will shuffle both a and b in the exact same way. It is also done in-place, which is always a plus.

EDIT: don't use np.random.seed(); use np.random.RandomState instead:

def shuffle(a, b, seed):
   rand_state = np.random.RandomState(seed)
   rand_state.shuffle(a)
   rand_state.seed(seed)
   rand_state.shuffle(b)

When calling it just pass in any seed to feed the random state:

a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)

Output:

>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]

Edit: Fixed code to re-seed the random state

Adam Snaider
  • This code does not work. `RandomState` changes state on the first call and `a` and `b` are not shuffled in unison. – Bruno Klein Jan 24 '18 at 20:47
  • @BrunoKlein You are right. I fixed the post to re-seed the random state. Also, even though it is not in unison in the sense of both lists being shuffled at the same time, they are in unison in the sense that both are shuffled in the same way, and it also doesn't require more memory to hold a copy of the lists (which OP mentions in his question) – Adam Snaider Feb 20 '18 at 04:38
2

Say we have two arrays: a and b.

a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[9,1,1],[6,6,6],[4,2,0]]) 

We can first obtain row indices by permuting the first dimension:

indices = np.random.permutation(a.shape[0])
[1 2 0]

Then use advanced indexing. Here we are using the same indices to shuffle both arrays in unison.

a_shuffled = a[indices[:,np.newaxis], np.arange(a.shape[1])]
b_shuffled = b[indices[:,np.newaxis], np.arange(b.shape[1])]

This is equivalent to

np.take(a, indices, axis=0)
[[4 5 6]
 [7 8 9]
 [1 2 3]]

np.take(b, indices, axis=0)
[[6 6 6]
 [4 2 0]
 [9 1 1]]
monolith
2

Most of the solutions above work; however, if you have column vectors you have to transpose them first. Here is an example:

def shuffle(self) -> None:
    """
    Shuffles X and Y
    """
    x = self.X.T
    y = self.Y.T
    p = np.random.permutation(len(x))
    self.X = x[p].T
    self.Y = y[p].T
Diego Velez
1

If you want to avoid copying arrays, then I would suggest that instead of generating a permutation list, you go through every element in the array, and randomly swap it to another position in the array

for old_index in reversed(range(1, len(a))):
    new_index = numpy.random.randint(old_index + 1)
    # swap via fancy-index assignment; a plain tuple swap would alias
    # row views when the elements are sub-arrays of a 2D array
    a[[old_index, new_index]] = a[[new_index, old_index]]
    b[[old_index, new_index]] = b[[new_index, old_index]]

This implements the Knuth-Fisher-Yates shuffle algorithm.

DaveP
    http://www.codinghorror.com/blog/2007/12/the-danger-of-naivete.html has made me wary of implementing my own shuffle algorithms; it is in part responsible for my asking this question. :) However, you are very right to point out that I should consider using the Knuth-Fisher-Yates algorithm. – Josh Bleecher Snyder Jan 05 '11 at 07:54
  • Well spotted, I've fixed the code now. Anyway, I think the basic idea of in-place shuffling is scalable to an arbitrary number of arrays an avoids making copies. – DaveP Jan 05 '11 at 08:32
  • The code is still incorrect (it won't even run). To make it work, replace `len(a)` by `reversed(range(1, len(a)))`. But it won't be very efficient anyway. – Sven Marnach Jan 05 '11 at 10:49
1

The shortest and easiest way, in my opinion, is to use a seed:

import random

random.seed(seed)
random.shuffle(x_data)
# reset the same seed to get the identical random sequence and shuffle the y
random.seed(seed)
random.shuffle(y_data)
momo668
0

By way of example, this is what I'm doing:

import numpy as np
from random import shuffle  # assumed import; the original snippet omitted it

combo = []
for i in range(60000):
    combo.append((images[i], labels[i]))

shuffle(combo)

im = []
lab = []
for c in combo:
    im.append(c[0])
    lab.append(c[1])
images = np.asarray(im)
labels = np.asarray(lab)
ajfbiw.s
    This is more or less equivalent to `combo = zip(images, labels); shuffle(combo); im, lab = zip(*combo)`, just slower. Since you are using Numpy anyway, a yet much faster solution would be to zip the arrays using Numpy `combo = np.c_[images, labels]`, shuffle, and unzip again `images, labels = combo.T`. Assuming that `labels` and `images` are one-dimensional Numpy arrays of the same length to begin with, this will be easily the fastest solution. If they are multi-dimensional, see my answer above. – Sven Marnach Feb 10 '16 at 18:07
  • Ok that makes sense. Thanks! @SvenMarnach – ajfbiw.s Feb 10 '16 at 18:09
0

I extended python's random.shuffle() to take a second arg:

import random

def shuffle_together(x, y):
    assert len(x) == len(y)

    for i in reversed(range(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random.random() * (i + 1))
        x[i], x[j] = x[j], x[i]
        y[i], y[j] = y[j], y[i]

That way I can be sure that the shuffling happens in-place, and the function is not all too long or complicated.

Ivo
0

Just use numpy...

First merge the two input arrays (the 1D array is the labels y, the 2D array is the data x), shuffle them with NumPy's shuffle method, and finally split them again and return.

import numpy as np

def shuffle_2d(a, b):
    rows= a.shape[0]
    if b.shape != (rows,1):
        b = b.reshape((rows,1))
    S = np.hstack((b,a))
    np.random.shuffle(S)
    b, a  = S[:,0], S[:,1:]
    return a,b

features, samples = 2, 5
x, y = np.random.random((samples, features)), np.arange(samples)
x, y = shuffle_2d(x, y)
Tonechas
szZzr