2

I have multiple numpy arrays with the same number of rows (axis_0) that I'd like to shuffle in unison. After one shuffle, I'd like to shuffle them again with a different random seed.


Until now, I've used the solution from "Better way to shuffle two numpy arrays in unison":

def shuffle_in_unison(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)

However, this doesn't work for multiple unison shuffles, since rng_state is always the same.


I've tried to use RandomState in order to get a different seed for each call, but this doesn't even work for a single unison shuffle:
a = np.array([1,2,3,4,5])
b = np.array([10,20,30,40,50])

def shuffle_in_unison(a, b):
    r = np.random.RandomState() # different state from /dev/urandom for each call
    state = r.get_state()
    np.random.shuffle(a) # array([4, 2, 1, 5, 3])
    np.random.set_state(state)
    np.random.shuffle(b) # array([40, 20, 50, 10, 30])
    # -> doesn't work
    return a,b

for i in range(10):
    a,b = shuffle_in_unison(a,b)
    print(a, b)

What am I doing wrong?



Edit:

For everyone that doesn't have huge arrays like me, just use the solution by Francesco (https://stackoverflow.com/a/47156309/3955022):

def shuffle_in_unison(a, b):
    n_elem = a.shape[0]
    indeces = np.random.permutation(n_elem)
    return a[indeces], b[indeces]

The only drawback is that this is not an in-place operation, which is a pity for large arrays like mine (500G).

0vbb

3 Answers

4

I don't know what you are doing wrong in the way you set the state. However, I found an alternative solution: instead of shuffling n arrays, shuffle their indices only once with numpy.random.choice and then reorder all the arrays.

import numpy as np

a = np.array([1,2,3,4,5])
b = np.array([10,20,30,40,50])

def shuffle_in_unison(a, b):
    n_elem = a.shape[0]
    # draw every row index exactly once, in random order
    indeces = np.random.choice(n_elem, size=n_elem, replace=False)
    return a[indeces], b[indeces]

for i in range(5):
    a, b = shuffle_in_unison(a, b)
    print(a, b)

I get:

[5 2 4 3 1] [50 20 40 30 10]
[1 3 4 2 5] [10 30 40 20 50]
[1 2 5 4 3] [10 20 50 40 30]
[3 2 1 4 5] [30 20 10 40 50]
[1 2 5 3 4] [10 20 50 30 40]

edit

Thanks to @Divakar for the suggestion. Here is a more readable way to obtain the same result using numpy.random.permutation:

def shuffle_in_unison(a, b):
    n_elem = a.shape[0]
    indeces = np.random.permutation(n_elem)
    return a[indeces], b[indeces]
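
The same index-permutation idea extends to any number of arrays. The following is only a minimal sketch, not part of the original answer: the variadic signature and the optional `rng` parameter are assumptions added for illustration, and it uses NumPy's newer `np.random.default_rng()` generator instead of the global `np.random` functions. Like the version above, it returns shuffled copies and does not work in place.

```python
import numpy as np

def shuffle_in_unison(*arrays, rng=None):
    """Return row-shuffled copies of all arrays, using one shared permutation."""
    # rng is a hypothetical optional parameter; a fresh generator is used if omitted
    rng = np.random.default_rng() if rng is None else rng
    n_elem = arrays[0].shape[0]
    if any(arr.shape[0] != n_elem for arr in arrays):
        raise ValueError("all arrays must have the same length along axis 0")
    indices = rng.permutation(n_elem)  # a new permutation on every call
    return tuple(arr[indices] for arr in arrays)

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
c = np.arange(10).reshape(5, 2)  # rows [0, 1], [2, 3], ...

for _ in range(3):
    a, b, c = shuffle_in_unison(a, b, c)
    print(a, b)
```

Because every call draws a new permutation, repeated shuffles differ from each other while the rows of all arrays stay aligned.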
Francesco Montesano
  • This ought to be the accepted answer. This is idiomatic NumPy, i.e. perform a single shuffle on an integer index, then use this 1D index array to re-order both 2D arrays. By analogy, this is obviously how you would sort one array and then re-order a second array based on the same ordering – doug Nov 07 '17 at 11:15
  • `np.random.permutation()` is marginally better on performance and a more compact way to express that. – Divakar Nov 07 '17 at 11:21
  • @Divakar: right, I missed that the input can be a number. Thank you for pointing this out. I'll edit my answer – Francesco Montesano Nov 07 '17 at 11:23
  • Thanks a lot for your solution! It seems to be about 10% slower than the set_state method (with 100MB arrays), but that's ok. Does `a[indeces]` return a shuffled copy of the array or does it give a view (-> it's not shuffled in memory)? And just to make sure, if I save the shuffled arrays e.g. with h5py and chunked storage, the row ordering on disk should be the same as in `a[indeces]`, right? – 0vbb Nov 07 '17 at 13:04
  • ``a[indeces]`` returns a copy of the original array with the values reordered. I think that most of the slowdown is because the reordering is not done in place and you need to create a new numpy array. I don't know ``h5py`` so I can't answer the second part. – Francesco Montesano Nov 07 '17 at 13:12
  • Do you know how `np.random.shuffle` does it? I have the feeling that it modifies it in-place (and also based on the docs?), which is really handy since I have 500GB arrays. – 0vbb Nov 07 '17 at 13:15
  • ``np.random.shuffle`` is in-place, so it is likely better if you have a big array to reorder. However, if you cannot make it work for multiple arrays, the in-place bonus doesn't help. For the record: the solution from @cardamom is not in-place either. – Francesco Montesano Nov 07 '17 at 13:27
  • Well, setting the `RandomState` doesn't even work for a single unison shuffling, so I guess that my syntax / way of setting the state is just wrong. I'll probably open a Github issue. – 0vbb Nov 07 '17 at 13:39
2

I don't know exactly what you are doing wrong, but you have not chosen the solution with the most votes on that page, nor the one with the second most votes. Try this one:

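import numpy as np
from sklearn.utils import shuffle

X = np.array([1, 2, 3, 4, 5])       # same data as a in the question
Y = np.array([10, 20, 30, 40, 50])  # same data as b in the question
for i in range(10):
    X, Y = shuffle(X, Y, random_state=i)
    print("X - ", X, "Y - ", Y)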

Output:

X -  [3 5 1 4 2] Y -  [30 50 10 40 20]
X -  [1 5 2 3 4] Y -  [10 50 20 30 40]
X -  [2 4 5 3 1] Y -  [20 40 50 30 10]
X -  [3 1 4 2 5] Y -  [30 10 40 20 50]
X -  [3 2 1 5 4] Y -  [30 20 10 50 40]
X -  [4 3 2 1 5] Y -  [40 30 20 10 50]
X -  [1 5 4 3 2] Y -  [10 50 40 30 20]
X -  [1 3 4 5 2] Y -  [10 30 40 50 20]
X -  [2 4 3 1 5] Y -  [20 40 30 10 50]
X -  [1 2 4 3 5] Y -  [10 20 40 30 50]
cardamom
  • Your solution, although valid, has the drawback that it requires the full scikit-learn package for a single functionality. – Francesco Montesano Nov 07 '17 at 11:13
  • Fair comment, although I am not sure that scikit-learn takes more space on the hard drive than numpy. Only `shuffle` is imported and the solution uses fewer lines of code – cardamom Nov 07 '17 at 11:20
  • Great Solution! One line of code! This should be the best one!! – Chapin Jul 25 '21 at 13:04
1

I don't normally have to shuffle my data more than once at a time. But this function accommodates any number of input arrays, as well as any number of random shuffles - and it shuffles in-place.

import numpy as np


def shuffle_arrays(arrays, shuffle_quant=1):
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    max_int = 2**(32 - 1) - 1

    for i in range(shuffle_quant):
        seed = np.random.randint(0, max_int)
        for arr in arrays:
            rstate = np.random.RandomState(seed)
            rstate.shuffle(arr)

It can be used like this:

a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])

shuffle_arrays([a, b, c], shuffle_quant=5)

A few things to note:

  • Method uses NumPy and no other packages.
  • The assert ensures that all input arrays have the same length along their first dimension.
  • The max_int keeps random seed within int32 range.
  • Arrays shuffled in-place by their first dimension - nothing returned.

After the shuffle, the data can be split using np.split or referenced using slices - depending on the application.
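
The same principle also suggests what likely goes wrong in the question's second snippet: there, `a` is shuffled with the global generator before the saved state is installed, so only `b` is shuffled from that state. A minimal sketch of a corrected variant follows. It assumes a single local `RandomState` whose state is saved and restored between the two shuffles; the optional `rng` parameter is hypothetical and added only for illustration.

```python
import numpy as np

def shuffle_in_unison(a, b, rng=None):
    """Shuffle a and b in place with the same permutation."""
    # rng is a hypothetical optional parameter; a freshly seeded
    # RandomState is created on each call if it is omitted
    rng = np.random.RandomState() if rng is None else rng
    state = rng.get_state()  # remember the state before the first shuffle
    rng.shuffle(a)           # shuffle a starting from that state
    rng.set_state(state)     # rewind the same generator
    rng.shuffle(b)           # b now receives the identical permutation

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

for _ in range(3):
    shuffle_in_unison(a, b)
    print(a, b)
```

Both shuffles go through the same generator object, so nothing depends on the global `np.random` state, and each call with a fresh `RandomState` produces a different ordering.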

Isaac B