84

What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e. how to write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times.

Edit: key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index that loses all that information. I want the resulting df to be the same as the original except with the order of rows or order of columns different.

Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't have the same associations between a and b as you do if you just re-order each row as a whole. Something like:

for 1...n:
  for each col in df: shuffle column
return new_df

But hopefully more efficient than naive looping. This does not work for me:

def shuffle(df, n, axis=0):
        shuffled_df = df.copy()
        for k in range(n):
            shuffled_df.apply(np.random.shuffle(shuffled_df.values),axis=axis)
        return shuffled_df

df = pandas.DataFrame({'A':range(10), 'B':range(10)})
shuffle(df, 5)
Alex Riley

10 Answers

228

Use numpy's random.permutation function:

In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [2]: df
Out[2]:
   A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9


In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
   A  B
0  0  0
5  5  5
6  6  6
3  3  3
8  8  8
7  7  7
9  9  9
1  1  1
2  2  2
4  4  4
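A closely related variant is to permute row positions with iloc rather than row labels with reindex. The following is only a minimal sketch, assuming NumPy 1.17+ for np.random.default_rng; the seed value and the name shuffled are illustrative and not part of the answer itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(10), 'B': range(10)})

# Seeded generator so the shuffle is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)

# Permute row positions instead of labels; selection by position
# does not depend on the index labels being unique.
shuffled = df.iloc[rng.permutation(len(df))]
```

Because whole rows are selected by position, each row keeps its own label and its A/B pairing; only the row order changes.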
Gilad Green
Zelazny7
  • 30
    +1 because this is exactly what I was looking for (even though it turns out it's not what the OP wanted) – Doug Paul Nov 22 '13 at 14:45
  • 4
    Also can use `df.iloc[np.random.permutation(np.arange(len(df)))]` if there's dupes and stuff (and may be faster for mi). – Andy Hayden Jan 28 '14 at 23:13
  • 3
    Nice method. Is there a way to do it in-place though? – Andrew Sep 27 '14 at 15:48
  • 3
    For me (Python v3.6 and Pandas v0.20.1) I had to replace `df.reindex(np.random.permutation(df.index))` by `df.set_index(np.random.permutation(df.index))` to get the desired effect. – Emanuel Jun 29 '17 at 16:25
  • 1
    after `set_index` like Emanuel, I also needed `df.sort_index(inplace=True)` – Shadi Oct 21 '17 at 10:19
  • This does not work anymore. Running python 3.6.5, numpy 1.15.0, pandas 0.23.3, the only solution that worked was Andy Hayden's one : `df.iloc[np.random.permutation(np.arange(len(df)))]` – Sindarus Aug 01 '18 at 12:31
98

Sampling randomizes, so just sample the entire data frame.

df.sample(frac=1)

As @Corey Levinson notes, you have to be careful when you reassign:

df['column'] = df['column'].sample(frac=1).reset_index(drop=True)
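As a rough illustration of the sample approach, the sketch below shuffles rows, shuffles column order, and renumbers the index afterwards. The random_state value and the variable names are assumptions chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': range(5, 10)})

# Shuffle whole rows; the original labels travel with their rows.
rows_shuffled = df.sample(frac=1, random_state=42)

# Shuffle the order of the columns instead (axis=1).
cols_shuffled = df.sample(frac=1, axis=1, random_state=42)

# Discard the old labels if a fresh 0..n-1 index is wanted.
renumbered = rows_shuffled.reset_index(drop=True)
```

Passing random_state makes the shuffle reproducible; leave it out for a different order on every call.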
Roelant
W.P. McNeill
  • 9
    Note if you are trying to reassign a column using this, you have to do `df['column'] = df['column'].sample(frac=1).reset_index(drop=True)` – Corey Levinson Mar 29 '19 at 21:22
43
In [16]: def shuffle(df, n=1, axis=0):     
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:     

In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [18]: shuffle(df)
Out[18]: 
   A  B
0  8  5
1  1  7
2  7  3
3  6  2
4  3  4
5  0  1
6  9  0
7  4  6
8  2  8
9  5  9
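For the independent per-column (or per-row) shuffle the question asks about, newer NumPy versions provide Generator.permuted, which shuffles each slice along an axis independently of the others. This is a minimal sketch, assuming NumPy 1.20+ and a DataFrame whose columns share one dtype; shuffle_independent and its seed parameter are hypothetical names, not part of the answer above.

```python
import numpy as np
import pandas as pd

def shuffle_independent(df, axis=0, seed=None):
    """Hypothetical helper: shuffle within each column (axis=0)
    or within each row (axis=1), independently of the others."""
    rng = np.random.default_rng(seed)
    # permuted returns a new array; the input frame is left untouched.
    values = rng.permuted(df.to_numpy(), axis=axis)
    # Rebuild with the original labels so only the values move.
    return pd.DataFrame(values, index=df.index, columns=df.columns)

df = pd.DataFrame({'A': range(10), 'B': range(10)})
by_column = shuffle_independent(df, axis=0, seed=1)
by_row = shuffle_independent(df, axis=1, seed=1)
```

A single call already gives a uniformly random arrangement, so repeating the shuffle n times should not be needed.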
root
  • 2
    How do I distinguish rows from column shuffling here? –  Apr 02 '13 at 19:13
  • Thanks.. I clarified my question which was unclear. I am looking to shuffle by row independently of other rows - so shuffle in such a way that you don't always have `1,5` together and `4,8` together (but also not just a column shuffle which limits you to two choices) –  Apr 02 '13 at 19:18
  • 15
    **warning** I thought ``df.apply(np.random.permutation)`` would work as the solution ``df.reindex(np.random.permutation(df.index))`` and looked neater, but actually they behave differently. The latter maintains association between columns of the same row, the former doesn't. My misunderstanding, of course, but hopefully it will save other people from the same mistake. – gozzilli Feb 12 '15 at 10:33
  • 1
    What is 'np' in this context? – Sledge Mar 07 '17 at 20:43
  • 1
    numpy. It's common to do: `import numpy as np` – Aku Mar 30 '17 at 23:40
  • @root What does "n" in the data stands for? Can we change it to other values, what is the max value? – cincin21 Apr 19 '21 at 07:58
  • It seems like `n` is "how many times do you want to shuffle?" In that case, shuffling more than once doesn't make much sense (unless you think the rng is suspect). – Teepeemm Apr 20 '21 at 13:39
  • 1
    I only wanted to do one shuffle so I just used `df.apply(np.random.shuffle, index=1)` but this doesn't seem to do anything, printing the resulting df looks exactly the same as the input. If I do `df = df.apply( ... )` I get a Series with `Nans.` If I do `df.apply( ... inplace=True)` then I get an error. – Veggiet May 29 '21 at 16:31
23

You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support Pandas data frames):

# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))

# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))

outputs:

df:    A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4


df:    A  B
1  1  1
0  0  0
3  3  3
4  4  4
2  2  2

Then you can use df.reset_index() to reset the index, if need be:

df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df))

outputs:

df:    A  B
0  1  1
1  0  0
2  4  4
3  2  2
4  3  3
Franck Dernoncourt
10

A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:

df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df

   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
5  6  6

df.apply(lambda x: x.sample(frac=1).values)

   a  b
0  4  2
1  1  6
2  6  5
3  5  3
4  2  4
5  3  1

You must use .values so that you return a numpy array and not a Series; otherwise the returned Series aligns to the original DataFrame's index and nothing changes:

df.apply(lambda x: x.sample(frac=1))

   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
5  6  6
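The alignment effect can be checked directly. The sketch below is only an illustration, assuming pandas 1.1+ so that a NumPy Generator can be passed as random_state; rng, aligned and shuffled are names made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [1, 2, 3, 4, 5, 6]})
rng = np.random.default_rng(0)

# Returning a Series: pandas realigns on the index labels,
# so a and b remain paired exactly as in the original.
aligned = df.apply(lambda col: col.sample(frac=1, random_state=rng))

# Returning a bare array: values are placed by position,
# so each column ends up shuffled on its own.
shuffled = df.apply(lambda col: col.sample(frac=1, random_state=rng).to_numpy())
```

Checking (aligned['a'] == aligned['b']).all() confirms the pairing survives in the first case, while shuffled holds the same values per column in a new order.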
Ted Petrou
  • Thanks @Ted Exactly why I came here. Spot on! – trazoM May 22 '23 at 11:52
  • I shuffled a single column by doing `np.random.shuffle(df['b'].values)` . Take note that [`np.random.shuffle()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.shuffle.html) modifies your dataframe in-place. – trazoM May 22 '23 at 12:14
6

From the docs use sample():

In [79]: s = pd.Series([0,1,2,3,4,5])

# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]: 
0    0
dtype: int64

# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]: 
5    5
2    2
4    4
dtype: int64

# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]: 
5    5
4    4
1    1
dtype: int64
Evan Zamir
4

I resorted to adapting @root's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing, but it works perfectly for just shuffling the data.

In [1]: import numpy

In [2]: import pandas

In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})    

In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
1000 loops, best of 3: 406 µs per loop

In [5]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 1):
   ...:     numpy.random.shuffle(view)
   ...: 
10000 loops, best of 3: 22.8 µs per loop

In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
1000 loops, best of 3: 746 µs per loop

In [7]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 0):
   ...:     numpy.random.shuffle(view)
   ...: 
10000 loops, best of 3: 23.4 µs per loop

Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions. That is, if we want to shuffle along the first dimension (columns), we need to roll the second dimension to the front, so that we apply the shuffling to views over the first dimension.

In [8]: numpy.rollaxis(df.values, 0).shape
Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)

In [9]: numpy.rollaxis(df.values, 1).shape
Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)

Your final function then uses a trick to bring the result in line with the expectation for applying a function to an axis:

def shuffle(df, n=1, axis=0):     
    df = df.copy()
    axis = int(not axis) # pandas.DataFrame is always 2D
    for _ in range(n):
        for view in numpy.rollaxis(df.values, axis):
            numpy.random.shuffle(view)
    return df
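One caveat: df.values is not guaranteed to be a writable view of the frame in every pandas version (for example with mixed dtypes or with copy-on-write enabled), in which case the in-place shuffle may never reach df. A hedged variant that shuffles an explicit copy and rebuilds the frame could look like this; shuffle_values and seed are hypothetical names, and a single shared dtype is assumed.

```python
import numpy as np
import pandas as pd

def shuffle_values(df, n=1, axis=0, seed=None):
    """Hypothetical variant: shuffle a copy of the values, then rebuild."""
    rng = np.random.default_rng(seed)
    values = df.to_numpy(copy=True)  # writable array we own
    # Iterating over values.T yields one view per column;
    # iterating over values yields one view per row.
    views = values.T if axis == 0 else values
    for _ in range(n):
        for view in views:
            rng.shuffle(view)  # in place, writes through to `values`
    return pd.DataFrame(values, index=df.index, columns=df.columns)

df = pd.DataFrame({"A": range(10), "B": range(10)})
out = shuffle_values(df, n=3, axis=0, seed=0)
```

As with the function above, axis=0 shuffles within each column and axis=1 within each row, and the row and column labels are kept.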
Midnighter
3

This might be more useful when you want your index shuffled.

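import random

def shuffle(df):
    index = list(df.index)
    random.shuffle(index)           # shuffle the labels in place
    df = df.loc[index]              # .ix was removed in pandas 1.0; use .loc
    df = df.reset_index(drop=True)  # reset_index returns a new frame
    return df

It selects the rows in the shuffled index order, then resets the index.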

JeromeZhao
2

I know the question is about a pandas DataFrame, but if the shuffle is done within each row (column order changed, row order unchanged), the column names no longer carry meaning, so it may make sense to work with an np.array instead. In that case np.apply_along_axis() is what you are looking for.

If that is acceptable then this would be helpful, note it is easy to switch the axis along which the data is shuffled.

If your pandas DataFrame is named df, you can:

  1. get the values of the dataframe with values = df.values,
  2. create an np.array from values
  3. apply the method shown below to shuffle the np.array by row or column
  4. recreate a new (shuffled) pandas df from the shuffled np.array

Original array

a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
print(a)
[[10 11 12]
 [20 21 22]
 [30 31 32]
 [40 41 42]]

Keep row order, shuffle columns within each row

print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
 [22 21 20]
 [31 30 32]
 [40 41 42]]

Keep column order, shuffle rows within each column

print(np.apply_along_axis(np.random.permutation, 0, a))
[[40 41 32]
 [20 31 42]
 [10 11 12]
 [30 21 22]]

Original array is unchanged

print(a)
[[10 11 12]
 [20 21 22]
 [30 31 32]
 [40 41 42]]
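The four steps listed earlier can be sketched end to end as follows. This is only an illustration: the column names x, y, z, the seed, and the use of a seeded Generator in place of np.random.permutation are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40],
                   'y': [11, 21, 31, 41],
                   'z': [12, 22, 32, 42]})
rng = np.random.default_rng(0)

# Steps 1-2: pull the values out as a NumPy array.
values = df.to_numpy()

# Step 3: shuffle within each row (axis=1); use 0 to shuffle within columns.
shuffled = np.apply_along_axis(rng.permutation, 1, values)

# Step 4: rebuild a DataFrame, reusing the original labels.
new_df = pd.DataFrame(shuffled, index=df.index, columns=df.columns)
```

Here the labels are reattached in step 4, although after a within-row shuffle the column names no longer describe the values beneath them.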
Raphvanns
0

Here is a workaround I found if you want to shuffle only a subset of the DataFrame:

shuffle_to_index = 20
df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])
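A self-contained version of the same idea, for checking that only the leading rows move, might look like this. The 30-row frame, the seed, and the names head, tail and result are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(30), 'B': range(30)})
shuffle_to_index = 20
rng = np.random.default_rng(0)

# Shuffle the first 20 rows by position; leave the rest in place.
head = df.iloc[rng.permutation(shuffle_to_index)]
tail = df.iloc[shuffle_to_index:]
result = pd.concat([head, tail])
```

The rows from position 20 onward keep their original order, while the first 20 appear in a random order with their labels intact.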
ashimashi