shuffling/permutating a DataFrame in pandas

Question

What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e. how to write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times.

Edit: key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index that loses all that information. I want the resulting df to be the same as the original except with the order of rows or order of columns different.

Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't have the same associations between a and b as you do if you just re-order each row as a whole. Something like:

for 1...n:
  for each col in df: shuffle column
return new_df

But hopefully more efficient than naive looping. This does not work for me:

def shuffle(df, n, axis=0):
        shuffled_df = df.copy()
        for k in range(n):
            shuffled_df.apply(np.random.shuffle(shuffled_df.values),axis=axis)
        return shuffled_df

df = pandas.DataFrame({'A':range(10), 'B':range(10)})
shuffle(df, 5)

[See this simple pandas solution below](https://stackoverflow.com/a/47112434/3707607) — Ted Petrou, Nov 04 '17 at 15:42
^ Your answer does answer the question but it seems is not the answer people are looking for — cs95, Jan 22 '19 at 09:25

score 228 · Answer 1 · edited Apr 22 '18 at 12:01


Use numpy's random.permutation function:

In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [2]: df
Out[2]:
   A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9


In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
   A  B
0  0  0
5  5  5
6  6  6
3  3  3
8  8  8
7  7  7
9  9  9
1  1  1
2  2  2
4  4  4
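Following the comments on this answer, a positional variant avoids label lookups altogether, so it should also cope with duplicate index labels. This is only a minimal sketch: it assumes NumPy's Generator API (`np.random.default_rng`) is available, and the seed and the sample frame are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a duplicated label ("x") in the index.
df = pd.DataFrame({"A": range(5), "B": range(5)},
                  index=["x", "x", "y", "z", "w"])

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Draw a random order of row positions and select by position, so every
# row keeps its own label and its A/B pairing.
shuffled = df.iloc[rng.permutation(len(df))]
print(shuffled)
```

Because the selection is positional, the original `df` is left untouched and the result is a new frame.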

edited Apr 22 '18 at 12:01

Gilad Green


answered Apr 02 '13 at 19:09

Zelazny7


30

+1 because this is exactly what I was looking for (even though it turns out it's not what the OP wanted) – Doug Paul Nov 22 '13 at 14:45
4

Also can use `df.iloc[np.random.permutation(np.arange(len(df)))]` if there's dupes and stuff (and may be faster for mi). – Andy Hayden Jan 28 '14 at 23:13
3

Nice method. Is there a way to do it in-place though? – Andrew Sep 27 '14 at 15:48
3

For me (Python v3.6 and Pandas v0.20.1) I had to replace `df.reindex(np.random.permutation(df.index))` by `df.set_index(np.random.permutation(df.index))` to get the desired effect. – Emanuel Jun 29 '17 at 16:25
1

after `set_index` like Emanuel, I also needed `df.sort_index(inplace=True)` – Shadi Oct 21 '17 at 10:19
This does not work anymore. Running python 3.6.5, numpy 1.15.0, pandas 0.23.3, the only solution that worked was Andy Hayden's one : `df.iloc[np.random.permutation(np.arange(len(df)))]` – Sindarus Aug 01 '18 at 12:31

score 98 · Answer 2 · edited Aug 23 '21 at 19:06


Sampling randomizes, so just sample the entire data frame.

df.sample(frac=1)

As @Corey Levinson notes, you have to be careful when you reassign:

df['column'] = df['column'].sample(frac=1).reset_index(drop=True)
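A short sketch of both uses, assuming a pandas version that provides `DataFrame.sample`. The `random_state` values are arbitrary seeds chosen only to make the run repeatable, and the names `shuffled`, `renumbered` and `out` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": range(10), "B": range(10)})

# Whole rows move together; the original labels travel with them.
shuffled = df.sample(frac=1, random_state=0)

# Same rows in the same new order, but renumbered 0..n-1.
renumbered = shuffled.reset_index(drop=True)

# Shuffling one column only: without reset_index(drop=True) the assignment
# would realign on the index and put every value back where it started.
out = df.copy()
out["B"] = out["B"].sample(frac=1, random_state=1).reset_index(drop=True)
```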

edited Aug 23 '21 at 19:06

Roelant


answered Mar 03 '16 at 22:51

W.P. McNeill


9

Note if you are trying to reassign a column using this, you have to do `df['column'] = df['column'].sample(frac=1).reset_index(drop=True)` – Corey Levinson Mar 29 '19 at 21:22

root · Accepted Answer · 2013-04-02T19:41:27.820

43

In [16]: def shuffle(df, n=1, axis=0):     
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:     

In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [18]: df = shuffle(df)

In [19]: df
Out[19]: 
   A  B
0  8  5
1  1  7
2  7  3
3  6  2
4  3  4
5  0  1
6  9  0
7  4  6
8  2  8
9  5  9
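The function above relies on `np.random.shuffle` modifying each column in place through `apply`, which may not hold on every pandas version (see the comments below). As a hedged alternative, here is a sketch that shuffles a copy of the underlying array and rebuilds the frame. It assumes NumPy 1.20 or later for `Generator.permuted`, a single common dtype across columns, and a hypothetical extra `seed` parameter that is not part of the original answer.

```python
import numpy as np
import pandas as pd


def shuffle(df, n=1, axis=0, seed=None):
    """Return a copy of df with every column (axis=0) or every row (axis=1)
    shuffled independently; index and column labels stay in place."""
    rng = np.random.default_rng(seed)  # `seed` is a hypothetical extra parameter
    values = df.to_numpy(copy=True)    # assumes one common dtype across columns
    for _ in range(n):
        # permuted() shuffles each 1-D slice along `axis` on its own
        values = rng.permuted(values, axis=axis)
    return pd.DataFrame(values, index=df.index, columns=df.columns)


df = pd.DataFrame({"A": range(10), "B": range(10)})
shuffled = shuffle(df, seed=0)
print(shuffled)
```

With `axis=0` each column is permuted separately, which breaks the pairing between A and B as the question asks; `axis=1` does the same within each row.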

edited Apr 02 '13 at 19:41

answered Apr 02 '13 at 19:10

root


2

How do I distinguish rows from column shuffling here? – Apr 02 '13 at 19:13
Thanks.. I clarified my question which was unclear. I am looking to shuffle by row independently of other rows - so shuffle in such a way that you don't always have `1,5` together and `4,8` together (but also not just a column shuffle which limits you to two choices) – Apr 02 '13 at 19:18
15

**warning** I thought ``df.apply(np.random.permutation)`` would work as the solution ``df.reindex(np.random.permutation(df.index))`` and looked neater, but actually they behave differently. The latter maintains association between columns of the same row, the former doesn't. My misunderstanding, of course, but hopefully it will save other people from the same mistake. – gozzilli Feb 12 '15 at 10:33
1

What is 'np' in this context? – Sledge Mar 07 '17 at 20:43
1

numpy. It's common to do: `import numpy as np` – Aku Mar 30 '17 at 23:40
@root What does "n" in the data stands for? Can we change it to other values, what is the max value? – cincin21 Apr 19 '21 at 07:58
It seems like `n` is "how many times do you want to shuffle?" In that case, shuffling more than once doesn't make much sense (unless you think the rng is suspect). – Teepeemm Apr 20 '21 at 13:39
1

I only wanted to do one shuffle so I just used `df.apply(np.random.shuffle, index=1)` but this doesn't seem to do anything, printing the resulting df looks exactly the same as the input. If I do `df = df.apply( ... )` I get a Series with `Nans.` If I do `df.apply( ... inplace=True)` then I get an error. – Veggiet May 29 '21 at 16:31

score 23 · Answer 4 · answered Aug 11 '16 at 17:40

You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support Pandas data frames):

# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))

# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))

outputs:

df:    A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4


df:    A  B
1  1  1
0  0  0
3  3  3
4  4  4
2  2  2

Then you can use df.reset_index() to reset the index, if need be:

df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df))

outputs:

df:    A  B
0  1  1
1  0  0
2  4  4
3  2  2
4  3  3

FYI, `df.sample(frac=1)` is marginally faster (76.9 vs 78.9 ms for 400k rows). — m-dz, Feb 12 '18 at 11:07

score 10 · Answer 5 · answered Nov 04 '17 at 15:40


A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:

df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df

   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
5  6  6

df.apply(lambda x: x.sample(frac=1).values)

   a  b
0  4  2
1  1  6
2  6  5
3  5  3
4  2  4
5  3  1

You must use .values so that you return a numpy array and not a Series; otherwise the returned Series is aligned back to the original DataFrame's index and nothing changes:

df.apply(lambda x: x.sample(frac=1))

   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5
5  6  6
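The two behaviours can be compared side by side. This is a sketch rather than a definitive recipe; it uses `.to_numpy()`, which should behave like `.values` here, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 2, 3, 4, 5, 6]})

# Returning a Series: apply() aligns the returned Series on their index
# labels, so each value lands back next to its original row label and the
# a/b pairing is unchanged.
aligned = df.apply(lambda col: col.sample(frac=1))

# Returning a bare array: there are no labels to align on, so each column
# keeps its own shuffled order under the original index.
shuffled = df.apply(lambda col: col.sample(frac=1).to_numpy())
```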

answered Nov 04 '17 at 15:40

Ted Petrou


Thanks @Ted Exactly why I came here. Spot on! – trazoM May 22 '23 at 11:52
I shuffled a single column by doing `np.random.shuffle(df['b'].values)` . Take note that [`np.random.shuffle()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.shuffle.html) modifies your dataframe in-place. – trazoM May 22 '23 at 12:14

score 6 · Answer 6 · answered Feb 24 '16 at 19:07

From the docs use sample():

In [79]: s = pd.Series([0,1,2,3,4,5])

# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]: 
0    0
dtype: int64

# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]: 
5    5
2    2
4    4
dtype: int64

# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]: 
5    5
4    4
1    1
dtype: int64
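For the shuffling asked about in the question, `frac=1` samples everything once, and the `axis` argument of `DataFrame.sample` picks rows or columns. A small sketch with a made-up frame and an arbitrary `random_state`; note that this reorders whole rows or whole columns and does not break the pairing between values.

```python
import pandas as pd

df = pd.DataFrame({"A": range(3), "B": range(3, 6), "C": range(6, 9)})

# frac=1 with axis=1 samples every column once, i.e. reorders the columns;
# each column keeps its name and its values.
by_columns = df.sample(frac=1, axis=1, random_state=0)

# The default axis does the same for rows.
by_rows = df.sample(frac=1, random_state=0)
```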

Midnighter · Answer 7 · 2014-02-02T19:01:46.053

I resorted to adapting @root 's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing but it works perfectly for just shuffling the data.

In [1]: import numpy

In [2]: import pandas

In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})    

In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
1000 loops, best of 3: 406 µs per loop

In [5]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 1):
   ...:     numpy.random.shuffle(view)
   ...: 
10000 loops, best of 3: 22.8 µs per loop

In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
1000 loops, best of 3: 746 µs per loop

In [7]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 0):
   ...:     numpy.random.shuffle(view)
   ...: 
10000 loops, best of 3: 23.4 µs per loop

Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions; i.e., if we want to shuffle along the first dimension (columns), we need to roll the second dimension to the front, so that we apply the shuffling to views over the first dimension.

In [8]: numpy.rollaxis(df.values, 0).shape
Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)

In [9]: numpy.rollaxis(df.values, 1).shape
Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)
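The same shapes can be checked on a plain array, along with the fact that the slices are views, so shuffling them changes the array they came from. A sketch assuming NumPy's Generator API; the array `a` stands in for the frame's values and the seed is arbitrary.

```python
import numpy as np

a = np.arange(20).reshape(10, 2)  # 10 rows, 2 columns, like the frame's values
original = a.copy()

print(np.rollaxis(a, 0).shape)  # (10, 2)
print(np.rollaxis(a, 1).shape)  # (2, 10)

rng = np.random.default_rng(0)  # arbitrary seed
for view in np.rollaxis(a, 1):  # each view is one column of `a`
    rng.shuffle(view)           # shuffles in place, so `a` itself changes
```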

Your final function then uses a trick to bring the result in line with the expectation for applying a function to an axis:

def shuffle(df, n=1, axis=0):     
    df = df.copy()
    axis = int(not axis) # pandas.DataFrame is always 2D
    for _ in range(n):
        for view in numpy.rollaxis(df.values, axis):
            numpy.random.shuffle(view)
    return df

score 3 · Answer 8 · answered Aug 14 '14 at 23:48


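This might be more useful when you want your index shuffled.

import random

def shuffle(df):
    index = list(df.index)
    random.shuffle(index)
    df = df.loc[index]  # .ix was removed in pandas 1.0; .loc selects by label
    # reset_index returns a new frame; drop=True discards the old labels
    return df.reset_index(drop=True)

It selects the rows in the shuffled index order, then resets the index.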

answered Aug 14 '14 at 23:48

JeromeZhao


score 2 · Answer 9 · answered Jun 21 '17 at 21:18

I know the question is about a pandas DataFrame, but when the shuffle happens within each row (column order changed, row order unchanged), the column names no longer matter, so it can be worth working with an np.array instead; np.apply_along_axis() is then what you are looking for.

If that is acceptable, the approach below may help. Note that it is easy to switch the axis along which the data is shuffled.

If your pandas DataFrame is named df, you could:

get the values of the dataframe with values = df.values,
create an np.array from values,
apply the method shown below to shuffle the np.array by row or column,
recreate a new (shuffled) pandas df from the shuffled np.array.

Original array

a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
print(a)
[[10 11 12]
 [20 21 22]
 [30 31 32]
 [40 41 42]]

Keep row order, shuffle columns within each row

print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
 [22 21 20]
 [31 30 32]
 [40 41 42]]

Keep column order, shuffle rows within each column

print(np.apply_along_axis(np.random.permutation, 0, a))
[[40 41 32]
 [20 31 42]
 [10 11 12]
 [30 21 22]]

Original array is unchanged

print(a)
[[10 11 12]
 [20 21 22]
 [30 31 32]
 [40 41 42]]
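The four steps listed in this answer can be sketched end to end as follows. The frame, its column names and the seed are made up for illustration, and `rng.permutation` from NumPy's Generator API is used in place of `np.random.permutation`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40],
                   "y": [11, 21, 31, 41],
                   "z": [12, 22, 32, 42]})
rng = np.random.default_rng(0)  # arbitrary seed

values = df.to_numpy()  # steps 1-2: the data as a NumPy array
# step 3: permute within each row (axis 1); use 0 to permute within each column
mixed = np.apply_along_axis(rng.permutation, 1, values)
# step 4: wrap the result in a new frame, reusing the original labels
result = pd.DataFrame(mixed, index=df.index, columns=df.columns)
```

After a within-row shuffle the column names no longer describe the values beneath them, which is the caveat raised at the start of this answer.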

score 0 · Answer 10 · answered Jun 23 '16 at 19:28


Here is a workaround I found if you only want to shuffle a subset of the DataFrame:

shuffle_to_index = 20
df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])
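A self-contained version of the same idea, as a sketch: the frame, the cut-off of 5 rows and the seed are illustrative choices, and `rng.permutation` from NumPy's Generator API stands in for `np.random.permutation`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(8), "B": range(8)})
shuffle_to_index = 5            # shuffle the first 5 rows only (illustrative value)
rng = np.random.default_rng(0)  # arbitrary seed

head = df.iloc[rng.permutation(shuffle_to_index)]  # first rows, random order
tail = df.iloc[shuffle_to_index:]                  # remaining rows, untouched
result = pd.concat([head, tail])
```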

answered Jun 23 '16 at 19:28

ashimashi

