821

I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

The DataFrame is read from a CSV file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame's rows so that all Types are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

How can I achieve this?

Innat
JNevens

14 Answers

1496

The idiomatic way to do this with Pandas is to use the .sample method of your data frame to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means to return all rows (in random order).


Note: If you wish to replace your dataframe with its shuffled version and reset the index, you could reassign the result, e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
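As a minimal sketch of the two calls above, assuming a small made-up frame shaped like the one in the question (the data and the random_state value are illustrative, not from the original post):

```python
import pandas as pd

# Illustrative data: rows grouped by Type, as in the question.
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Col2": [2, 5, 8, 11, 14, 17],
    "Col3": [3, 6, 9, 12, 15, 18],
    "Type": [1, 1, 2, 2, 3, 3],
})

# frac=1 samples every row exactly once (no replacement), i.e. a permutation.
# random_state is optional; it is fixed here only to make the run repeatable.
shuffled = df.sample(frac=1, random_state=0)

# The old index labels travel with their rows; drop them to renumber from 0.
renumbered = shuffled.reset_index(drop=True)

print(shuffled)
print(renumbered)
```

Each row stays intact (Col1, Col2, Col3 and Type move together); only the row order changes.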

Follow-up note: Strictly speaking, the above operation is not in-place: sample returns a new object, so id(df_old) is not the same as id(df_new). What a simple memory profiler does show is that, once the reassignment has finished, memory usage is essentially unchanged, because the old frame can be released as soon as df is rebound. Note that this line-by-line output reports memory after each line completes, so it cannot by itself rule out a temporary copy being allocated while the line runs:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
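One way to look at the copy question directly, rather than inferring it from a line-by-line profile, is to ask NumPy whether the data buffers of the two frames overlap. This is only a sketch: it assumes a single-dtype frame (so that to_numpy() can expose the underlying array without building a new one), and the frame size and seeds are arbitrary.

```python
import numpy as np
import pandas as pd

# Single float64 dtype, so to_numpy() can hand out the frame's own data.
df = pd.DataFrame(np.random.default_rng(0).standard_normal((1000, 4)))
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Are the two names bound to the same Python object?
print(shuffled is df)

# Control: two views of the same frame should report overlapping memory.
print(np.shares_memory(df.to_numpy(), df.to_numpy()))

# The actual question: does the shuffled frame reuse the original buffer?
# A result of False means the shuffled frame holds its own copy of the values.
overlap = np.shares_memory(df.to_numpy(), shuffled.to_numpy())
print(overlap)
```

If the last line prints False, the reassignment df = df.sample(frac=1) allocates new data, and the flat profile above is explained by the old frame being freed afterwards.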

Innat
Kris
  • It doesn't. That line just reassigns the `df` object, thereby effectively changing the object in place. See it as a workaround. – Kris Feb 12 '18 at 12:57
  • 7
    Yes, this is exactly what I wanted to show in my first comment, you have to assign the necessary memory twice, which is quite far from doing it in place. – m-dz Feb 12 '18 at 13:13
  • 2
    @m-dz Correct me if I'm wrong, but if you don't do `.copy()` you're still referencing the same underlying object. – Kris Feb 12 '18 at 13:54
  • To the best of my knowledge, not after using `sample()`, which returns a new object. Try `print(hex(id(df)))` and `print(hex(id(df.sample(frac=1).reset_index(drop=True))))`. But I might be wrong, basically wanted to ask for a confirmation or negation here. – m-dz Feb 12 '18 at 14:08
  • 7
    no, it doesn't copy the DataFrame, just look at this line: https://github.com/pandas-dev/pandas/blob/v0.23.0/pandas/core/generic.py#L4198 – minhle_r7 May 20 '18 at 10:26
  • @ngọcminh.oss Unfortunately, what you said is incorrect. It does create a new object. See https://pastebin.com/uNXzW9AU for an example. The resulting ids are different; therefore, a new object is created. – Gigi Bayte 2 Dec 17 '18 at 06:00
  • 3
    @m-dz I ran a memory profiler on it. See "follow-up note" in the updated answer. – Kris Jun 27 '19 at 01:18
  • Can you also generate the same order for shuffle, like `random_state=42`? – PV8 Aug 16 '19 at 13:31
  • 1
    @PV8 Yes you can. – Kris Aug 17 '19 at 23:48
  • 1
    @Kris, a bit late, but thanks for the clarification! Pandas memory management is sometimes a bit of magic. (Edit: I've deleted a few irrelevant comments from the beginning of this conversation, thanks again!) – m-dz Mar 03 '20 at 16:33
  • How to use the same methodology but only applying it on one column of a dataframe? Can you help me with this question? https://stackoverflow.com/questions/60687700/how-to-randomly-shuffle-a-populaiton-by-preserving-all-properites-except-one/60687757#60687757 – Rebel Mar 15 '20 at 03:32
  • How about `df['col'] = df['col'].sample(frac=1).values` ? – Kris Mar 16 '20 at 02:20
  • What would the value of frac other than 1 mean? – Hasham Beyg Jun 29 '20 at 20:51
  • It would mean a smaller (or larger, if sampling with replacement) fraction of rows, see [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) – Kris Jun 30 '20 at 03:04
  • I am running into memory issues with Pandas 1.2.4 for the shuffle operation. From what I can tell, as of Pandas 1, the sample operation copies data due to `take` deprecating is_copy. See https://github.com/pandas-dev/pandas/blame/master/pandas/core/generic.py#L5353 and https://github.com/pandas-dev/pandas/pull/30615 . Any other recommendations on how to shuffle in place? – user3608214 Jun 16 '21 at 14:32
  • @user3608214 Sorry for the late reply. I'm unable to repro your issue. I ran the memory profiler and I didn't see an increase in memory consumption by the `df = df.sample(...)` line. – Kris Oct 11 '21 at 11:18
  • @Kris. Can I shuffle the data such that no two columns values are in the same row after shuffling please? – Avv Dec 28 '21 at 19:46
  • I don't think this code proves a copy isnt made. If you run `mprof` you will see the memory usage is as much as 3x the size of the dataframe. The above memory_profiler output just says that line incremented the memory that much when the function returned (it could have allocated lots during the function) Run the below and you do not see a 1GB increment (sorry about formatting) ```python import numpy as np def big_memory_user(): arr = np.random.rand(int(1e9)) return "small string with minimal memory" @profile def main(): foo = big_memory_user() main() ``` – Casey May 04 '22 at 15:12
358

You can simply use sklearn for this

from sklearn.utils import shuffle
df = shuffle(df)
Innat
tj89
  • 48
    This is nice, but you may need to reset your indexes after shuffling: df.reset_index(inplace=True, drop=True) – cemsazara Jun 17 '19 at 20:41
  • in what case would you need to reset your indexes? Would resetting the index trigger a copy/memory allocation? – gg99 Jul 28 '23 at 14:12
83

You can shuffle the rows of a data frame by indexing with a shuffled index. For this, you can, for example, use np.random.permutation (np.random.choice with replace=False is also a possibility):

In [12]: df = pd.read_csv(StringIO(s), sep=r"\s+")

In [13]: df
Out[13]: 
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]: 
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2

If you want to keep the index numbered 0, 1, ..., n-1 as in your example, you can simply reset the index of the shuffled result: df_shuffled.reset_index(drop=True)
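Put together as a script, the approach could look like the sketch below. The data is made up, and it uses NumPy's seeded Generator (np.random.default_rng) instead of the global np.random.permutation purely so that the run is repeatable; the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Col1": [1, 4, 7, 10, 13, 16], "Type": [1, 1, 2, 2, 3, 3]},
    index=[0, 1, 20, 21, 45, 46],
)

rng = np.random.default_rng(seed=12345)  # arbitrary seed
order = rng.permutation(len(df))         # a random ordering of positions 0..n-1

df_shuffled = df.iloc[order]                        # rows reordered, labels kept
df_renumbered = df_shuffled.reset_index(drop=True)  # labels replaced by 0..n-1

print(df_shuffled)
print(df_renumbered)
```

Because the rows are selected by position with iloc, this also works when the index labels are not unique.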

Innat
joris
64

TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case

np.random.shuffle(DataFrame.values)

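A DataFrame, under the hood, uses NumPy ndarrays to hold its data (you can check this in the DataFrame source code). When all columns share one dtype, .values can expose that array directly; with mixed dtypes it returns a newly built array, and shuffling that array has no effect on the DataFrame.

np.random.shuffle() shuffles an array in place along its first axis, so the rows are reordered while the index of the DataFrame remains unshuffled.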

Though, there are some points to consider.

  • The function returns None. If you want to keep a copy of the original object, you have to make it before you pass the data to the function.
  • sklearn.utils.shuffle(), as user tj89 suggested, accepts random_state along with other options to control the output. You may want that for development purposes.
  • sklearn.utils.shuffle() is faster, but it WILL SHUFFLE the axis info (the index labels) of the DataFrame along with the ndarray it contains.
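One more point worth checking before relying on np.random.shuffle(df.values): the call only affects the DataFrame when .values is a writable view of the frame's own data. The sketch below uses made-up data with two column dtypes to illustrate the case where .values is a separate array. It uses a seeded NumPy Generator rather than np.random.shuffle only to keep the seed local. As far as I know, recent pandas versions with Copy-on-Write enabled may also hand out a read-only array for single-dtype frames, in which case an in-place shuffle raises an error instead.

```python
import numpy as np
import pandas as pd

# Two dtypes (int64 and float64), so the frame is not one homogeneous block.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [0.1, 0.2, 0.3, 0.4, 0.5]})
before = df.copy()

values = df.values                        # a newly built float64 array here
np.random.default_rng(0).shuffle(values)  # shuffles that array's rows in place

# Compare the frame with the snapshot taken before the shuffle.
print(df.equals(before))
```

If the frame is unchanged, the shuffle only touched the temporary array, which is why approaches that return a new frame (df.sample, df.iloc with a permutation) are the safer default.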

Benchmark result

between sklearn.utils.shuffle() and np.random.shuffle().

ndarray

nd = sklearn.utils.shuffle(nd)

0.10793248389381915 sec. 8x faster

np.random.shuffle(nd)

0.8897626010002568 sec

DataFrame

df = sklearn.utils.shuffle(df)

0.3183923360193148 sec. 3x faster

np.random.shuffle(df.values)

0.9357550159329548 sec

Conclusion: If it is okay for the axis info (the index labels) to be shuffled along with the ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().

Code used:

import timeit

# Shared setup: a 1000x100 float array and a DataFrame built from it.
setup = '''
import numpy as np
import pandas as pd
import sklearn.utils
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''

# Each statement runs 1000 times; timeit returns the total time in seconds.
print(timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000))
print(timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000))
print(timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000))
print(timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000))

Innat
h2ku
  • 6
    Doesn't `df = df.sample(frac=1)` do the exact same thing as `df = sklearn.utils.shuffle(df)`? According to my measurements `df = df.sample(frac=1)` is faster and seems to perform the exact same action. They also both allocate new memory. `np.random.shuffle(df.values)` is the slowest, but does not allocate new memory. – lo tolmencre Feb 10 '19 at 09:48
  • 2
    In terms of shuffling the axis along with the data, it's seems like it can do the same. And yes, it seems like `df.sample(frac=1)` is about 20% faster than `sklearn.utils.shuffle(df)`, using the same code above. Or you could do `sklearn.utils.shuffle(ndarray)` to get different result. – h2ku Apr 23 '19 at 05:53
  • 1
    ...and it's really not okay for the index to be shuffled, as it can lead to hard-to-trace problems with some functions that either reset the index or rely on assumptions about the max index on the basis of the row count. This happened to me, for instance, with `h2o_model.predict()`, which resets the index on the returned predictions Frame. – mirekphd Mar 24 '21 at 18:08
26

(I don't have enough reputation to comment on the top post, so I hope someone else can do that for me.) A concern was raised about whether the first method:

df.sample(frac=1)

makes a deep copy or just changes the dataframe. I ran the following code:

print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))

and my results were:

0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70

which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed return a new, shuffled DataFrame object (the ids alone do not show whether the underlying data was copied).

Innat
NotANumber
  • 3
    Please have a look at the **Follow-up note** of the original answer. There you'll see that even though the references have changed (different `id`s), the underlying object is *not* copied. In other words, the operation is effectively in-memory (although admittedly it's not obvious). – Kris Aug 17 '19 at 23:56
  • I would expect that the underlying ndarray is the same but the iterator is different (and random) hence minimal change in the memory consumption although a change in the elements' order. – sophros Jul 03 '20 at 14:03
26

The following is one way to do it:

dataframe = dataframe.sample(frac=1, random_state=42).reset_index(drop=True)

where

frac=1 means all rows of a data frame

random_state=42 means keeping the same order in each execution

reset_index(drop=True) means reinitialize index for randomized dataframe
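A short sketch of what the fixed random_state buys you, using a made-up frame (the column names and the value 42 are only illustrative): two calls with the same seed give the same order, while an unseeded call is free to differ from run to run.

```python
import pandas as pd

dataframe = pd.DataFrame({
    "x": range(10),
    "Type": [1] * 4 + [2] * 3 + [3] * 3,
})

# Same seed, same permutation.
first = dataframe.sample(frac=1, random_state=42).reset_index(drop=True)
second = dataframe.sample(frac=1, random_state=42).reset_index(drop=True)

# No seed: the order may change on every execution.
unseeded = dataframe.sample(frac=1).reset_index(drop=True)

print(first.equals(second))
print(unseeded)
```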

Innat
Anshul Singhal
21

This is also useful if you use it for machine learning and always want to separate the same data. In that case you could use:

df.sample(n=len(df), random_state=42)

This makes sure that your random choice is always reproducible.

Innat
PV8
6

Here is another way to do this:

df_shuffled = df.reindex(np.random.permutation(df.index))
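A small sketch of this label-based variant, with made-up data. It permutes a NumPy copy of the labels (via a seeded Generator, an arbitrary choice here) rather than the Index object itself. It also illustrates one limitation to be aware of: because reindex looks rows up by label, it raises when the index contains duplicate labels and the requested order differs from the current one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, 4, 7, 10]}, index=[0, 1, 20, 21])
rng = np.random.default_rng(0)  # arbitrary seed

# Permute a copy of the labels, then look the rows up by label.
df_shuffled = df.reindex(rng.permutation(df.index.to_numpy()))
print(df_shuffled)

# With repeated labels, reindex cannot tell the rows apart.
dup = pd.DataFrame({"Col1": [1, 4, 7]}, index=[0, 0, 1])
try:
    dup.reindex([1, 0, 0])
    reindex_failed = False
except ValueError:
    reindex_failed = True
print(reindex_failed)
```

For frames whose index may contain duplicates, a position-based shuffle such as df.iloc with a permutation, or df.sample(frac=1), avoids the problem.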
Innat
Ido Cohn
  • 3
    Please notice that this changes the indices in the original df, as well as producing a copy, which you are saving into df_shuffled. But, more worryingly, anything that does not depend on the index, for example `df_shuffled.iterrows()`, will produce exactly the same order as df. In summary, use with caution! – Jblasco Sep 10 '18 at 14:49
  • @Jblasco This is incorrect, the original df is **not** changed at all. Documentation of [`np.random.permutation`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.permutation.html): "...If x is an array, make a **copy** and shuffle the elements randomly". Documentation of [`DataFrame.reindex`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html): "A **new object** is produced unless the new index is equivalent to the current one and copy=False". So the answer is perfectly safe (albeit producing a copy). – Andreas Schörgenhumer Oct 12 '18 at 08:12
  • 3
    @AndreasSchörgenhumer, thank you for pointing this out, you are partially right! I knew I had tried it, so I did some testing. Despite what the documentation of `np.random.permutation says`, and depending on versions of numpy, you get the effect I described or the one you mention. With numpy > 1.15.0, creating a dataframe and doing a plain `np.random.permutation(df.index)`, the indices in the original df change. The same is not true for numpy == 1.14.6. So, more than ever, I repeat my warning: that way of doing things is dangerous because of unforeseen side effects and version dependencies. – Jblasco Oct 12 '18 at 09:04
  • @Jblasco You are right, thank you for the details. I was running numpy 1.14, so everything worked just fine. With numpy 1.15 there seems to be a [bug](https://github.com/pandas-dev/pandas/issues/23058) somewhere. In the light of this bug, your warnings are currently indeed correct. However, as it is a _bug_ and the documentation states other behavior, I still stick to my previous statement that the answer is safe (given that the documentation does reflect the actual behavior, which we should normally be able to rely on). – Andreas Schörgenhumer Oct 12 '18 at 10:52
  • @AndreasSchörgenhumer, not quite sure if it's a bug or a feature, to be honest. Documentation guarantees a copy of an array, not a `Index` type... In any case, I base my recommendations/warnings on actual behaviour, not on the docs :p – Jblasco Oct 12 '18 at 11:20
3

Shuffle the pandas data frame by taking an array of positions (here, the index), randomizing its order, and setting that array as the index of the data frame. Then sort the data frame by index. The result is your shuffled dataframe:

import random
import pandas as pd

df = pd.DataFrame({"a":[1,2,3,4],"b":[5,6,7,8]})
index = list(range(len(df)))        # positions 0..n-1
random.shuffle(index)               # randomize their order in place
df.set_index([index]).sort_index()  # assign as labels, then sort by label

Output (one possible result):

    a   b
0   2   6
1   1   5
2   3   7
3   4   8

Insert your data frame in place of mine in the above code.

  • I prefer this method as it means the shuffle can be repeated if I need to reproduce my algorithm output exactly, by storing the randomised index to a variable. – rayzinnz Aug 10 '19 at 18:47
2

Without numpy/sklearn :) and in case you want to shuffle all values but keep the row and column names in place.

df_c = df.copy()
df_c.iloc[:,:] = df_c.sample(frac=1,random_state=123,ignore_index=True)
Sahar Millis
1

Here is another way:

df['rnd'] = np.random.rand(len(df))
df = df.sort_values(by='rnd').drop('rnd', axis=1)
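Note that sort_values(..., inplace=True) returns None, so it cannot be chained with .drop; the sort has to return a new frame for the chain to work. A variant of the same idea that never adds the helper column to the original frame could look like this sketch (made-up data; the helper column name _rnd and the seed are hypothetical choices):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, 4, 7, 10, 13], "Type": [1, 1, 2, 2, 3]})
rng = np.random.default_rng(7)  # arbitrary seed

shuffled = (
    df.assign(_rnd=rng.random(len(df)))  # temporary column of uniform draws
      .sort_values("_rnd")               # ordering by the draws permutes the rows
      .drop(columns="_rnd")              # remove the helper column again
)

print(shuffled)
```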
Innat
soulmachine
1

Shuffle the DataFrame using sample() by passing the frac parameter. Save the shuffled DataFrame to a new variable.

new_variable = DataFrame.sample(frac=1)
Innat
Ayaz Lakho
0

I propose this:

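for x in df.columns:
    # Re-seeding before each column gives every column the same permutation,
    # so the values of a row stay together.
    np.random.seed(42)
    np.random.shuffle(df[x].values)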

In my test with a column of arbitrary-length strings (dtype: object), it was 30x faster than @haku's answer, presumably because it avoids creating a copy, which may be expensive.

My variant was about 3x faster than the accepted answer by @Kris, which also does not seem to avoid a copy (based on the RES column in Linux top).

Valentas
0

Since Pandas 1.3 you have ignore_index=True, which can be more efficient than later resetting the index:

df = df.sample(frac=1, ignore_index=True)
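A quick sketch comparing the one-step and two-step forms on made-up data (the seed is arbitrary, and ignore_index requires pandas 1.3 or later); with the same random_state both should yield the same rows under a fresh 0..n-1 index:

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 4, 7, 10]}, index=[0, 1, 20, 21])

# One step: shuffle and renumber in the same call.
one_step = df.sample(frac=1, random_state=1, ignore_index=True)

# Two steps: shuffle, then discard the old labels.
two_step = df.sample(frac=1, random_state=1).reset_index(drop=True)

print(one_step)
print(one_step.equals(two_step))
```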
Amit Portnoy