277

Is there a way to select random rows from a DataFrame in Pandas?

In R, the car package has a useful function some(x, n), which is similar to head but selects n rows (for example, 10) at random from x.

I have also looked at the slicing documentation and there seems to be nothing equivalent.

Update

Now using version 0.20. There is a sample method.

df.sample(n)
vvvvv
John
  • 3
    If you are looking to sample where the size is greater than the original, use `df.sample(N, replace=True)`. More details [here](https://stackoverflow.com/questions/54052386/sampling-rows-with-sample-size-greater-than-length-of-dataframe/54052396#54052396). – cs95 Jan 05 '19 at 14:40
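A minimal sketch of the oversampling case from this comment, assuming a toy DataFrame (the column name `x` and the sizes are made up for illustration):

```python
import pandas as pd

# Toy frame with 5 rows; the column name "x" is purely illustrative.
df = pd.DataFrame({"x": range(5)})

# Asking for more rows than exist only works with replace=True,
# so some rows necessarily appear more than once.
oversampled = df.sample(n=12, replace=True, random_state=0)

print(len(oversampled))                  # 12
print(oversampled.index.has_duplicates)  # True
```

Without `replace=True`, the same call raises a `ValueError`, because the requested sample is larger than the population.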

6 Answers

394

With pandas version 0.16.1 and up, there is now a DataFrame.sample method built-in:

import numpy as np
import pandas

# pandas.np was removed in pandas 2.0, so import numpy directly
df = pandas.DataFrame(np.random.random(100))

# Randomly sample 70% of your dataframe
df_percent = df.sample(frac=0.7)

# Randomly sample 7 elements from your dataframe
df_elements = df.sample(n=7)

For either approach above, you can get the rest of the rows by doing:

df_rest = df.loc[~df.index.isin(df_percent.index)]

Per Pedram's comment, if you would like to get reproducible samples, pass the random_state parameter.

df_percent = df.sample(frac=0.7, random_state=42)
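Putting the pieces of this answer together, here is a small self-contained sketch of a reproducible 70/30 split. The column name `value` is an assumption for illustration, and `drop` is used here as one alternative way to get the remaining rows; it assumes the index labels are unique.

```python
import numpy as np
import pandas as pd

# Toy frame with 100 rows and a default 0..99 index.
df = pd.DataFrame({"value": np.arange(100)})

# 70% sample; random_state makes the draw repeatable.
df_percent = df.sample(frac=0.7, random_state=42)

# Everything that was not sampled (assumes unique index labels).
df_rest = df.drop(df_percent.index)

print(len(df_percent), len(df_rest))  # 70 30
```

The two pieces are disjoint and together cover the original frame, which is the usual shape of a train/test split.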
ryanjdillon
  • 1
    ``df_0.7`` is not a valid name. Moreover, I suggest replacing ``df_rest = df.loc[~df.index.isin(df_0_7.index)]`` with ``df_rest = df.loc[df.index.difference(df_0_7.index)]``. – Pietro Battiston May 01 '18 at 15:24
  • @PietroBattiston Thanks. I was attempting to make the answer clearer, but I agree a non-working example is not clear. Nice with the tip on difference. Though, I still prefer writing the slicing so that I read it as indices "not in the index of my sample". Is there a performance increase with `difference()`? – ryanjdillon May 04 '18 at 07:50
  • 1
    @ryanjdillon there was a remaining typo, I fixed it. Concerning the method, I'm actually taking back my suggestion, as indeed it's a bit less efficient. ``df_percent.index.get_indexer(df.index) == -1`` is far more efficient instead (but also more ugly)... – Pietro Battiston May 05 '18 at 08:59
  • 2
    Nice answer. Also for folks to ensure reproducibility in your code, you might want to add `random_state` to `sample`. E.g., `df_percent = df.sample(frac=0.7, random_state=42)` – Pedram May 04 '22 at 20:12
71

Something like this?

import random

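def some(x, n):
    # random.sample needs a real sequence, so convert the index to a list
    return x.loc[random.sample(list(x.index), n)]

Note: As of Pandas v0.20.0, ix has been deprecated in favour of loc for label-based indexing, and it was removed in v1.0, so the function uses loc.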
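A variant of the helper that samples row positions with NumPy instead of the random module. This is only a sketch: the `seed` argument is a hypothetical addition for repeatability, and `replace=False` is what prevents the same row from being picked twice.

```python
import numpy as np
import pandas as pd

def some(x, n, seed=None):
    """Return n distinct rows of x chosen at random (by position)."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no position is drawn twice.
    positions = rng.choice(len(x), size=n, replace=False)
    return x.iloc[positions]

# Illustrative frame with a non-integer index.
df = pd.DataFrame({"a": range(20)}, index=list("abcdefghijklmnopqrst"))
picked = some(df, 5, seed=1)
print(len(picked), picked.index.is_unique)  # 5 True
```

Sampling positions and using `iloc` sidesteps any ambiguity when the index labels are not unique or not integers.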

jpp
eumiro
  • 8
    Thanks @eumiro. I also worked out that `df.ix[np.random.random_integers(0, len(df), 10)]` would also work. – John Apr 10 '13 at 10:58
  • 7
    If you want to use numpy, then you can also do `df.ix[np.random.choice(df.index, 10)]`. – naught101 Feb 17 '14 at 02:53
  • 7
    Someone in an other post mentioned that `np.random.choice` is twice as fast as `random.sample` – Phani Jul 07 '14 at 19:00
  • 6
    If you use np.random.choice you have to specify replace=False, otherwise you'll get duplicate rows! – stmax Aug 10 '15 at 12:39
  • 2
    I think ".ix" is deprecated, and you should use .loc for label based indexing – compguy24 Feb 27 '19 at 17:04
50

sample

As of v0.20.0, you can use pd.DataFrame.sample, which returns a random sample of either a fixed number of rows or a fraction of the rows:

df = df.sample(n=k)     # k rows
df = df.sample(frac=k)  # round(len(df.index) * k) rows

For reproducibility, you can specify an integer random_state, which has an effect similar to seeding NumPy with np.random.seed. So, instead of calling, for example, np.random.seed(0), you can:

df = df.sample(n=k, random_state=0)
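A quick way to convince yourself that random_state pins the result; a sketch on a made-up frame (the column name `x` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": range(50)})

first = df.sample(n=5, random_state=0)
second = df.sample(n=5, random_state=0)

# Same seed, same rows in the same order.
print(first.equals(second))  # True
```

Unlike seeding NumPy globally, passing random_state only affects this one call, so other code that draws random numbers is left alone.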
typhon04
jpp
12

One way to do this is with the sample function from the random module:

import numpy as np
import pandas as pd
from random import sample

# given data frame df

# create 10 distinct random row positions (range replaces Python 2's xrange)
rindex = np.array(sample(range(len(df)), 10))

# get 10 random rows from df by position (ix was removed from pandas)
dfr = df.iloc[rindex]
rlmlr
5

The line below randomly selects n rows from the DataFrame df, without replacement.

df = df.take(np.random.permutation(len(df))[:n])
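The same idea as a runnable sketch, using NumPy's Generator API so the draw can be seeded. The frame, the seed, and the name `subset` are assumptions for illustration; the result is stored under a new name so the original df is kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})
n = 4

rng = np.random.default_rng(0)
# permutation(len(df)) shuffles the positions 0..len(df)-1;
# the first n of them form a sample without replacement.
positions = rng.permutation(len(df))[:n]
subset = df.take(positions)

print(len(subset), subset.index.is_unique)  # 4 True
```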
vvvvv
Mojgan Mazouchi
4

Actually, np.random.random_integers(0, len(df), N) will give you repeated indices when N is a large number, because it draws with replacement.
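A small sketch of the difference, using the Generator API rather than random_integers. The sizes are made up: `population` stands in for len(df), and N is deliberately larger so that repeats are unavoidable.

```python
import numpy as np

rng = np.random.default_rng(0)
population = 10   # stand-in for len(df)
N = 25            # more draws than there are rows

# Drawing integers independently is sampling with replacement.
with_repl = rng.integers(0, population, size=N)

# choice(..., replace=False) never repeats, but cannot exceed the population.
without_repl = rng.choice(population, size=population, replace=False)

print(len(set(with_repl.tolist())) < N)                # True
print(len(set(without_repl.tolist())) == population)   # True
```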

rlmlr