Random row selection in Pandas dataframe

Question

Is there a way to select random rows from a DataFrame in Pandas?

In R, using the car package, there is a useful function some(x, n) which is similar to head but selects n rows (for example, 10) at random from x.

I have also looked at the slicing documentation and there seems to be nothing equivalent.

Update

I am now using pandas 0.20, which has a sample method:

df.sample(n)

If you are looking to sample where the size is greater than the original, use `df.sample(N, replace=True)`. More details [here](https://stackoverflow.com/questions/54052386/sampling-rows-with-sample-size-greater-than-length-of-dataframe/54052396#54052396). — cs95, Jan 05 '19 at 14:40
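A minimal sketch of the oversampling case mentioned in the comment above. The frame, its `value` column, and the sizes are made up for illustration and do not come from the question:

```python
import pandas as pd

# Hypothetical 5-row frame, used only for illustration
df = pd.DataFrame({"value": range(5)})

# Asking for more rows than exist only works with replacement
oversampled = df.sample(n=8, replace=True, random_state=0)

# Without replace=True, pandas refuses and raises a ValueError
try:
    df.sample(n=8)
    raised = False
except ValueError:
    raised = True
```

With `replace=True` some rows necessarily appear more than once, so the result has repeated index labels.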

ryanjdillon · Answer 1 · 2022-05-05T07:26:31.217

394

With pandas version 0.16.1 and up, there is now a DataFrame.sample method built-in:

import numpy as np
import pandas

# pandas.np is no longer available in recent pandas, so use numpy directly
df = pandas.DataFrame(np.random.random(100))

# Randomly sample 70% of your dataframe
df_percent = df.sample(frac=0.7)

# Randomly sample 7 elements from your dataframe
df_elements = df.sample(n=7)

For either approach above, you can get the rest of the rows by doing:

df_rest = df.loc[~df.index.isin(df_percent.index)]

Per Pedram's comment, if you would like to get reproducible samples, pass the random_state parameter.

df_percent = df.sample(frac=0.7, random_state=42)
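Putting the pieces of this answer together, here is a small sketch of a reproducible 70/30 split. The column name `x` and the use of `df.drop` as an alternative way to get the remainder are assumptions added for illustration; both ways of computing the remainder assume the index labels are unique:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 random values in a column named "x"
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(100)})

# 70% sample, with a fixed seed so the split is repeatable
df_percent = df.sample(frac=0.7, random_state=42)

# Everything not in the sample (relies on unique index labels)
df_rest = df.loc[~df.index.isin(df_percent.index)]

# Alternative spelling: drop the sampled labels
df_rest_alt = df.drop(df_percent.index)
```

The two remainders contain the same rows in the original order; which spelling reads better is a matter of taste.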

edited May 05 '22 at 07:26

answered Sep 16 '15 at 10:57

ryanjdillon


1

``df_0.7`` is not a valid name. Moreover, I suggest replacing ``df_rest = df.loc[~df.index.isin(df_0_7.index)]`` with ``df_rest = df.loc[df.index.difference(df_0_7.index)]``. – Pietro Battiston May 01 '18 at 15:24
@PietroBattiston Thanks. I was attempting to make the answer clearer, but I agree a non-working example is not clear. Nice with the tip on difference. Though, I still prefer writing the slicing so that I read it as indices "not in the index of my sample". Is there a performance increase with `difference()`? – ryanjdillon May 04 '18 at 07:50
1

@ryanjdillon there was a remaining typo, I fixed it. Concerning the method, I'm actually taking back my suggestion, as indeed it's a bit less efficient. ``df_percent.index.get_indexer(df.index) == -1`` is far more efficient instead (but also more ugly)... – Pietro Battiston May 05 '18 at 08:59
2

Nice answer. Also for folks to ensure reproducibility in your code, you might want to add `random_state` to `sample`. E.g., `df_percent = df.sample(frac=0.7, random_state=42)` – Pedram May 04 '22 at 20:12

score 71 · Accepted Answer · edited Apr 28 '19 at 12:08

Something like this?

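import random

def some(x, n):
    # random.sample needs a sequence, so convert the index to a list
    return x.loc[random.sample(list(x.index), n)]

Note: As of Pandas v0.20.0, ix has been deprecated in favour of loc for label based indexing (it was removed in v1.0.0), so loc is used above.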

edited Apr 28 '19 at 12:08

jpp


answered Apr 10 '13 at 10:55

eumiro


8

Thanks @eumiro. I also worked out that `df.ix[np.random.random_integers(0, len(df), 10)]` would also work. – John Apr 10 '13 at 10:58
7

If you want to use numpy, then you can also do `df.ix[np.random.choice(df.index, 10)]`. – naught101 Feb 17 '14 at 02:53
7

Someone in another post mentioned that `np.random.choice` is twice as fast as `random.sample` – Phani Jul 07 '14 at 19:00
6

If you use np.random.choice you have to specify replace=False, otherwise you'll get duplicate rows! – stmax Aug 10 '15 at 12:39
2

I think ".ix" is deprecated, and you should use .loc for label based indexing – compguy24 Feb 27 '19 at 17:04
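The NumPy suggestions in these comments can be sketched as follows. This is one possible variant, not the answer's own code: the `seed` parameter and the sample frame are hypothetical, and it uses NumPy's `default_rng` generator with positional indexing (`iloc`) rather than `.ix`:

```python
import numpy as np
import pandas as pd

def some(x, n, seed=None):
    """Return n distinct random rows of x (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # replace=False keeps the chosen positions distinct (no duplicate rows)
    positions = rng.choice(len(x), size=n, replace=False)
    return x.iloc[positions]

# Made-up frame with string labels, to show that positions are used
df = pd.DataFrame({"value": range(20)}, index=list("abcdefghijklmnopqrst"))
picked = some(df, 10, seed=1)
```

Because rows are chosen by position, this also works when the index contains repeated labels.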

score 50 · Answer 3 · edited Dec 30 '18 at 10:31

`sample`

As of v0.20.0, you can use pd.DataFrame.sample, which returns a random sample of either a fixed number of rows or a fraction of the rows:

df = df.sample(n=k)     # k rows
df = df.sample(frac=k)  # roughly len(df.index) * k rows (k is a fraction)

For reproducibility, you can specify an integer random_state, which seeds the sampling much like np.random.seed does. So, instead of calling, for example, np.random.seed(0), you can:

df = df.sample(n=k, random_state=0)
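A quick way to check the reproducibility claim, as a sketch on a made-up 50-row frame (the column name `value` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"value": range(50)})  # hypothetical data

# Two calls with the same integer seed
first = df.sample(n=5, random_state=0)
second = df.sample(n=5, random_state=0)

# Same seed, same rows in the same order
same = first.equals(second)
```

Here `same` should be `True`; omitting `random_state` would normally give a different selection on each call.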

score 12 · Answer 4 · answered Aug 23 '13 at 18:17

The best way to do this is with the sample function from the random module:

import numpy as np
import pandas as pd
from random import sample

# given data frame df

# create random index
rindex = np.array(sample(range(len(df)), 10))  # range replaces Python 2's xrange

# get 10 random rows from df (iloc, because rindex holds positions; ix was removed)
dfr = df.iloc[rindex]

score 5 · Answer 5 · edited Mar 17 '22 at 10:03

The line below randomly selects n rows from the DataFrame df, without replacement.

df = df.take(np.random.permutation(len(df))[:n])
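A self-contained version of the same idea, as a sketch. The frame and the value of `n` are made up, and the result is stored in a new name (`subset`) instead of overwriting `df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": range(10)})  # hypothetical data
n = 4

# Shuffle all row positions, then keep the first n of them
positions = np.random.permutation(len(df))[:n]

# take selects rows by position, so no row can appear twice
subset = df.take(positions)
```

Note that this shuffles every position even when `n` is small, which may matter for very large frames.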

edited Mar 17 '22 at 10:03

vvvvv


answered Jun 29 '17 at 17:08

Mojgan Mazouchi


score 4 · Answer 6 · answered Jul 31 '13 at 23:07

Actually, np.random.random_integers(0, len(df), N) will give you repeated indices when N is a large number.
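The point can be illustrated with a short sketch. Since `np.random.random_integers` is deprecated in NumPy, this example swaps in the generator method `integers` instead; the population size and `N` are made-up values. Drawing more integers than there are rows must produce repeats:

```python
import numpy as np

population = 10   # stands in for len(df)
N = 25            # more draws than rows

rng = np.random.default_rng(0)
# Independent draws from 0..population-1 (upper bound is exclusive)
drawn = rng.integers(0, population, size=N)

# Fewer distinct values than draws means some index was repeated
n_unique = len(np.unique(drawn))
has_repeats = n_unique < N
```

Repeats are guaranteed once `N` exceeds the number of rows, and they become likely well before that.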

answered Jul 31 '13 at 23:07

rlmlr

