Randomly select rows from DataFrame Pandas

Question

Okay this is somewhat tricky. I have a DataFrame of people and I want to randomly select 27% of them. I want to create a new Boolean column in that DataFrame that shows if that person was randomly selected.

Anyone have any idea how to do this?

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html If you collect the ids from the sample, you can join this back on to see if they were sampled. — rgk, Jul 20 '20 at 20:38
`df.sample(frac=0.27)` or `df['selected'] = np.random.choice([0,1], size=len(df), p=[0.73,0.27])`? — Quang Hoang, Jul 20 '20 at 20:43
Great, the sample function seems like a great way to do it. But how do I create a new column that shows if a person was in the sample? — , Jul 20 '20 at 20:46
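The two one-liners suggested in the comments behave differently, and a small sketch makes that concrete. `df.sample(frac=0.27)` returns a fixed number of rows (27% of the total, rounded), whereas a per-row draw with `p=[0.73, 0.27]` decides each row independently, so the selected share is 27% only on average. The frame `df`, its `person_id` column, and the seeds below are made up for illustration, and the per-row draw uses NumPy's newer `Generator.choice` rather than `np.random.choice`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 people identified by an integer id.
df = pd.DataFrame({"person_id": np.arange(1000)})

# Exact-size sample: 27% of 1,000 rows is always 270 rows.
sampled = df.sample(frac=0.27, random_state=0)
# Mark the people whose id was collected in the sample.
df["in_sample"] = df["person_id"].isin(sampled["person_id"])

# Independent draw per row: each person is picked with probability 0.27,
# so the total varies from seed to seed around 270.
rng = np.random.default_rng(0)
df["coin_flip"] = rng.choice([False, True], size=len(df), p=[0.73, 0.27])

print(df["in_sample"].sum())  # number selected by sample()
print(df["coin_flip"].sum())  # number selected by the per-row draw
```

If exactly 27% of the people must be selected, the `sample` route is the safer choice.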

score 1 · Accepted Answer · answered Jul 20 '20 at 20:48

The built-in DataFrame.sample method takes a frac argument that sets the fraction of rows to include in the sample.

If your DataFrame of people is people_df:

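percent_sampled = 27
# Draw 27% of the rows at random, without replacement
sample_df = people_df.sample(frac=percent_sampled / 100)

# Flag each row whose index label appears in the sample
people_df['is_selected'] = people_df.index.isin(sample_df.index)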
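To see the approach end to end, here is a minimal self-contained sketch. The contents of people_df (a made-up name column with a non-default index) and the random_state seed are assumptions for illustration; random_state is optional and only makes the draw repeatable.

```python
import pandas as pd

# Hypothetical data: 200 people with a non-default integer index.
people_df = pd.DataFrame(
    {"name": [f"person_{i}" for i in range(200)]},
    index=range(1000, 1200),
)

percent_sampled = 27
sample_df = people_df.sample(frac=percent_sampled / 100, random_state=42)

# Index labels must be unique for this membership test
# to flag exactly the sampled rows.
people_df["is_selected"] = people_df.index.isin(sample_df.index)

print(people_df["is_selected"].mean())  # share of rows flagged
```

If the index contains duplicate labels, isin flags every row that shares a sampled label; calling reset_index(drop=True) first avoids that.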

nathan.j.mcdougall

This makes sense, thank you. However, I'm getting an error: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead" – Jul 20 '20 at 20:52
This is a silly warning that `pandas` incessantly throws, for very little good reason. Please read the discussion here: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas I would recommend simply suppressing the warning by using `pd.options.mode.chained_assignment = None` after you import `pandas`. – nathan.j.mcdougall Jul 20 '20 at 20:56
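As an alternative to suppressing the warning, it usually goes away if people_df is made an explicit copy before the column is added. The sketch below assumes the warning arose because people_df was itself a filtered slice of a larger frame; all_df and its age column are hypothetical stand-ins for that situation.

```python
import pandas as pd

# Hypothetical parent frame; people_df is a filtered subset of it.
all_df = pd.DataFrame({"name": list("abcdefghij"), "age": range(15, 25)})

# .copy() makes people_df an independent DataFrame, so adding a column
# to it is no longer an assignment on a slice of all_df.
people_df = all_df[all_df["age"] >= 18].copy()

sample_df = people_df.sample(frac=0.27, random_state=0)
people_df["is_selected"] = people_df.index.isin(sample_df.index)
```

The parent frame all_df is left untouched, and the new column lives only on the copy.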

score 0 · Answer 2 · answered Jul 20 '20 at 20:38

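import numpy as np

n = len(df)
idx = np.arange(n)
np.random.shuffle(idx)  # shuffles in place and returns None, so do not reassign
selected_idx = idx[:int(0.27 * n)]  # first 27% of the shuffled positions
selected_df = df.iloc[selected_idx]  # positional lookup, works for any index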
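The same idea can also produce the Boolean column the question asks for. A minimal sketch, assuming a made-up df and using NumPy's Generator.choice with replace=False to pick row positions instead of shuffling; the column name selected and the seed are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with string labels as the index.
df = pd.DataFrame({"value": range(50)}, index=[f"row{i}" for i in range(50)])

n = len(df)
k = int(0.27 * n)  # number of rows to select, rounded down

rng = np.random.default_rng(seed=1)
selected_pos = rng.choice(n, size=k, replace=False)  # k distinct positions

# Build a Boolean mask by position, so the index labels do not matter.
mask = np.zeros(n, dtype=bool)
mask[selected_pos] = True
df["selected"] = mask

selected_df = df[df["selected"]]
```

Because the mask is positional, this works whether or not the index is the default 0..n-1 range.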

score 0 · Answer 3 · answered Jul 20 '20 at 21:06

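Defining a DataFrame whose column 0 holds the integers 0 to 99 in random order:

import random
import pandas as pd
import numpy as np
# np.random.permutation(100) returns 0..99 shuffled; calling random.shuffle
# on a[0] is unreliable because it may act on a copy of the column
a = pd.DataFrame(np.random.permutation(100))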

Using random.sample to choose 27 values from the list WITHOUT repetition (replace 27 with int(0.27 * len(a[0])) to define the sample size as a percentage):

choices = random.sample(list(a[0]),27)

Using isin to assign Boolean values to a new column in the DataFrame (isin already returns a Boolean Series, so wrapping it in np.where(..., True, False) is unnecessary):

a['Bool'] = a[0].isin(choices)
