I need to select a fixed percentage of records from my dataframe for analysis. For example, from a dataframe with 100 records I need to randomly select exactly 33 records (33%). I tried random.randint, but it returns only approximately 33% of the records, not exactly 33%. Below is my code:

DF_1['ran'] = [random.randint(0,99)  for k in DF_1.index]

DF_2=DF_1[DF_1['ran']<33] 

Is there another function that returns an exact percentage of records from a dataframe? Thank you in advance. Alex

jpp
Alexsander

1 Answer

Calling randint in a list comprehension draws each row's value independently, so values can repeat and the number of rows that fall below 33 varies from run to run; it is not guaranteed to be exactly 33.
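To make the variation concrete, here is a minimal sketch that repeats the approach from the question with several seeds. The 100-row DataFrame and the helper name randint_filter_count are made up for illustration; they are not from the original post.

```python
import random

import pandas as pd

# Hypothetical 100-row stand-in for DF_1 (not the asker's real data).
DF_1 = pd.DataFrame({"value": range(100)})


def randint_filter_count(df, seed):
    """Repeat the randint approach from the question; return how many rows survive."""
    rng = random.Random(seed)  # seeded only so the demo is repeatable
    ran = [rng.randint(0, 99) for _ in df.index]
    return sum(r < 33 for r in ran)


# One count per seed; each is roughly 33, but they differ from each other.
counts = [randint_filter_count(DF_1, seed) for seed in range(20)]
print(counts)
```

Each row passes the filter with probability 0.33 independently of the others, so the total behaves like a coin-flip count rather than a fixed number.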

With the random module, you can use random.sample, which gives a sample without replacement:

from random import sample

num = int(len(DF_1.index) * 0.33)  # e.g. for 33%
indices = sample(list(DF_1.index), k=num)  # list(): random.sample expects a sequence
DF_2 = DF_1.loc[indices].copy()

With NumPy, you can use np.random.choice, specifying replace=False:

import numpy as np

indices = np.random.choice(DF_1.index, size=num, replace=False)
DF_2 = DF_1.loc[indices].copy()

Most idiomatic is to use pd.DataFrame.sample:

DF_2 = DF_1.sample(n=num)     # absolute number of rows
DF_2 = DF_1.sample(frac=1/3)  # or a fraction (row count is rounded to a whole number)
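Putting the last option together, a self-contained sketch might look like the following. The 100-row DataFrame, the helper name sample_exact_percent, and the seed value are assumptions for illustration only; the row count is computed with integer arithmetic so the result is exact for whole-number percentages.

```python
import pandas as pd

# Hypothetical 100-row stand-in for DF_1 (not the asker's real data).
DF_1 = pd.DataFrame({"value": range(100)})


def sample_exact_percent(df, percent, seed=None):
    """Return exactly floor(len(df) * percent / 100) rows, drawn without replacement."""
    n = len(df) * percent // 100  # integer arithmetic avoids float rounding surprises
    return df.sample(n=n, random_state=seed)


DF_2 = sample_exact_percent(DF_1, 33, seed=0)  # seed is optional, for repeatable draws
rest = DF_1.drop(DF_2.index)                   # the rows that were not selected

print(len(DF_2), len(rest))  # 33 67
```

Passing random_state is only needed if the same sample should come back on every run; leave it out for a fresh draw each time.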
jpp