
I have looked into "Stratified sample in pandas" and "stratified sampling on ranges", among others, but they don't address my issue specifically, as I'm looking to split the data randomly into 3 sets.

I have an imbalanced dataframe of 10k rows: 10% positive class, 90% negative class. I'm trying to find a way to split it into 3 datasets containing 60%, 20%, and 20% of the rows while taking the imbalance into account. The split has to be random and without replacement, meaning that putting the 3 datasets back together must give the original dataframe.

Usually I would use train_test_split(), but it only splits the data into two datasets, not three.

Any suggestions?

Reproducible example:

import numpy as np
import pandas as pd
df = pd.DataFrame({"target": np.random.choice([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], size=10000)}, index=range(0, 10000, 1))
Chris

1 Answer


How about using train_test_split() twice? The first time, use train_size=0.6 to obtain a 60% training set and a 40% (validation + test) set. The second time, use train_size=0.5 on that 40% set to obtain 50% * 40% = 20% validation and 20% test. Is this workaround valid for you?
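A minimal sketch of the two-step split described above, assuming scikit-learn is available. The `stratify` argument and the fixed `random_state` are additions not mentioned in the answer; they are included here because the question asks for the class imbalance to be respected. The dataframe is a stand-in built the same way as the question's example, and the names `train`, `valid`, `test`, and `rest` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: roughly 10% positives, as in the question (assumption: seeded for repeatability)
rng = np.random.default_rng(0)
df = pd.DataFrame({"target": rng.choice([0, 1], size=10_000, p=[0.9, 0.1])})

# Step 1: 60% train, 40% held out, keeping the class ratio in both parts
train, rest = train_test_split(
    df, train_size=0.6, stratify=df["target"], random_state=0
)

# Step 2: split the held-out 40% in half, giving 20% validation and 20% test
valid, test = train_test_split(
    rest, train_size=0.5, stratify=rest["target"], random_state=0
)

# The three parts are disjoint and together cover every original row
recombined = pd.concat([train, valid, test]).sort_index()
assert len(recombined) == len(df)
assert (recombined.index == df.index).all()

for name, part in [("train", train), ("valid", valid), ("test", test)]:
    print(name, len(part), round(part["target"].mean(), 3))
```

Because both calls sample without replacement, no row appears in more than one part, and concatenating the parts recovers the original rows (in shuffled order unless re-sorted by index).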

yonatansc97
  • It's a workaround, but strictly speaking it is not random, as the last split is a sub-split of a previous split. – Chris Sep 30 '20 at 17:04
  • Could you explain a bit more? If by random you mean that every sample has a 60% chance of being in the training set, a 20% chance of being in the validation set, and a 20% chance of being in the test set, and that this holds equally for all samples, then the method above should be random. Why do you think it isn't? – yonatansc97 Oct 01 '20 at 04:03
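The probability claim in the last comment can be checked empirically. A sample lands in the validation set only if it is held out in the first split (probability 0.4) and then chosen in the second (probability 0.5), giving 0.4 * 0.5 = 0.2, and likewise for the test set. The sketch below imitates the two-step procedure with plain NumPy shuffles rather than calling train_test_split() itself, and ignores stratification; both are simplifying assumptions, and the sample count and trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 20_000
counts = np.zeros((n, 3))  # columns: train, valid, test

for _ in range(trials):
    first = rng.permutation(n)           # step 1: shuffle all samples
    train, rest = first[:6], first[6:]   # 60% train, 40% held out
    rest = rng.permutation(rest)         # step 2: reshuffle the held-out part
    valid, test = rest[:2], rest[2:]     # halve it: 20% and 20%
    counts[train, 0] += 1
    counts[valid, 1] += 1
    counts[test, 2] += 1

# Per-sample frequency of landing in each set, one row per sample
freq = counts / trials
print(freq.round(2))
```

Each row of `freq` should come out close to 0.6, 0.2, 0.2, which is what a single-pass random assignment with those proportions would also give.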