Is there a way to split a pandas dataframe into multiple, mutually exclusive samples (of different lengths) stratified on a variable?

My current approach is to call train_test_split from scikit-learn once per sample, but this feels very inefficient.

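from sklearn.model_selection import train_test_split

# "strat_variable" is a string column in data (and in cell_to_split) that I'm using for random stratified sampling.
# The stratify argument has to be taken from the frame being split so that its length matches.
cell_to_split, cell_1 = train_test_split(data, test_size=50, stratify=data["strat_variable"])
cell_to_split, cell_2 = train_test_split(cell_to_split, test_size=60, stratify=cell_to_split["strat_variable"])
cell_to_split, cell_3 = train_test_split(cell_to_split, test_size=40, stratify=cell_to_split["strat_variable"])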

This gives me 3 mutually exclusive samples from the dataset, each with a specified size (number of rows) and each balanced for representativeness on my strat_variable. It isn't very efficient, though, and ideally the number of samples (3 here) would be dynamic.
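One way to make the number of samples dynamic is to wrap the repeated calls in a loop driven by a list of sizes. The following is only a minimal sketch of that idea: the helper name stratified_cells, its parameters sizes and strat_col, and the toy data are all hypothetical and not part of the original code. It does not reduce the number of train_test_split calls, and it assumes the sizes leave some rows over after the last draw, since train_test_split needs a non-empty remainder.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_cells(data, sizes, strat_col, random_state=None):
    """Draw mutually exclusive stratified samples with the given row counts.

    Hypothetical helper for illustration; returns the list of samples
    and the rows that were not assigned to any sample.
    """
    remaining = data
    cells = []
    for size in sizes:
        # Stratify on the rows still available, so the stratify array
        # always has the same length as the frame being split.
        remaining, cell = train_test_split(
            remaining,
            test_size=size,
            stratify=remaining[strat_col],
            random_state=random_state,
        )
        cells.append(cell)
    return cells, remaining


# Toy data (assumed): 300 rows in three equally sized groups.
data = pd.DataFrame({
    "value": range(300),
    "strat_variable": ["a", "b", "c"] * 100,
})

cells, leftover = stratified_cells(
    data, [50, 60, 40], "strat_variable", random_state=0
)
```

Here cells holds one dataframe per requested size, and leftover holds the unassigned rows, so adding a fourth sample only means adding another entry to the list of sizes.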

SH_6778
  • Try Stratified K-Folds cross-validator https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html – Pygirl Apr 05 '20 at 02:58
  • Thanks for the quick reply. I may have cases where I need the samples to be different sizes, though, and I don't see a way to specify that in StratifiedKFold. I've also updated my question to make that clearer. – SH_6778 Apr 05 '20 at 03:01
  • Then try the Stratified ShuffleSplit cross-validator: https://stackoverflow.com/questions/45500915/how-to-give-the-test-size-in-stratified-kfold-sampling-in-python – Pygirl Apr 05 '20 at 03:04
  • This answer is closest to your requirement https://stackoverflow.com/a/39501510/6660373 – Pygirl Apr 05 '20 at 03:07

0 Answers