
I'm working on a multi-class classification problem on a data set with imbalanced labels, and I want to investigate how my algorithm performs in the small-sample-size regime.

I want to create my training set by selecting p% of each class uniformly at random. For example, suppose I have classes and counts {(A,20), (B,40), (C,90)}, where each pair is (ClassName, NumSamples). I'd like to sample 10% of each class to get a training set {(A,2), (B,4), (C,9)}.

I tried this:

from sklearn.model_selection import train_test_split
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.9, stratify=y)

and the numbers that I get from doing

print(pd.Series(y_trn).value_counts())
print(pd.Series(y_tst).value_counts())
print(X.shape)
print(X_trn.shape)

suggest I'm getting what I want, but I'd like to double-check before going any further.

  • Does this answer your question? [Parameter "stratify" from method "train\_test\_split" (scikit Learn)](https://stackoverflow.com/questions/34842405/parameter-stratify-from-method-train-test-split-scikit-learn) – Alexander L. Hayes Sep 11 '22 at 14:55

1 Answer

pandas' describe method may help you summarize the labels after train_test_split (this assumes X_trn is a DataFrame that contains a ClassName column):

print(X_trn["ClassName"].describe())
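To check the stratified split more directly, one option is to compare per-class counts before and after splitting. The sketch below is only an illustration: it builds toy labels with the counts from the question (A=20, B=40, C=90), uses a placeholder feature matrix, and fixes random_state=0 for repeatability. None of these come from the original data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels matching the counts in the question: A=20, B=40, C=90.
y = np.array(["A"] * 20 + ["B"] * 40 + ["C"] * 90)
# Placeholder features (hypothetical): a single column holding the row index.
X = np.arange(len(y)).reshape(-1, 1)

# Same call as in the question, with a fixed seed so the run is repeatable.
X_trn, X_tst, y_trn, y_tst = train_test_split(
    X, y, test_size=0.9, stratify=y, random_state=0
)

# Per-class counts in the full data and in each split, aligned by class name.
counts = pd.DataFrame({
    "full": pd.Series(y).value_counts(),
    "train": pd.Series(y_trn).value_counts(),
    "test": pd.Series(y_tst).value_counts(),
}).sort_index()

# Fraction of each class that landed in the training set.
counts["train_frac"] = counts["train"] / counts["full"]
print(counts)
```

If train_frac is close to 0.1 for every class, stratify is doing what the question asks. On real data, p% of a class size is usually not a whole number, so the per-class counts are rounded and the fractions will only be approximately equal to p.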

Hope it helps!

metc