I'm working on a multiclass classification problem on a dataset with imbalanced labels, and I want to investigate how my algorithm performs in the small-sample-size regime.
Specifically, I want to build my training set by selecting p% of each class uniformly at random. For example, suppose my classes and counts are {(A,20), (B,40), (C,90)}, where each pair is (ClassName, NumSamples). I'd like to sample 10% of each class to get a training set of {(A,2), (B,4), (C,9)}.
I tried this:
from sklearn.model_selection import train_test_split
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.9, stratify=y)
and the numbers that I get from doing
print(pd.Series(y_trn).value_counts())
print(pd.Series(y_tst).value_counts())
print(X.shape)
print(X_trn.shape)
suggest I'm getting what I want, but I'd like to double check before going any further. Does stratify=y guarantee that every class is sampled at the same rate?
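To make the check concrete, here is a minimal sketch of the comparison I have in mind. The toy labels and the placeholder feature matrix are made up to match the example counts above (they are not my real data), and random_state=0 is only there to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy labels matching the example counts: A=20, B=40, C=90.
y = np.array(["A"] * 20 + ["B"] * 40 + ["C"] * 90)
# Placeholder features: one column holding the row index.
X = np.arange(len(y)).reshape(-1, 1)

# Keep 10% for training, stratified on the labels.
X_trn, X_tst, y_trn, y_tst = train_test_split(
    X, y, test_size=0.9, stratify=y, random_state=0
)

# Per-class counts in the full data and in the training split.
full_counts = pd.Series(y).value_counts()
trn_counts = pd.Series(y_trn).value_counts()

# Side-by-side table with the fraction of each class that was kept.
summary = pd.DataFrame({"full": full_counts, "train": trn_counts})
summary["train_fraction"] = summary["train"] / summary["full"]
print(summary)
```

With these counts 10% of every class is a whole number, so I would expect the train column to show 2, 4 and 9 and every train_fraction to be 0.1. When p% of a class is not a whole number, the per-class counts have to be rounded, so the fractions should only be approximately equal.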