I have a question that I searched for but couldn't find an answer to.

If I have a dataset labeled with three or more classes, where each class makes up an equal share of the data (for example 33% each with three classes), do the training/validation/test sets keep the same class balance when I split the data?

If not, is there a way to keep the balance?

Thanks in advance.

leila
  • Possible duplicate of [Stratified Train/Test-split in scikit-learn](https://stackoverflow.com/questions/29438265/stratified-train-test-split-in-scikit-learn) – Venkatachalam Feb 13 '19 at 16:53

1 Answer

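Found it! Pass stratify=y to train_test_split so the split keeps the class proportions of y:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)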
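
Since the question also mentions a validation set: one way to get all three sets is to call train_test_split twice and stratify both times. This is only a minimal sketch. The toy data (300 samples, three equal classes), the 60/20/20 ratio, and the helper class_counts are assumptions for illustration, not part of the original problem.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 300 samples, three classes of 100 each.
X = np.arange(300).reshape(-1, 1)
y = np.repeat([0, 1, 2], 100)

# Step 1: hold out 20% of the data as the test set, stratified on the labels.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% is 20% of the full data, giving 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)


def class_counts(labels):
    """Hypothetical helper: return {class_label: count} for a label array."""
    return dict(Counter(labels.tolist()))


for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(labels), class_counts(labels))
```

The second call has to stratify on y_rest (the labels that are left after the first split), not on the original y, because the stratify array must have the same length as the data being split.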

leila
  • What does this do? – Nicolas Gervais Jul 21 '20 at 15:34
  • It stratifies the split, so the train and test sets keep the same class proportions as the full data. E.g. if you have 100 of class 1 and 100 of class 2 and split with a test size of 0.2, you get a train set with 80 of class 1 and 80 of class 2, and a test set with 20 of class 1 and 20 of class 2. – leila Jul 21 '20 at 15:43
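
The numbers in the comment above can be checked with a short script. This is a sketch under assumed toy data (200 made-up samples, 100 per class); the counting is done with np.unique and return_counts=True.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed toy data matching the example: 100 samples of class 1, 100 of class 2.
X = np.arange(200).reshape(-1, 1)
y = np.array([1] * 100 + [2] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many samples of each class ended up in each split.
train_classes, train_counts = np.unique(y_train, return_counts=True)
test_classes, test_counts = np.unique(y_test, return_counts=True)

print("train:", dict(zip(train_classes.tolist(), train_counts.tolist())))
print("test: ", dict(zip(test_classes.tolist(), test_counts.tolist())))
```

Without stratify=y the split is purely random, so the per-class counts can drift away from 80/80 and 20/20 depending on random_state.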