
My data consists of 99% target variable = 1 and 1% target variable = 0. Does stratify guarantee that the train and test sets have an equal ratio of data in terms of the target variable, i.e. that each contains equal amounts of '1' and '0'?

Please see below code for clarification

 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
Pratik Kumar

2 Answers


The first difference is that train_test_split(X, y, test_size=0.2, stratify=y) splits the data only once, with 80% of it in the train set and 20% in the test set.

Whereas StratifiedKFold(n_splits=2) splits the data into two folds of 50% each, and each fold is used as the test set once.

Second, you can specify n_splits greater than 2 to get a cross-validation effect, in which the data is split n_splits times. So there will be multiple divisions of the data into train and test.
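As a sketch of the two differences above (the tiny dataset is synthetic, assumed only for illustration), train_test_split gives one 80/20 split while StratifiedKFold(n_splits=2) gives two 50/50 folds, each used as the test set once:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy imbalanced data: 8 samples of class 1, 2 samples of class 0.
X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

# train_test_split: one split, 80% train / 20% test, ratio preserved.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(len(y_tr), len(y_te))  # 8 2

# StratifiedKFold(n_splits=2): two 50/50 folds, every sample lands in
# a test fold exactly once.
skf = StratifiedKFold(n_splits=2)
for train_idx, test_idx in skf.split(X, y):
    print(len(train_idx), len(test_idx))  # 5 5 (twice)
```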

For more information about K-fold you can look at this question:

The idea is the same there: train_test_split will internally use StratifiedShuffleSplit.

Vivek Kumar
  • in the first sentence of your answer, does 'stratify=y' guarantee equal splits of the data to counter imbalanced classes? –  Jan 23 '18 at 13:58
  • @MajidHelmy What do you mean by "equal splits"? The ratio of data will be maintained based on the classes. – Vivek Kumar Jan 23 '18 at 13:59
  • well actually that is my main question: my data consists of 99% target variable = 1 and 1% target variable = 0. Does stratify guarantee that the train and test sets have an equal ratio of data in terms of the target variable, i.e. equal amounts of '1' and '0'? @Vivek –  Jan 23 '18 at 14:01
  • @MajidHelmy If by equal you mean the same number of samples for both classes, then no. The ratio of classes in the new split parts will be equal to the ratio of classes in the whole data before the split. – Vivek Kumar Jan 23 '18 at 14:37
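To make the point in the comments concrete, here is a small sketch (the 99:1 class counts are synthetic, chosen to match the question): stratify=y preserves the original class ratio in both parts; it does not equalize the classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 99% class 1, 1% class 0, as in the question.
y = np.array([1] * 990 + [0] * 10)
X = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# The 99:1 ratio is preserved in both parts, not turned into 50:50.
print(np.bincount(y_tr))  # [  8 792]
print(np.bincount(y_te))  # [  2 198]
```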

Stratification just returns a portion of the data, shuffled or not depending on the arguments you pass. Say your dataset consists of 100 instances of class 1 and 10 instances of class 0, and you do a 70:30 split: with the appropriate parameters you get 63 class-1 and 7 class-0 instances in the training set, and 27 class-1 and 3 class-0 instances in the test set. Clearly, that is in no way balanced. The classifier you train on it will be highly biased, about as good as a dummy classifier that predicts class 1 for every input.

A better approach would be either to collect more class-0 data, to oversample the dataset to artificially generate more class-0 instances, or to undersample it to keep fewer class-1 instances. The Python library imbalanced-learn (imblearn) can help you with that.
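As a rough sketch of the oversampling idea with plain NumPy (the data is synthetic; in practice imblearn's RandomOverSampler does the same job with more options), the minority class is resampled with replacement until both classes have the same count:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data: 100 majority (class 1), 10 minority (class 0).
X = rng.normal(size=(110, 2))
y = np.array([1] * 100 + [0] * 10)

# Random oversampling: draw minority-class rows with replacement
# until class 0 matches class 1 in size.
minority_idx = np.flatnonzero(y == 0)
extra = rng.choice(minority_idx, size=100 - len(minority_idx), replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print(np.bincount(y_bal))  # [100 100]
```

Note that oversampling should be applied to the training set only, after the split, so duplicated minority samples never leak into the test set.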

Pratik Kumar