Splitting dataset to train and test in Python

Question

I have a dataset whose Label is 0 or 1.

I want to split my data into training and test sets. For this, I first used the train_test_split function from scikit-learn.

But I want to select the test data in such a way that 10% of it is from class 0 and 90% is from class 1.

How can I do this?

Please provide a minimal reproducible example https://stackoverflow.com/help/minimal-reproducible-example — sunnytown, Nov 02 '22 at 08:25
If you are doing this for a ML project, then most likely you shouldn't be doing this. Data should be split equally among labels. — LLaP, Nov 02 '22 at 08:32
Please do some research, read the documentation for `train_test_split` (which answers your question), and share your code. Then people can help you debug it. — Matt Hall, Nov 02 '22 at 10:12
Does this answer your question? [Parameter "stratify" from method "train\_test\_split" (scikit Learn)](https://stackoverflow.com/questions/34842405/parameter-stratify-from-method-train-test-split-scikit-learn) — Matt Hall, Nov 02 '22 at 10:12

score 4 · Answer 1 · answered Nov 02 '22 at 08:33

Refer to the official documentation for sklearn.model_selection.train_test_split.

You want to specify the response variable with the stratify parameter when performing the split.

Stratification preserves the ratio of the class variable when the split is performed.
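
As a minimal sketch of what stratify does (the toy arrays, the 80/20 class balance, and test_size=0.25 are assumptions made up for illustration, not taken from the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed): 80 samples of class 0 and 20 samples of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y asks the split to keep the class proportions of y in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fraction of class 1 in each split
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(train_ratio, test_ratio)
```

Note that this keeps the original class ratio in both splits; on its own it does not let you pick a different ratio for the test set.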

Dan Nagle

Thanks a lot, but I don't want to keep the original class ratio in the test data. I want to manually specify the ratio of class 0 and 1 for the test data. @Dan Nagle – saraafr Nov 05 '22 at 13:09
You could simply introduce a dummy field to the data that incorporates the preferred ratio and pass it as the `stratify` parameter. – Dan Nagle Nov 05 '22 at 13:53

score 1 · Answer 2 · answered Feb 12 '23 at 15:03

You should write your own function to do this. One way is to select rows of each class by index and shuffle them after taking them.
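
A rough sketch of such a function might look like the following. The function name split_with_test_ratio, its parameters, and the toy DataFrame are all hypothetical and only for illustration; it also assumes the DataFrame index is unique and the label column holds only 0 and 1.

```python
import numpy as np
import pandas as pd


def split_with_test_ratio(df, label_col, n_test, frac_class_1, seed=0):
    """Hypothetical helper: return (train, test) where test has n_test rows
    and roughly frac_class_1 of them belong to class 1."""
    rng = np.random.default_rng(seed)

    # How many test rows to draw from each class
    n_1 = int(round(n_test * frac_class_1))
    n_0 = n_test - n_1

    # Index labels of each class (assumes a unique index)
    idx_0 = df.index[df[label_col] == 0].to_numpy()
    idx_1 = df.index[df[label_col] == 1].to_numpy()
    if n_0 > len(idx_0) or n_1 > len(idx_1):
        raise ValueError("not enough rows in one class for the requested test set")

    # Draw the test rows without replacement from each class
    test_idx = np.concatenate([
        rng.choice(idx_0, size=n_0, replace=False),
        rng.choice(idx_1, size=n_1, replace=False),
    ])

    # sample(frac=1) shuffles the rows; everything not in test goes to train
    test = df.loc[test_idx].sample(frac=1, random_state=seed)
    train = df.drop(index=test_idx).sample(frac=1, random_state=seed)
    return train, test


# Toy data (assumed): 500 rows per class, with a column named "Label"
df = pd.DataFrame({"feature": range(1000), "Label": [0] * 500 + [1] * 500})
train, test = split_with_test_ratio(df, "Label", n_test=100, frac_class_1=0.9)
print(test["Label"].value_counts())
```

Fixing the number of test rows per class, rather than a fraction of each class, is what makes the test set ratio independent of how balanced the original data is.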

ariankazemi

score 0 · Answer 3 · answered Nov 02 '22 at 08:46

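Split your dataset into class 0 and class 1, then split each part as you want:

import pandas as pd
from sklearn.model_selection import train_test_split

# "class" is a Python keyword, so use bracket access instead of df.class
df_0 = df.loc[df["class"] == 0]
df_1 = df.loc[df["class"] == 1]

# train_test_split returns (train, test); test_size is the share of each class sent to test.
# Note: these are per-class fractions, so the test set is 10%/90% only if both classes have the same size.
train_0, test_0 = train_test_split(df_0, test_size=0.1)
train_1, test_1 = train_test_split(df_1, test_size=0.9)

# Stack the rows (axis=0); sample(frac=1) shuffles the df
test = pd.concat((test_0, test_1), axis=0, ignore_index=True).sample(frac=1)
train = pd.concat((train_0, train_1), axis=0, ignore_index=True).sample(frac=1)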
