-1

I have a dataset whose label is 0 or 1.

I want to divide my data into testing and training datasets. For this, I first used the train_test_split method from scikit-learn.

But I want to select the test data in such a way that 10% of it is from class 0 and 90% is from class 1.

How can I do this?

Peter Mortensen
saraafr
  • Please provide a minimal reproducible example: https://stackoverflow.com/help/minimal-reproducible-example – sunnytown Nov 02 '22 at 08:25
  • If you are doing this for an ML project, then most likely you shouldn't be doing this. Data should be split equally among labels. – LLaP Nov 02 '22 at 08:32
  • Please do some research, read the documentation for `train_test_split` (which answers your question), and share your code. Then people can help you debug it. – Matt Hall Nov 02 '22 at 10:12
  • Does this answer your question? [Parameter "stratify" from method "train\_test\_split" (scikit Learn)](https://stackoverflow.com/questions/34842405/parameter-stratify-from-method-train-test-split-scikit-learn) – Matt Hall Nov 02 '22 at 10:12

3 Answers

4

Refer to the official documentation for `sklearn.model_selection.train_test_split`.

You want to specify the response variable with the stratify parameter when performing the split.

Stratification preserves the ratio of the class variable when the split is performed.
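As a minimal sketch of what stratification does, the snippet below splits a hypothetical toy dataset (the arrays `X` and `y` and the 80/20 class mix are made up for illustration) and compares the share of class 1 in each subset. Note that this preserves the original ratio; it does not let you pick a different one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 samples of class 0 and 20 samples of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both resulting subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

train_ratio = y_train.mean()  # fraction of class 1 in the training set
test_ratio = y_test.mean()    # fraction of class 1 in the test set
print(train_ratio, test_ratio)
```

Both fractions should match the class-1 share of the full dataset, which is the behaviour the `stratify` parameter is meant to provide.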

Dan Nagle
  • Thanks a lot. But I don't want to keep the class ratio of the original data in the test data. I want to manually specify the ratio of classes 0 and 1 for the test data. @Dan Nagle – saraafr Nov 05 '22 at 13:09
  • You could simply introduce a dummy field to the data that incorporates the preferred ratio and pass it as the `stratify` parameter. – Dan Nagle Nov 05 '22 at 13:53
1

You should write your own function for this. One way is to select rows of each class by index and shuffle them after taking them.
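One possible shape for such a function is sketched below. The name `split_with_test_ratio`, its parameters, and the toy label array are all hypothetical, and it assumes binary labels 0 and 1 with enough samples of each class to fill the requested test set.

```python
import numpy as np

def split_with_test_ratio(y, n_test, frac_class_1, seed=0):
    """Return (train_idx, test_idx) so the test set has the requested class mix."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)

    # Number of test samples to draw from each class.
    n_1 = int(round(n_test * frac_class_1))
    n_0 = n_test - n_1

    # Row indices belonging to each class.
    idx_0 = np.flatnonzero(y == 0)
    idx_1 = np.flatnonzero(y == 1)
    if n_0 > idx_0.size or n_1 > idx_1.size:
        raise ValueError("Not enough samples in one class for the requested test set.")

    # Draw the test rows without replacement; everything else is training data.
    test_idx = np.concatenate([
        rng.choice(idx_0, size=n_0, replace=False),
        rng.choice(idx_1, size=n_1, replace=False),
    ])
    train_idx = np.setdiff1d(np.arange(y.size), test_idx)

    # Shuffle so the rows are not grouped by class.
    rng.shuffle(test_idx)
    rng.shuffle(train_idx)
    return train_idx, test_idx

# Hypothetical usage: 1000 labels, test set of 100 rows with 90% class 1.
y = np.array([0] * 500 + [1] * 500)
train_idx, test_idx = split_with_test_ratio(y, n_test=100, frac_class_1=0.9)
```

The returned index arrays can then be used to select rows, for example with `df.iloc[test_idx]` on a pandas DataFrame or `X[test_idx]` on a NumPy array.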

ariankazemi
0

Split your dataset into class 0 and class 1, then split each part as you want:

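import pandas as pd
from sklearn.model_selection import train_test_split

# df is your DataFrame with a "class" label column.
# "class" is a Python keyword, so use bracket indexing rather than df.class.
df_0 = df.loc[df["class"] == 0]
df_1 = df.loc[df["class"] == 1]

# train_test_split returns (train, test), and test_size must be passed by keyword.
# Note: the test set is 10% class 0 / 90% class 1 only if both classes are the same size.
train_0, test_0 = train_test_split(df_0, test_size=0.1)
train_1, test_1 = train_test_split(df_1, test_size=0.9)

# Stack the rows (axis=0); sample(frac=1) shuffles the whole DataFrame.
test = pd.concat((test_0, test_1), axis=0, ignore_index=True).sample(frac=1)
train = pd.concat((train_0, train_1), axis=0, ignore_index=True).sample(frac=1)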

JArmunia