Train and validation split data based on a condition

Question

My data look like this:

Title  Source Y
aaaaa  a      1
bbbbb  a      0
ccccc  b      0
ddddd  c      0
eeeee  c      0
fffff  a      0
ggggg  b      0
hhhhh  c      1
iiiii  a      0
jjjjj  a      0
....
....
....

Y is the expected value (the label). Data with Y = 1 --> 20%; data with Y = 0 --> 80%.

I'm splitting the dataset this way. Note: train_val_split = 0.4

def split_dataset(self, dataset: Dataset | DatasetDict) -> Dataset | DatasetDict:
    if self.train_val_split is not None:
        # Random split: class proportions of Y are not preserved here.
        split = dataset["train"].train_test_split(self.train_val_split)
        dataset["train"] = split["train"]
        dataset["validation"] = split["test"]
    dataset = self._select_samples(dataset)
    return dataset

And I'm getting this:

Training set

Title  Source Y
aaaaa  a      1
ddddd  c      0
eeeee  c      0
fffff  a      0
ggggg  b      0
hhhhh  c      1

Test set

Title  Source Y
bbbbb  a      0
ccccc  b      0
iiiii  a      0
jjjjj  a      0

I would like to split the data while keeping the class percentages of the initial dataset. In other words, I would like to get something like this:

Training set

Title  Source Y
ccccc  b      0
ddddd  c      0
eeeee  c      0
fffff  a      0
ggggg  b      0
hhhhh  c      1

Test set

Title  Source Y
aaaaa  a      1
bbbbb  a      0
iiiii  a      0
jjjjj  a      0

Is there any way of doing this?

Are you looking for the option `stratify=y`? [Check here](https://stackoverflow.com/questions/34842405/parameter-stratify-from-method-train-test-split-scikit-learn) - Redox, Nov 04 '22 at 11:41
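
To illustrate what a stratified split does, here is a minimal, library-free sketch. It is not the asker's code: `stratified_split` and its parameters are hypothetical names, and the rows are the ten sample rows from the question. If the dataset is a Hugging Face `datasets.Dataset` (an assumption based on the `train_test_split` call in the question), that method also accepts a `stratify_by_column` argument, which as far as I know requires the column to be a `ClassLabel` feature.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, test_size, seed=0):
    """Split rows into (train, test), keeping each label's share roughly equal.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group the rows by their label value.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        # Take the same fraction from every label group.
        # Rounding can shift a row either way for very small groups.
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Shuffle again so rows are not ordered by label.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# The ten sample rows from the question: 20% have Y = 1.
rows = [
    {"Title": "aaaaa", "Source": "a", "Y": 1},
    {"Title": "bbbbb", "Source": "a", "Y": 0},
    {"Title": "ccccc", "Source": "b", "Y": 0},
    {"Title": "ddddd", "Source": "c", "Y": 0},
    {"Title": "eeeee", "Source": "c", "Y": 0},
    {"Title": "fffff", "Source": "a", "Y": 0},
    {"Title": "ggggg", "Source": "b", "Y": 0},
    {"Title": "hhhhh", "Source": "c", "Y": 1},
    {"Title": "iiiii", "Source": "a", "Y": 0},
    {"Title": "jjjjj", "Source": "a", "Y": 0},
]

train, test = stratified_split(rows, "Y", test_size=0.4)
print(len(train), len(test))
```

With `test_size=0.4` on these ten rows, the split yields 6 training rows and 4 test rows, each containing exactly one row with Y = 1, which matches the distribution asked for in the question. Which specific rows land in each set depends on the seed.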
