My data look like this:
Title Source Y
aaaaa a 1
bbbbb a 0
ccccc b 0
ddddd c 0
eeeee c 0
fffff a 0
ggggg b 0
hhhhh c 1
iiiii a 0
jjjjj a 0
....
....
....
Y is the expected value (the label). The classes are imbalanced:
Data with Y = 1 --> 20%
Data with Y = 0 --> 80%
I'm splitting the dataset this way. Note: train_val_split = 0.4
def split_dataset(self, dataset: Dataset | DatasetDict) -> Dataset | DatasetDict:
    if self.train_val_split is not None:
        # train_val_split (0.4 here) is passed as the test-set fraction
        split = dataset["train"].train_test_split(self.train_val_split)
        dataset["train"] = split["train"]
        dataset["validation"] = split["test"]
    dataset = self._select_samples(dataset)
    return dataset
And I'm getting this (both Y = 1 rows end up in the training set, and none in the test set):
Training set
Title Source Y
aaaaa a 1
ddddd c 0
eeeee c 0
fffff a 0
ggggg b 0
hhhhh c 1
Test set
Title Source Y
bbbbb a 0
ccccc b 0
iiiii a 0
jjjjj a 0
I would like to split the data while keeping the class percentages of the initial dataset in both parts. In other words, I would like to get something like this:
Training set
Title Source Y
ccccc b 0
ddddd c 0
eeeee c 0
fffff a 0
ggggg b 0
hhhhh c 1
Test set
Title Source Y
aaaaa a 1
bbbbb a 0
iiiii a 0
jjjjj a 0
Is there any way of doing this?
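For reference, what I'm describing is usually called a stratified split. Below is a minimal sketch of the behaviour I'm after, in plain Python on the ten sample rows above. The helper name `stratified_split` and its parameters are my own (hypothetical), not part of the `datasets` library, and rounding each class to a whole number of test rows is just one possible choice.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, test_size, seed=0):
    """Split rows into (train, test) so each label keeps roughly the same share in both."""
    rng = random.Random(seed)

    # Group the rows by their label value.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Take the same fraction of every class for the test set.
        # Note: a very small class can end up with 0 test rows after rounding.
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Shuffle again so rows are not ordered by class.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# The ten sample rows from the question: 2 rows with Y = 1, 8 rows with Y = 0.
rows = [
    {"Title": "aaaaa", "Source": "a", "Y": 1},
    {"Title": "bbbbb", "Source": "a", "Y": 0},
    {"Title": "ccccc", "Source": "b", "Y": 0},
    {"Title": "ddddd", "Source": "c", "Y": 0},
    {"Title": "eeeee", "Source": "c", "Y": 0},
    {"Title": "fffff", "Source": "a", "Y": 0},
    {"Title": "ggggg", "Source": "b", "Y": 0},
    {"Title": "hhhhh", "Source": "c", "Y": 1},
    {"Title": "iiiii", "Source": "a", "Y": 0},
    {"Title": "jjjjj", "Source": "a", "Y": 0},
]

train, test = stratified_split(rows, label_key="Y", test_size=0.4)
print(len(train), len(test))
print(sum(r["Y"] for r in train), sum(r["Y"] for r in test))
```

With these rows, each part gets exactly one Y = 1 row, which is the kind of result I want. As far as I can tell, `train_test_split` in `datasets` also accepts a `stratify_by_column` argument, but it seems to expect the column to be a `ClassLabel` feature (which `class_encode_column` can produce), so I'm not sure how to wire it into the method above.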