I would like to split my dataset into training and test sets in a 50:50 ratio within each 'fruit' class, with the constraint that all rows sharing the same ID end up in the same set (either training or test, never both).
Here is some example data:
import pandas as pd
import random
from sklearn.model_selection import GroupShuffleSplit
df = pd.DataFrame({
    'fruit': ['watermelon', 'watermelon', 'watermelon', 'watermelon', 'watermelon',
              'apple', 'apple', 'apple', 'apple', 'apple', 'apple', 'apple',
              'lemon', 'lemon'],
    'ID': [1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 6, 7, 8],
    'value1': random.sample(range(10, 100), 14),
    'value2': random.sample(range(10, 100), 14),
})
Here is what I tried:
X = df[['value1', 'value2']]
y = df['fruit']
groups = df['ID']
gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
What I expect, taking the watermelon class as an example: the three rows with ID = 1 go into one set (say, training) and the two rows with ID = 2 go into the other (test), and likewise for apple and lemon. However, the split I get is not balanced per class: for example, both lemon rows (IDs 7 and 8) can end up together in training or together in test, when there should be one lemon row in each set.
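One possible way to get this behaviour is to run GroupShuffleSplit separately inside each fruit class and then combine the per-class results. The following is only a minimal sketch under some assumptions: the helper name `split_by_group_within_class` and its parameters are made up for illustration, the 50:50 ratio is applied to the number of distinct IDs per class (not to the number of rows), and every class is assumed to have at least two distinct IDs, since a class with a single ID cannot be placed in both sets.

```python
import random

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    'fruit': ['watermelon'] * 5 + ['apple'] * 7 + ['lemon'] * 2,
    'ID': [1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 6, 7, 8],
    'value1': random.sample(range(10, 100), 14),
    'value2': random.sample(range(10, 100), 14),
})


def split_by_group_within_class(data, class_col='fruit', group_col='ID',
                                test_size=0.5, random_state=0):
    """Hypothetical helper: group-aware split performed inside each class."""
    train_labels, test_labels = [], []
    for _, class_df in data.groupby(class_col, sort=False):
        # Split only this class's rows, keeping rows with the same ID together.
        gss = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                random_state=random_state)
        train_pos, test_pos = next(gss.split(class_df, groups=class_df[group_col]))
        # split() returns positions within class_df; map them back to index labels.
        train_labels.extend(class_df.index[train_pos])
        test_labels.extend(class_df.index[test_pos])
    return data.loc[train_labels], data.loc[test_labels]


train_df, test_df = split_by_group_within_class(df)

X_train, y_train = train_df[['value1', 'value2']], train_df['fruit']
X_test, y_test = test_df[['value1', 'value2']], test_df['fruit']

print(train_df[['fruit', 'ID']])
print(test_df[['fruit', 'ID']])
```

With this data, each class contributes half of its IDs to each set, so one lemon row lands in training and one in test. Which particular ID goes to which side depends on `random_state`, so watermelon ID = 1 may end up in test rather than training.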