Although I use machine learning terminology, my question is purely an engineering one and has nothing to do with statistics or mathematics, which is why I ask it here instead of on Cross Validated.
This is the sample data I will use to illustrate my question:
import pandas as pd

X = pd.DataFrame(columns=["F1", "F2"],
                 data=[[23, 0.8],
                       [11, 5.35],
                       [24, 19.18],
                       [15, 10.25],
                       [10, 11.30],
                       [55, 44.85],
                       [15, 33.88],
                       [12, 45.30],
                       [14, 22.20],
                       [15, 15.80],
                       [83, 0.8],
                       [51, 5.35],
                       [34, 30.28],
                       [35, 15.25],
                       [60, 13.30],
                       [75, 44.80],
                       [35, 30.77],
                       [62, 40.33],
                       [64, 23.40],
                       [14, 11.80]])
y = pd.DataFrame(columns=["y"],
                 data=[[0],
                       [0],
                       [1],
                       [0],
                       [2],
                       [2],
                       [2],
                       [1],
                       [0],
                       [1],
                       [0],
                       [0],
                       [1],
                       [0],
                       [1],
                       [0],
                       [1],
                       [1],
                       [0],
                       [2]])
I need to split the data into training and testing sets. A classical way is to use the train_test_split function of sklearn:

from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25)
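As far as I know, the stratify parameter only keeps the class proportions roughly the same in both sets; it does not let me choose a different percentage per class:

# Stratified split: both sets get roughly the same class ratios,
# but I cannot ask for, say, 90% of class 2 in the training set.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, stratify=y)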
However, I want to specify the percentage of records assigned to the training and testing sets; more details are explained below.
In my case I am dealing with a multi-class classification problem in which y may take one of 3 values: 0, 1, 2. Records with the value 2 are very rare (in my real data set, approximately 3% of the whole dataset), so this is an imbalanced classification problem.
Because the problem is imbalanced, the records of the rare class are especially important. Therefore I want something like model_selection.train_test_split that lets me assign a percentage of records per class to the training and testing sets. For example, <50%, 60%, 90%> would mean that 50% of class 0's records, 60% of class 1's records and 90% of the rare class's records are assigned to the training set. In my sample data above, I would like to get, for instance, 3 of the 4 records with y equal to 2 in the training set (X_train, y_train) and 1 record in the testing set.
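In other words, I am imagining something along these lines (the function name and the dictionary argument are made up just to show the intent; this is not real sklearn code):

# Hypothetical call, one training fraction per class -- this function does not exist.
train_fraction = {0: 0.5, 1: 0.6, 2: 0.9}
X_train, X_test, y_train, y_test = per_class_train_test_split(
    X, y, train_size=train_fraction)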
I googled for similar questions but haven't found anything.
To start solving this, I concatenated the features and labels and shuffled the resulting data frame:

# Combine features and labels, then shuffle the rows
df = pd.concat([X, y], axis=1)
df = df.sample(frac=1).reset_index(drop=True)
However, I don't know how to proceed from here. Is there an sklearn built-in function, or some other library, that can solve this problem?
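The only idea I have so far is to sample each class separately by hand, roughly as in the sketch below (the fractions and variable names are just placeholders), but I would prefer an existing function if one exists:

import numpy as np

# Rough manual sketch: take a given fraction of each class for the training set,
# the remaining rows go to the testing set.
train_frac = {0: 0.5, 1: 0.6, 2: 0.9}  # example per-class training fractions

train_idx = []
for cls, frac in train_frac.items():
    cls_idx = df.index[df["y"] == cls]                 # rows belonging to this class
    n_train = int(round(frac * len(cls_idx)))          # how many of them go to training
    train_idx.extend(np.random.choice(cls_idx, size=n_train, replace=False))

train_df = df.loc[train_idx]
test_df = df.drop(index=train_idx)

X_train, y_train = train_df[["F1", "F2"]], train_df["y"]
X_test, y_test = test_df[["F1", "F2"]], test_df["y"]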