
Although I use machine learning terminology, my question is purely an engineering one and has nothing to do with statistics or mathematics. That is why I am asking it here instead of on Cross Validated.

This is the sample data I will use to illustrate my question:

import pandas as pd

X = pd.DataFrame(columns=["F1","F2"], 
                  data=[[23,0.8],
                        [11,5.35],
                        [24,19.18],
                        [15,10.25],
                        [10,11.30],
                        [55,44.85],
                        [15,33.88],
                        [12,45.30],
                        [14,22.20],
                        [15,15.80],
                        [83,0.8],
                        [51,5.35],
                        [34,30.28],
                        [35,15.25],
                        [60,13.30],
                        [75,44.80],
                        [35,30.77],
                        [62,40.33],
                        [64,23.40],
                        [14,11.80]])

y = pd.DataFrame(columns=["y"], 
                  data=[[0],
                        [0],
                        [1],
                        [0],
                        [2],
                        [2],
                        [2],
                        [1],
                        [0],
                        [1],
                        [0],
                        [0],
                        [1],
                        [0],
                        [1],
                        [0],
                        [1],
                        [1],
                        [0],
                        [2]])

I need to split the data into training and testing sets. The classical way is to use the train_test_split function from sklearn:

from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25)

However, I want to specify, for each class, the percentage of records assigned to the training and testing sets. More details are given below.

I am dealing with a multi-class classification problem in which y takes one of 3 values: 0, 1, 2. Records with the value 2 are very rare (approx. 3% of my real data set), so this is an imbalanced classification problem.

Because the problem is imbalanced, the records of the rare class are very important. I would therefore like a variant of model_selection.train_test_split that lets me set the percentage of records per class that goes to the training set. For example, <50%, 60%, 90%> would mean that 50% of class 0, 60% of class 1 and 90% of the rare class 2 are assigned to the training set, with the rest going to the testing set.

In my example, I would like to get, for instance, 3 records with y equal to 2 in the training set (X_train, y_train) and 1 record in the testing set.

I googled for similar questions but haven't found anything.

To solve this task, I shuffled the initial data frame:

df = pd.concat([X, y], axis=1)

df = df.sample(frac=1).reset_index(drop=True)

However, I don't know how to proceed from here. Is there a built-in sklearn function or some other library that can solve this problem?

ScalaBoy

1 Answer


There is an option called stratify in train_test_split. Also take a look at this similar question.

To get the per-class proportions that you need, you can use np.random.choice from numpy:
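For reference, here is a minimal sketch of what stratify=y does. The data is made up for illustration and only mirrors the class counts of the question's sample (9, 7 and 4 rows); random_state=0 is an arbitrary choice. Stratifying keeps the class ratio roughly the same in both splits, but it applies the single test_size to every class, so on its own it cannot give a different fraction per class.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data: 20 rows, class 2 is the rare class (4 rows).
X = pd.DataFrame({"F1": range(20), "F2": range(20, 40)})
y = pd.Series([0] * 9 + [1] * 7 + [2] * 4, name="y")

# stratify=y asks for (approximately) the same class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Per-class counts in each split.
print(y_train.value_counts().sort_index())
print(y_test.value_counts().sort_index())
```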

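import numpy as np
import pandas as pd

df = pd.concat([X, y], axis=1)

# randomly pick 50% of the index values with y == 0 for the training set
i0 = np.random.choice(df.loc[df.y == 0].index.values,
                      round(len(df.loc[df.y == 0]) * .5), replace=False)

# 60% of the index values with y == 1
i1 = np.random.choice(df.loc[df.y == 1].index.values,
                      round(len(df.loc[df.y == 1]) * .6), replace=False)

# 90% of the index values with y == 2 (the rare class)
i2 = np.random.choice(df.loc[df.y == 2].index.values,
                      round(len(df.loc[df.y == 2]) * .9), replace=False)

# picked rows go to the training set, all remaining rows to the testing set
train_idx = np.concatenate([i0, i1, i2])
df_train = df.loc[df.index.isin(train_idx)]
df_test = df.loc[~df.index.isin(train_idx)]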
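If you need this for more than three classes, the same idea can be wrapped in a small helper. This is only a sketch: split_by_class_fraction and its parameters are hypothetical names, not part of sklearn or pandas, and it assumes the DataFrame has a unique index. The demo data is made up and only mirrors the class counts of the question's sample.

```python
import pandas as pd

def split_by_class_fraction(df, label_col, train_frac, random_state=None):
    """Split df so that train_frac[label] of each class goes to training.

    Hypothetical helper; assumes df has a unique index.
    """
    train_parts = []
    for label, group in df.groupby(label_col):
        # Number of training rows for this class, rounded to an integer.
        n_train = int(round(len(group) * train_frac[label]))
        train_parts.append(group.sample(n=n_train, random_state=random_state))
    train = pd.concat(train_parts)
    # Everything that was not sampled for training becomes the test set.
    test = df.drop(train.index)
    return train, test

# Illustrative data with 9, 7 and 4 rows for classes 0, 1 and 2.
df = pd.DataFrame({
    "F1": range(20),
    "F2": range(20, 40),
    "y": [0] * 9 + [1] * 7 + [2] * 4,
})

train, test = split_by_class_fraction(
    df, "y", train_frac={0: 0.5, 1: 0.6, 2: 0.9}, random_state=0
)
print(train["y"].value_counts().sort_index())
print(test["y"].value_counts().sort_index())
```

Be careful with rounding on very small classes: with only 4 rows of class 2, round(4 * 0.9) is 4, so the testing set ends up with no rows of the rare class. A fraction of 0.75 would give the 3/1 split mentioned in the question.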
Bruno Carballo
  • Thanks. I read about `stratify`. For example, it's possible to specify `stratify=y`. However I cannot understand how to apply it to solving my task. Can you put an example? – ScalaBoy Jan 04 '19 at 19:35
  • Please substitute `dnp` with `np` – ScalaBoy Jan 04 '19 at 20:07