6

I have a very imbalanced dataset. I used sklearn's train_test_split function to extract the train dataset. Now I want to oversample the train dataset, so I started by counting the number of type1 samples (my dataset has two categories/types, type1 and type2), but nearly all of my train data turned out to be type1, so I can't oversample.

Previously I split the train and test datasets with my own code. In that code, 0.8 of all type1 data and 0.8 of all type2 data ended up in the train dataset.

How can I use this method with the train_test_split function or other splitting methods in sklearn?

*I should only use sklearn or my own written methods.

Maryam

3 Answers

16

You're looking for stratification: a stratified split keeps the class proportions the same in the train and test sets, so the minority class won't be missing from either one.

There's a stratify parameter in train_test_split to which you can pass the labels, e.g.:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.2)

There's also StratifiedShuffleSplit.

arnaud
  • But still, I think there is no sklearn method to implement oversampling. – Maryam May 19 '20 at 07:36
  • Nope, but here's a library built a la sklearn that could prove useful: `imbalanced-learn` https://github.com/scikit-learn-contrib/imbalanced-learn (a minimal usage sketch follows below). – arnaud May 19 '20 at 07:44
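
Following up on that comment, here is a minimal sketch of oversampling the training split with imbalanced-learn's RandomOverSampler (imbalanced-learn is a separate package, not part of sklearn, and the variable names simply continue the example above):

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Stratified split first, so both classes are present in the train set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.2)

# Randomly duplicate minority-class samples until the classes are balanced
ros = RandomOverSampler(random_state=0)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)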
2

It seems like we both had similar issues here. Unfortunately, imbalanced-learn isn't always what you need, and scikit-learn does not offer the functionality you want, so you will want to implement your own code.

This is what I came up with for my application. Note that I have not had extensive time to debug it, but I believe it works based on the testing I have done. Hope it helps:

import numpy as np

def equal_sampler(classes, data, target, test_frac):
    
    # Find the least frequent class and its fraction of the total
    _, count = np.unique(target, return_counts=True)
    fraction_of_total = min(count) / len(target)
    
    # Split that fraction further into train and test shares
    train_frac = (1 - test_frac) * fraction_of_total
    test_frac = test_frac * fraction_of_total
    
    # Initialize index lists and find the per-class length of train and test
    train = []
    train_len = int(train_frac * data.shape[0])
    test = []
    test_len = int(test_frac * data.shape[0])
    
    # Add values to train, drop them from the index and proceed to add to test
    for i in classes:
        indices = list(target[target == i].index.copy())
        train_temp = np.random.choice(indices, train_len, replace=False)
        for val in train_temp:
            train.append(val)
            indices.remove(val)
        test_temp = np.random.choice(indices, test_len, replace=False)
        for val in test_temp:
            test.append(val)
    
    # X_train, y_train, X_test, y_test
    return data.loc[train], target[train], data.loc[test], target[test]

For the input, classes expects a list of the possible class values, data expects the DataFrame columns used for prediction, and target expects the target column.
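
For illustration, here is a minimal usage sketch with made-up data (the DataFrame df, the columns f1, f2, and label, and the class names are hypothetical, not part of the answer above):

import pandas as pd

# Toy data: 6 samples of type1 and 4 of type2
df = pd.DataFrame({
    'f1': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'f2': [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
    'label': ['type1'] * 6 + ['type2'] * 4,
})

X_train, y_train, X_test, y_test = equal_sampler(
    classes=['type1', 'type2'],
    data=df[['f1', 'f2']],
    target=df['label'],
    test_frac=0.25,
)
# Each class contributes the same number of rows to the train and test sets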

Take care that the algorithm may not be extremely efficient due to the nested for-loops (list.remove itself takes linear time, making it effectively a triple loop). Despite that, it should be reasonably fast.

Vlado
2

You may also look into StratifiedShuffleSplit, as follows:

# We use a utility to generate artificial classification data.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=100, n_informative=10, n_classes=2)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    # Each split preserves the class proportions of y
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
zeroandone