
I am dealing with an unbalanced classification problem, where my negative class is 1000 times more numerous than my positive class. My strategy is to train a deep neural network on a balanced (50/50 ratio) training set (I have enough simulated samples), and then use an unbalanced (1/1000 ratio) validation set to select the best model and optimise the hyperparameters.

Since the number of hyperparameters to tune is significant, I want to use scikit-learn's RandomizedSearchCV, i.e. a randomized grid search.

To my understanding, scikit-learn's GridSearchCV applies a metric on (cross-validation splits of) the training set to select the best set of hyperparameters. In my case, however, this means that the grid search will select the model that performs best against the balanced training set, and not against more realistic unbalanced data.
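
For reference, my current workflow is roughly the manual loop below: fit on the balanced training set, score on the unbalanced validation set, then adjust the hyperparameters by hand (the MLPClassifier is just a stand-in for my actual network):

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(64, 64))  # stand-in for the actual network
clf.fit(X_train, Y_train)         # balanced (50/50) training set
print(clf.score(X_test, Y_test))  # unbalanced (1/1000) validation set
# ...then tune the hyperparameters by hand and repeat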

My question would be: is there a way to grid search with the performance estimated on a specific, user-defined validation set?

  • Maybe [PredefinedSplit](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html#sklearn.model_selection.PredefinedSplit) is what you need - Read more [here](http://scikit-learn.org/stable/modules/cross_validation.html#predefined-fold-splits-validation-sets). Pass this to `cv` parameter of RandomizedSearchCV or GridSearchCV. – Vivek Kumar May 03 '17 at 16:05
  • It looks like it is. Thank you a lot! By the way with the "PredefinedSplit" keyword I was able to find [a similar StackOverflow question](http://stackoverflow.com/questions/31948879/using-explict-predefined-validation-set-for-grid-search-with-sklearn). Can I choose your comment as accepted answer? – Milleuros May 03 '17 at 16:28
  • Possible duplicate of [Using explict (predefined) validation set for grid search with sklearn](http://stackoverflow.com/questions/31948879/using-explict-predefined-validation-set-for-grid-search-with-sklearn) – Arya McCarthy May 03 '17 at 16:32
  • No need. Just upvote that answer there, if it helps you. If any doubt or difficulty in implementing it, comment here with updated question. – Vivek Kumar May 03 '17 at 16:35
  • It is indeed a duplicate, I'll mark it as such. But before that: I have trouble using the _PredefinedSplit_ method; I don't really get how it works. If I have e.g. numpy arrays X_train, Y_train, X_test, Y_test, how can I use the method? – Milleuros May 03 '17 at 16:35
  • I did not understand. Is X_test your validation set?? – Vivek Kumar May 03 '17 at 16:39
  • @VivekKumar Yes it is. X contains my variables, Y contains my labels. Typically, when not grid searching, I would first use `classifier.fit(X_train,Y_train)` and then later use `classifier.score(X_test,Y_test)` to check the performance, and then loop back to tuning the hyperparameters. In the other thread, the suggested line is `ps = PredefinedSplit(test_fold=your_test_fold)`, but I do not understand the format of `your_test_fold`, what info it should contain, etc. – Milleuros May 03 '17 at 16:45
  • Now that you put it like that: do you want to train only on `X_train` and apply the metric only on `X_test`? So in the GridSearchCV or RandomizedSearchCV, the training and testing data never change and are fixed to `X_train` and `X_test`? – Vivek Kumar May 03 '17 at 17:07
  • @VivekKumar Yes I do. – Milleuros May 03 '17 at 17:09

1 Answer


As suggested in the comments, what you need is PredefinedSplit. It is also described in [this question](http://stackoverflow.com/questions/31948879/using-explict-predefined-validation-set-for-grid-search-with-sklearn).

As for how it works, you can look at the example given in the documentation:

import numpy as np
from sklearn.model_selection import PredefinedSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

# This is what you need
test_fold = [0, 1, -1, 1]

ps = PredefinedSplit(test_fold)
print(ps.get_n_splits())
# OUTPUT: 2

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# OUTPUT:
# TRAIN: [1 2 3] TEST: [0]
# TRAIN: [0 2] TEST: [1 3]

As you can see, you assign to `test_fold` a list with one entry per sample, which determines how the data is split. A value of -1 marks samples that are never included in any validation set, i.e. they always stay in the training part.

So in the code above, `test_fold = [0, 1, -1, 1]` says that the 1st validation set consists of the samples whose value in `test_fold` is 0 (index 0), and the 2nd validation set of the samples whose value is 1 (indices 1 and 3). The sample at index 2 (value -1) is only ever used for training.
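
In your situation you only need a single validation fold, so `test_fold` will contain only -1 and 0, and PredefinedSplit yields exactly one train/validation split. A quick sketch with toy sizes (not your actual arrays) to make that concrete:

from sklearn.model_selection import PredefinedSplit

# 3 samples kept for training only (-1) and 2 samples forming validation fold 0
test_fold = [-1, -1, -1, 0, 0]
ps = PredefinedSplit(test_fold)

print(ps.get_n_splits())
# OUTPUT: 1   (a single train/validation split)

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
# OUTPUT:
# TRAIN: [0 1 2] TEST: [3 4]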

Now, since you have X_train and X_test and want your validation set to consist only of X_test, you need to do the following:

import numpy as np
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

my_test_fold = []

# put -1 here, so these samples will always stay in the training set
for i in range(len(X_train)):
    my_test_fold.append(-1)

# for all the following indices, assign 0, so they form the validation set
for i in range(len(X_test)):
    my_test_fold.append(0)

ps = PredefinedSplit(test_fold=my_test_fold)

clf = RandomizedSearchCV( ..., cv=ps)

# Combine X_train and X_test (and the labels) into single arrays,
# in the same order as my_test_fold:
clf.fit(np.concatenate((X_train, X_test), axis=0),
        np.concatenate((y_train, y_test), axis=0))
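
One caveat: with the default refit=True, RandomizedSearchCV refits the best estimator on everything you passed to `fit`, i.e. including the X_test samples. If you want the selected hyperparameters but a final model trained only on X_train, something along these lines should work (a sketch, assuming your estimator accepts the searched parameters directly via set_params):

from sklearn.base import clone

print(clf.best_params_)  # hyperparameters that scored best on the X_test fold

# Re-train a fresh copy of the estimator on X_train only,
# using the hyperparameters selected by the search:
best_model = clone(clf.estimator).set_params(**clf.best_params_)
best_model.fit(X_train, y_train)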