I am building an MLPClassifier model in scikit-learn. I used GridSearchCV with roc_auc to score the model. Mean train and test scores are around 0.76, not bad. The output of cv_results_ is:
Train set AUC: 0.553465272412
Grid best score (AUC): 0.757236688092
Grid best parameter (max. AUC): {'hidden_layer_sizes': 10}
{ 'mean_fit_time': array([63.54, 136.37, 136.32, 119.23, 121.38, 124.03]),
'mean_score_time': array([ 0.04, 0.04, 0.04, 0.05, 0.05, 0.06]),
'mean_test_score': array([ 0.76, 0.74, 0.75, 0.76, 0.76, 0.76]),
'mean_train_score': array([ 0.76, 0.76, 0.76, 0.77, 0.77, 0.77]),
'param_hidden_layer_sizes': masked_array(data = [5 (5, 5) (5, 10) 10 (10, 5) (10, 10)],
mask = [False False False False False False],
fill_value = ?)
,
'params': [ {'hidden_layer_sizes': 5},
{'hidden_layer_sizes': (5, 5)},
{'hidden_layer_sizes': (5, 10)},
{'hidden_layer_sizes': 10},
{'hidden_layer_sizes': (10, 5)},
{'hidden_layer_sizes': (10, 10)}],
'rank_test_score': array([ 2, 6, 5, 1, 4, 3]),
'split0_test_score': array([ 0.76, 0.75, 0.75, 0.76, 0.76, 0.76]),
'split0_train_score': array([ 0.76, 0.75, 0.75, 0.76, 0.76, 0.76]),
'split1_test_score': array([ 0.77, 0.76, 0.76, 0.77, 0.76, 0.76]),
'split1_train_score': array([ 0.76, 0.75, 0.75, 0.76, 0.76, 0.76]),
'split2_test_score': array([ 0.74, 0.72, 0.73, 0.74, 0.74, 0.75]),
'split2_train_score': array([ 0.77, 0.77, 0.77, 0.77, 0.77, 0.77]),
'std_fit_time': array([47.59, 1.29, 1.86, 3.43, 2.49, 9.22]),
'std_score_time': array([ 0.01, 0.01, 0.01, 0.00, 0.00, 0.01]),
'std_test_score': array([ 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]),
'std_train_score': array([ 0.01, 0.01, 0.01, 0.01, 0.01, 0.00])}
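(For readability, that cv_results_ dictionary can also be viewed as a table; a small sketch, assuming the fitted grid_auc object from the code below and that pandas is available:)

import pandas as pd

# hypothetical convenience view of cv_results_, not part of the run above
df = pd.DataFrame(grid_auc.cv_results_)
print(df[['param_hidden_layer_sizes', 'mean_train_score',
          'mean_test_score', 'rank_test_score']])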
As you can see I use a KFold of 3. Interestingly, the roc_auc_score of the train set, computed manually, comes out at about 0.55, while the mean train score is reported as ~0.76. The code that generates this output is:
import time
import pprint
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

def model_mlp (X_train, y_train, verbose=True, random_state = 42):
    t = time.time ()          # start timer for the runtime report below
    grid_values = {'hidden_layer_sizes': [(5), (5, 5), (5, 10),
                                          (10), (10, 5), (10, 10)]}
    # MLP requires scaling of all predictors
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    mlp = MLPClassifier(solver='adam', learning_rate_init=1e-4,
                        max_iter=200,
                        verbose=False,
                        random_state=random_state)
    # perform the grid search
    grid_auc = GridSearchCV(mlp,
                            param_grid=grid_values,
                            scoring='roc_auc',
                            verbose=2, n_jobs=-1)
    grid_auc.fit(X_train, y_train)
    y_hat = grid_auc.predict(X_train)
    # print out the results
    if verbose:
        print('Train set AUC: ', roc_auc_score(y_train, y_hat))
        print('Grid best score (AUC): ', grid_auc.best_score_)
        print('Grid best parameter (max. AUC): ', grid_auc.best_params_)
        print('')
        pp = pprint.PrettyPrinter(indent=4)
        pp.pprint (grid_auc.cv_results_)
        print ('MLPClassifier fitted, {:.2f} seconds used'.format (time.time () - t))
    return grid_auc.best_estimator_
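As a side note, the 'Grid best score' printed above is just the mean_test_score of the winning row in cv_results_; a quick sanity check of that (a sketch, assuming the fitted grid_auc from the function above):

best = grid_auc.best_index_                            # index of the winning parameter set
print(grid_auc.cv_results_['mean_test_score'][best])   # equals grid_auc.best_score_ (~0.76)
print(grid_auc.cv_results_['mean_train_score'][best])  # the ~0.76 mean train score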
Because of this difference I decided to 'emulate' the GridSearchCV
routine and got the following results:
Shape X_train: (107119, 15)
Shape y_train: (107119,)
Shape X_val: (52761, 15)
Shape y_val: (52761,)
layers roc-auc
Seq l1 l2 train validation iters runtime
1 5 0 0.5522 0.5488 85 20.54
2 5 5 0.5542 0.5513 80 27.10
3 5 10 0.5544 0.5521 83 28.56
4 10 0 0.5532 0.5516 61 15.24
5 10 5 0.5540 0.5518 54 19.86
6 10 10 0.5507 0.5474 56 21.09
The scores are all around 0.55, consistent with the manual computation in the code above. What surprised me is the lack of variation in the results. It looks as if I am making a mistake somewhere, but I cannot find it; see the code:
from sklearn.model_selection import train_test_split   # remaining imports as above

def simple_mlp (X, y, verbose=True, random_state = 42):
    def do_mlp (X_t, X_v, y_t, y_v, n, l1, l2=None):
        if l2 is None:
            layers = (l1)      # a bare int: one hidden layer with l1 units
            l2 = 0
        else:
            layers = (l1, l2)
        t = time.time ()
        mlp = MLPClassifier(solver='adam', learning_rate_init=1e-4,
                            hidden_layer_sizes=layers,
                            max_iter=200,
                            verbose=False,
                            random_state=random_state)
        mlp.fit(X_t, y_t)
        y_hat_train = mlp.predict(X_t)
        y_hat_val = mlp.predict(X_v)
        if verbose:
            av = 'samples'
            acc_trn = roc_auc_score(y_t, y_hat_train, average=av)
            acc_tst = roc_auc_score(y_v, y_hat_val, average=av)
            print ("{:5d}{:4d}{:4d}{:7.4f}{:7.4f}{:9d}{:8.2f}"
                   .format(n, l1, l2, acc_trn, acc_tst, mlp.n_iter_, time.time() - t))
        return mlp, n + 1

    X_train, X_val, y_train, y_val = train_test_split (X, y, test_size=0.33, random_state=random_state)
    if verbose:
        print('Shape X_train:', X_train.shape)
        print('Shape y_train:', y_train.shape)
        print('Shape X_val:', X_val.shape)
        print('Shape y_val:', y_val.shape)
    # MLP requires scaling of all predictors
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    n = 1
    layers1 = [5, 10]
    layers2 = [5, 10]
    if verbose:
        print (" layers roc-auc")
        print (" Seq l1 l2 train validation iters runtime")
    for l1 in layers1:
        mlp, n = do_mlp (X_train, X_val, y_train, y_val, n, l1)
        for l2 in layers2:
            mlp, n = do_mlp (X_train, X_val, y_train, y_val, n, l1, l2)
    return mlp
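As an extra cross-check (a sketch I have not run; X and y are the same full data passed to simple_mlp), the emulation could presumably also be done with cross_val_score on a scaler-plus-MLP pipeline, which should be roughly comparable to one row of cv_results_, apart from where the scaling happens:

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# hypothetical: 3-fold CV of one configuration, outside GridSearchCV
pipe = make_pipeline(StandardScaler(),
                     MLPClassifier(solver='adam', learning_rate_init=1e-4,
                                   hidden_layer_sizes=10, max_iter=200,
                                   random_state=42))
scores = cross_val_score(pipe, X, y, scoring='roc_auc', cv=3)
print(scores, scores.mean())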
I use exactly the same data in both cases (159880 observations and 15 predictors). I use cv=3 (the default) for GridSearchCV and the same proportion (one third) for the validation set in my handcrafted code.
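(If it matters, my understanding is that for a classifier an integer cv=3 corresponds to a stratified, non-shuffled 3-fold split; making that explicit would look roughly like this, reusing mlp and grid_values from model_mlp above:)

from sklearn.model_selection import StratifiedKFold

# hypothetical: spell out the default 3-fold splitter instead of passing cv=3
grid_auc = GridSearchCV(mlp, param_grid=grid_values, scoring='roc_auc',
                        cv=StratifiedKFold(n_splits=3), verbose=2, n_jobs=-1)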
When searching for a possible answer I found this post on SO, which describes the same problem but got no answer. Maybe someone understands what exactly is happening here?
Thanks for your time.
Edit
I checked the code of GridSearchCV and KFold, as @Mohammed Kashif suggested, and indeed found an explicit remark that KFold does not shuffle the data. So I added the following code to model_mlp, just before the scaler:
np.random.seed (random_state)
index = np.random.permutation (len(X_train))
X_train = X_train.iloc[index]
and this into simple_mlp as a replacement for train_test_split:
np.random.seed (random_state)
index = np.random.permutation (len(X))
X = X.iloc[index]
y = y.iloc[index]
train_size = int (2 * len(X) / 3.0) # use two thirds for training
X_train = X[:train_size]
X_val = X[train_size:]
y_train = y[:train_size]
y_val = y[train_size:]
Which resulted in the following output:
Train set AUC: 0.5
Grid best score (AUC): 0.501410198106
Grid best parameter (max. AUC): {'hidden_layer_sizes': (5, 10)}
{ 'mean_fit_time': array([28.62, 46.00, 54.44, 46.74, 55.25, 53.33]),
'mean_score_time': array([ 0.04, 0.05, 0.05, 0.05, 0.05, 0.06]),
'mean_test_score': array([ 0.50, 0.50, 0.50, 0.50, 0.50, 0.50]),
'mean_train_score': array([ 0.50, 0.51, 0.51, 0.51, 0.50, 0.51]),
'param_hidden_layer_sizes': masked_array(data = [5 (5, 5) (5, 10) 10 (10, 5) (10, 10)],
mask = [False False False False False False],
fill_value = ?)
,
'params': [ {'hidden_layer_sizes': 5},
{'hidden_layer_sizes': (5, 5)},
{'hidden_layer_sizes': (5, 10)},
{'hidden_layer_sizes': 10},
{'hidden_layer_sizes': (10, 5)},
{'hidden_layer_sizes': (10, 10)}],
'rank_test_score': array([ 6, 2, 1, 4, 5, 3]),
'split0_test_score': array([ 0.50, 0.50, 0.51, 0.50, 0.50, 0.50]),
'split0_train_score': array([ 0.50, 0.51, 0.50, 0.51, 0.50, 0.51]),
'split1_test_score': array([ 0.50, 0.50, 0.50, 0.50, 0.49, 0.50]),
'split1_train_score': array([ 0.50, 0.50, 0.51, 0.50, 0.51, 0.51]),
'split2_test_score': array([ 0.49, 0.50, 0.49, 0.50, 0.50, 0.50]),
'split2_train_score': array([ 0.51, 0.51, 0.51, 0.51, 0.50, 0.51]),
'std_fit_time': array([19.74, 19.33, 0.55, 0.64, 2.36, 0.65]),
'std_score_time': array([ 0.01, 0.01, 0.00, 0.01, 0.00, 0.01]),
'std_test_score': array([ 0.01, 0.00, 0.01, 0.00, 0.00, 0.00]),
'std_train_score': array([ 0.00, 0.00, 0.00, 0.00, 0.00, 0.00])}
which appears to confirm Mohammed's remarks. I must say I was quite sceptical at first, as I could not imagine such a strong impact of randomization on such a big dataset that does not really look ordered.
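(For what it is worth, a built-in alternative to permuting the data by hand would presumably be to hand GridSearchCV a shuffled splitter; a sketch I have not run, reusing mlp and grid_values from model_mlp:)

from sklearn.model_selection import StratifiedKFold

# hypothetical: let the splitter shuffle instead of permuting X_train manually
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
grid_auc = GridSearchCV(mlp, param_grid=grid_values, scoring='roc_auc',
                        cv=cv, verbose=2, n_jobs=-1)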
I have some doubts however. In the original setup GridSearchCV came out consistently too high, by about 0.20; now it is consistently too low, by about 0.05. This is an improvement, as the deviation between the two methods has decreased by a factor of 4. Is there an explanation for this last finding, or is a deviation of about 0.05 between the two methods simply noise? I decided to mark this as the correct answer, but I hope somebody can shed some light on my little doubt.