
I am new to Python and I have been trying to figure out how GridSearchCV and cross_val_score work.

Finding odd results, I set up a sort of validation experiment, but I still do not understand what I am doing wrong.

To simplify, I am using GridSearchCV in the simplest possible way and trying to validate and understand what is happening.

Here it is:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression, RFECV
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV,Ridge, LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV,KFold,TimeSeriesSplit,PredefinedSplit,cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer, r2_score, mean_absolute_error
from math import sqrt

I create a cross-validation object (for GridSearchCV and cross_val_score) and a train/test dataset for the pipeline and the simple linear regression. I have checked that the two datasets are identical:

# X and y (21 samples) are defined earlier and not shown here.
# -1 marks rows that always stay in the training set; 0 assigns the last 6 rows
# to test fold 0, so kf yields a single train/test split.
train_indices = np.full((15,), -1, dtype=int)
test_indices = np.full((6,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)
kf = PredefinedSplit(test_fold)

for train_index, test_index in kf.split(X):
    print('TRAIN:', train_index, 'TEST:', test_index)
    X_train_kf = X[train_index]
    X_test_kf = X[test_index]

train_data = list(range(0,15))
test_data = list(range(15,21))

X_train, y_train=X[train_data,:],y[train_data]
X_test, y_test=X[test_data,:],y[test_data]
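
The check that the predefined split and the manual split select the same rows looked roughly like this:

# the PredefinedSplit defines exactly one split, so take its only train/test pair
train_index, test_index = next(kf.split(X))
assert np.array_equal(train_index, train_data)
assert np.array_equal(test_index, test_data)
assert np.array_equal(X[train_index], X_train)
assert np.array_equal(y[test_index], y_test)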

Here is what I do:

Instantiate a simple linear model and use it with the manually split data:

lr=LinearRegression()
lm=lr.fit(X,y)
lmscore_train=lm.score(X_train,y_train) 

->r2=0.4686662249071524

lmscore_test=lm.score(X_test,y_test)

->r2 0.6264021467338086

Now I try to do the exact same thing using a pipeline:

pipe_steps = [('est', LinearRegression())]
pipe=Pipeline(pipe_steps)
p=pipe.fit(X,y)
pscore_train=p.score(X_train,y_train) 

->r2=0.4686662249071524

pscore_test=p.score(X_test,y_test)

->r2 0.6264021467338086

LinearRegression and the pipeline match perfectly.

Now I try to do the same using cross_val_score with the predefined split kf:

cv_scores = cross_val_score(lm, X, y, cv=kf)  

->r2 = -1.234474757883921470e+01?!?! (this is supposed to be the test score)

Now let's try GridSearchCV:

scoring = {'r_squared':'r2'}
grid_parameters = [{}] 
gridsearch = GridSearchCV(p, grid_parameters, verbose=3, cv=kf, scoring=scoring, return_train_score=True, refit='r_squared')
gs=gridsearch.fit(X,y)
results=gs.cv_results_

From cv_results_ I once again get mean_test_r_squared -> -1.234474757883921292e+01

So cross_val_score and GridSearchCV match one another in the end, but the score is totally off and different from what it should be.

Will you please help me solve this puzzle?

  • cross_val_score will do a cross-validation and calculate the score on each split. GridSearchCV will perform a cross-validation for each parameter combination of your grid and calculate the score. It's used to benchmark hyperparameters. Also, building a model on a sample of 15 observations might not be enough. – Mohamed AL ANI May 31 '18 at 17:05

1 Answer


cross_val_score and GridSearchCV will first split the data, train the model on the train data only and then score on test data.

Here you are training on the full data and then scoring on the test data. Hence you don't match the results of cross_val_score.
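
Schematically (simplified: the real implementation clones the estimator and handles scorers more generally), this is what cross_val_score(lm, X, y, cv=kf) does with your predefined split, and it reproduces the negative score you are seeing:

from sklearn.base import clone

for train_index, test_index in kf.split(X):
    est = clone(lr)                                # fresh, unfitted estimator
    est.fit(X[train_index], y[train_index])        # fit on the 15 training rows only
    print(est.score(X[test_index], y[test_index])) # R^2 on the 6 held-out rows -> about -12.34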

Instead of this:

lm=lr.fit(X,y)

Try this:

lm=lr.fit(X_train, y_train)

Same for pipeline:

Instead of p=pipe.fit(X,y), do this:

p=pipe.fit(X_train, y_train)

You can look at my answers for more description:-

  • Hello Vivek, you are right. Using lm=lr.fit(X_train, y_train) gives -> r2=-12.44 in lm and the pipeline as well. What also bothered me is that to my knowledge r2 is by design always between 0 and 1, being the coefficient of determination. Do you have a clue why I get -12.44? Thanks again for your help. – Luca Fichera Jun 01 '18 at 12:46
  • No. r2 can be negative. Please look at these posts: [Post1](https://stats.stackexchange.com/q/12900/133411) and [Post2](https://stats.stackexchange.com/q/183265/133411) – Vivek Kumar Jun 01 '18 at 13:04
  • @LucaFichera If this answer helped, please consider [accepting the answer](https://stackoverflow.com/help/someone-answers) – Vivek Kumar Jun 01 '18 at 13:11
  • Hello, finally I was able to cross-validate everything through Excel. Actually, I also found useful the explanation given in scikit-learn itself: http://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination – Luca Fichera Jun 04 '18 at 10:06