How do I standardize the data in GridSearchCV?

Here is my code. I don't know how to add the standardization step.

# import dataset
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
dataset = pd.read_excel('../dataset/dataset_experiment1.xlsx')
X = dataset.iloc[:,1:-1].values
y = dataset.iloc[:,66].values

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
#from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
stdizer = StandardScaler()

print('===Grid Search===')

print('logistic regression')
model = LogisticRegression()
parameter_grid = {'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
grid_search = GridSearchCV(model, param_grid=parameter_grid, cv=kfold, scoring = scoring3)
grid_search.fit(X, y)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
print('\n')

Update: This is what I tried to run, but it raises an error:

print('logistic regression')
model = LogisticRegression()
pipeline = Pipeline([('scale', StandardScaler()), ('clf', model)])
parameter_grid = {'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
grid_search = GridSearchCV(pipeline, param_grid=parameter_grid, cv=kfold, scoring = scoring3)
grid_search.fit(X, y)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
print('\n')
Bose Sanamchai

2 Answers

2

Use sklearn.pipeline.Pipeline

Demo:

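import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler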

X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.33)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression())
])

param_grid = [
    {
        'clf__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
        'clf__C': np.logspace(-3, 1, 5),
    },
]

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid.fit(X_train, y_train)
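
Because the scaler is a step inside the pipeline, it is fit again on the training part of each fold, and the final refit pipeline scales new data with statistics learned from the training set only. The following is a minimal sketch of inspecting the result; the synthetic data from `make_classification`, the fixed `random_state`, and `max_iter=1000` are assumptions standing in for the question's spreadsheet and settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's data (an assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
param_grid = {'clf__C': np.logspace(-3, 1, 5)}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)

# The refit pipeline applies the training-set scaling to X_test before scoring
test_score = grid.score(X_test, y_test)
print(test_score)

# The scaler inside the best pipeline was fit on X_train only
scaler = grid.best_estimator_.named_steps['scale']
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
```

Calling `grid.predict` or `grid.score` is enough at prediction time; there is no need to scale `X_test` by hand.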
MaxU - stand with Ukraine
  • The console shows an error: ValueError: Invalid parameter solver for estimator Pipeline(memory=None, steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False))]). Check the list of available parameters with `estimator.get_params().keys()` – Bose Sanamchai Apr 11 '18 at 13:51
  • @BoseSanamchai, and what does `pipe.get_params().keys()` return? – MaxU - stand with Ukraine Apr 11 '18 at 13:54
  • I updated the question with the pipeline code I tried to run. Could you take a look? – Bose Sanamchai Apr 11 '18 at 14:27
  • @BoseSanamchai, pay attention to how I used `param_grid`, or better, simply run my code to understand how it works... – MaxU - stand with Ukraine Apr 11 '18 at 14:55
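
The ValueError quoted in the comments comes from how a Pipeline names its parameters: each step's parameters are exposed as the step name, a double underscore, and the parameter name. A grid key of `solver` is therefore unknown to the pipeline, while `clf__solver` is valid. A minimal sketch that checks this, reusing the step names `scale` and `clf` from the answer:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Every key that GridSearchCV accepts in param_grid for this pipeline
valid_keys = set(pipe.get_params().keys())

print('solver' in valid_keys)       # False: bare name is not a pipeline parameter
print('clf__solver' in valid_keys)  # True: step name + '__' + parameter name
```

In the question's updated code, renaming the grid key from `'solver'` to `'clf__solver'` should be enough to get past this particular error.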
0

If you use refit=True, GridSearchCV refits the best model on the full training data, so you can use it directly for predictions. You can also use cv_results_ to find the best row by rank_test_score and then extract the parameters from that row. If the search space becomes large, consider RandomizedSearchCV, which samples parameter combinations instead of trying all of them.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression())
])

param_grid = [
    {
        'clf__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
        'clf__C': np.logspace(-3, 1, 5),
    },
]

grid_class = GridSearchCV(
    estimator=pipe,          # the pipeline defined above
    param_grid=param_grid,   # keys use the step__parameter naming
    scoring='accuracy',
    n_jobs=4,  # use 4 cores
    cv=10,     # 10 folds
    refit=True,
    return_train_score=True)

grid_class.fit(X_train, y_train)

# With refit=True, predict uses the best pipeline refit on X_train
predictions = grid_class.predict(X_test)

cv_results_df = pd.DataFrame(grid_class.cv_results_)

# Row(s) with the best mean test score
best_row = cv_results_df[cv_results_df["rank_test_score"] == 1]

print(best_row)

params_column = cv_results_df.loc[:, ['params']]
print(params_column)
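
The randomized search mentioned in this answer corresponds to scikit-learn's `RandomizedSearchCV`, which samples a fixed number of parameter combinations rather than evaluating the whole grid. A rough sketch follows; the synthetic data, the `loguniform` range for `C`, the two solvers, and `n_iter=5` are illustrative assumptions, not values from the question.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's data (an assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Distributions or lists to sample from; keys use the same step__parameter naming
param_distributions = {
    'clf__C': loguniform(1e-3, 1e1),
    'clf__solver': ['lbfgs', 'liblinear'],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled candidates
    cv=3,
    scoring='accuracy',
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(len(search.cv_results_['params']))  # one entry per sampled candidate
```

The cost is controlled by `n_iter` rather than by the size of the grid, which is why it scales better when many parameters are tuned at once.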
Golden Lion