I am trying to tune the gamma parameter of a precomputed RBF kernel using GridSearchCV() and Pipeline in scikit-learn. I followed the explanation in the following two StackOverflow links:
- Is it possible to tune parameters with grid search for custom kernels in scikit-learn?
- how to tune parameters of custom kernel function with pipeline in scikit-learn
However, these two links show examples using scikit-learn's built-in chi2_kernel and rbf_kernel functions, whereas I am interested in writing my own Gram matrix kernel, as shown in the minimal working example below.
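For reference, my hand-rolled Gram matrix is meant to match what rbf_kernel would produce; here is a quick sanity-check sketch (the array names and sizes are illustrative, not from my real data):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.rand(5, 3)   # illustrative "train" data
Y = np.random.rand(4, 3)   # illustrative "test" data
gamma = 0.5

# K[i, j] = exp(-gamma * ||Y_i - X_j||^2), the same formula rbf_kernel uses
G = np.exp(-gamma * np.square(cdist(Y, X, 'euclidean')))
assert np.allclose(G, rbf_kernel(Y, X, gamma=gamma))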
Please note that I have intentionally defined the Train and Test sets inside the main() function body because of the structure of my original problem: it involves a for loop that loads multiple datasets from a directory in order to solve a binary one-vs-one classification problem. I therefore want to keep the Train and Test datasets in the main function body. I must also compute the Gram matrices G_Train and G_Test separately (not in a single step), as I do in my example code.
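As far as I understand from the SVC documentation, with kernel='precomputed' the matrix passed to fit() must be square with shape (n_train, n_train), while the matrix passed to predict()/score() must have shape (n_test, n_train). Here is a self-contained sketch of that shape contract (dummy data and labels, purely for illustration):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.svm import SVC

Train = np.random.rand(5, 3)   # 5 training samples, 3 features
Test = np.random.rand(4, 3)    # 4 test samples, 3 features
gamma = 1.0

G_Train = np.exp(-gamma * np.square(cdist(Train, Train, 'euclidean')))  # (5, 5)
G_Test = np.exp(-gamma * np.square(cdist(Test, Train, 'euclidean')))    # (4, 5)

svm = SVC(kernel='precomputed').fit(G_Train, [0, 0, 1, 1, 1])
print(svm.predict(G_Test))     # one prediction per test sample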
One can replace my dummy dataset with Iris or any other dataset.
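For instance, a quick sketch of swapping in Iris (using load_iris and train_test_split; the split parameters are just an example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
Train, Test, Train_label, Test_label = train_test_split(X, y, random_state=0)

Here is my minimal working example: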
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from scipy.spatial.distance import cdist
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import ParameterGrid
import sklearn
import sys

class myKernel(BaseEstimator, TransformerMixin):
    def __init__(self, Train, Test, gamma=1.0):
        super(myKernel, self).__init__()
        self.gamma = gamma
        self.Train = Train
        self.Test = Test

    def fit(self, **fit_params):
        return self

    def transform(self):
        gamma = self.gamma
        Train = self.Train
        Test = self.Test
        G_Train = np.exp(-gamma * np.square(cdist(Train, Train, 'euclidean')))
        G_Test = np.exp(-gamma * np.square(cdist(Test, Train, 'euclidean')))
        return G_Train, G_Test

def main():
    print('python: {}'.format(sys.version))
    print('numpy: {}'.format(np.__version__))
    print('sklearn: {}'.format(sklearn.__version__))
    print()

    np.random.seed(0)
    Train = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
    Test = np.array([[4, 5, 6], [0, 1, 0], [1, 2, 1], [0, 4, 1]])
    Train_label = [1, 1, 1, 0, 0]
    Test_label = [0, 0, 1, 1]

    my_kernel = myKernel(Train, Test)
    svm = SVC(kernel='precomputed')
    pipe = Pipeline(steps=[('svm', svm)])

    p = [{'svm__C': [[1, 10]], 'svm__gamma': [[0.01, 0.1]]}]
    parameter = ParameterGrid(p)
    parameter = np.ravel(parameter)

    clf = GridSearchCV(pipe, parameter, n_jobs=-1, cv=2, refit='True')
    G_Train, G_Test = my_kernel.transform()
    print(clf.fit(G_Train, Train_label))

    # Best parameters
    print('\nBest Parameters: ', clf.best_params_)
    print('\npredicted labels: ', clf.best_estimator_.predict(G_Test))
    print("\nAccuracy on test set: {:.2f}%\n".format((clf.score(G_Test, Test_label)) * 100))

if __name__ == '__main__':
    main()
The parameter C can be tuned without any problem; however, I notice that only the first value of the gamma parameter is ever reported as the best one found. In the above example I get the following best parameters: C = 1, gamma = 0.01. No matter what values of C and gamma I put in p, I always get the first gamma in the sequence. Here is the output of the above code:
Output:
python: 3.5.2 |Anaconda custom (64-bit)| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
numpy: 1.13.1
sklearn: 0.19.0
GridSearchCV(cv=2, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto',
  kernel='precomputed', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid=array([{'svm__gamma': [0.01, 0.1], 'svm__C': [1, 10]}], dtype=object),
       pre_dispatch='2*n_jobs', refit='True', return_train_score=True,
       scoring=None, verbose=0)
Best Parameters: {'svm__gamma': 0.01, 'svm__C': 1}
predicted labels: [1 1 1 1]
Accuracy on test set: 50.00%
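To dig a little deeper, one can print the per-candidate scores after the search has finished; here is a small diagnostic sketch (it assumes clf has already been fitted as in main() above):

# Per-candidate scores from the finished grid search
for params, score in zip(clf.cv_results_['params'],
                         clf.cv_results_['mean_test_score']):
    print(params, score)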
I would appreciate any advice.