
I am trying to tune the gamma parameter of a precomputed RBF kernel using GridSearchCV() and Pipeline in scikit-learn. I followed the explanations in the following two StackOverflow links:

  1. Is it possible to tune parameters with grid search for custom kernels in scikit-learn?
  2. how to tune parameters of custom kernel function with pipeline in scikit-learn

However, these two links show examples that use scikit-learn's built-in chi2_kernel and rbf_kernel functions, whereas I am interested in writing my own Gram-matrix kernel, as shown in my minimal working example below.

Please note that I have intentionally constructed the Train and Test sets inside the main() function body because of the structure of my original problem: there I will have a for loop that loads multiple datasets from a directory in order to solve a binary one-vs-one classification problem, so I want to keep these Train and Test datasets in the main function body. I must also compute the Gram matrices G_Train and G_Test separately (not in a single step), exactly as I do in my example code.

One can replace my dummy dataset with the Iris dataset or any other dataset.
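
For example, a quick swap to Iris using scikit-learn's built-in loader (just a sketch; any train/test split would do):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# replace the dummy Train/Test arrays below with a real dataset
X, y = load_iris(return_X_y=True)
Train, Test, Train_label, Test_label = train_test_split(X, y, random_state=0)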

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from scipy.spatial.distance import cdist
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import ParameterGrid
import sklearn
import sys

class myKernel(BaseEstimator, TransformerMixin):
    def __init__(self, Train, Test, gamma=1.0):
        super(myKernel,self).__init__()
        self.gamma = gamma
        self.Train = Train
        self.Test = Test

    def fit(self, **fit_params):
        return self

    def transform(self):
        gamma = self.gamma
        Train = self.Train
        Test = self.Test        

        # square Gram matrix between training samples (n_train x n_train)
        G_Train = np.exp(-gamma*np.square(cdist(Train, Train, 'euclidean')))
        # rectangular Gram matrix between test and training samples (n_test x n_train)
        G_Test = np.exp(-gamma*np.square(cdist(Test, Train, 'euclidean')))
        return G_Train, G_Test

def main():   

    print('python: {}'.format(sys.version))
    print('numpy: {}'.format(np.__version__))
    print('sklearn: {}'.format(sklearn.__version__))
    print()
    np.random.seed(0)

    Train = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
    Test = np.array([[4, 5, 6],[0, 1, 0], [1, 2, 1], [0, 4, 1]])

    Train_label = [1, 1, 1, 0, 0]
    Test_label = [0, 0, 1, 1]

    my_kernel = myKernel(Train, Test)
    svm = SVC(kernel='precomputed')
    pipe = Pipeline(steps=[('svm', svm)])

    p = [{'svm__C': [[1, 10]], 'svm__gamma': [[0.01, 0.1]]}]  
    parameter = ParameterGrid(p)  
    parameter = np.ravel(parameter)

    clf = GridSearchCV(pipe, parameter, n_jobs=-1, cv=2, refit='True')

    G_Train, G_Test = my_kernel.transform() 

    print(clf.fit(G_Train, Train_label))

    #Best parameters
    print('\nBest Parameters: ', clf.best_params_)

    print('\npredicted labels: ', clf.best_estimator_.predict(G_Test))
    print("\nAccuracy on test set: {:.2f}%\n".format((clf.score(G_Test, Test_label))*100))

if __name__ == '__main__':
    main()

The parameter C can be tuned without any problem; however, I notice that only the first value in the gamma list ever shows up as the best parameter found. In the above example, I get the following best parameters: C = 1, gamma = 0.01. No matter what values of C and gamma I put in p, I always get the first gamma value in the sequence. Here is the output of the above code:

Output:

python: 3.5.2 |Anaconda custom (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
numpy: 1.13.1
sklearn: 0.19.0

GridSearchCV(cv=2, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto',
  kernel='precomputed', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid=array([{'svm__gamma': [0.01, 0.1], 'svm__C': [1, 10]}], dtype=object),
       pre_dispatch='2*n_jobs', refit='True', return_train_score=True,
       scoring=None, verbose=0)

Best Parameters:  {'svm__gamma': 0.01, 'svm__C': 1}

predicted labels:  [1 1 1 1]
Accuracy on test set: 50.00%

I will appreciate any advice.

Hello World
  • This happens for you because all parameter options produce the exact same score, so the first one is selected. Try reversing the values, like p = `[{'svm__C': [[10, 1]], 'svm__gamma': [[0.1, 0.01]]}]`, and you will get 10, 0.1 (the first combination). Anyway, I am not able to reproduce this behaviour with the Iris data. I am specifying `'svm__C': [100, 1, 10], 'svm__gamma': [1.0, 0.01, 0.1]` and getting best parameters `{'svm__C': 1, 'svm__gamma': 1.0}`. So your assumption about this is wrong. – Vivek Kumar Oct 11 '17 at 08:02
  • @VivekKumar You are partially right about parameter `C`, but not about `gamma`. You will see that no matter what value of `gamma` you choose, the best parameter found will always be the **first** value in the list. In fact, my question is specifically about tuning `gamma`. – Hello World Oct 11 '17 at 14:53
  • Sorry for the late reply. I saw that you never came back after this question until yesterday, hence posting this now (thanks to @zxmnb). The `gamma` doesn't change because SVC doesn't use `gamma` when the kernel is `precomputed` (custom), so the gamma supplied to GridSearchCV does not affect the scores in any way. You can see that yourself in [the documentation](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). The gamma in `myKernel` is set to 1.0 and is not related to the `gamma` of `SVC` in any way, so it is not tuned by GridSearch (see the sketch after these comments). – Vivek Kumar Jun 08 '18 at 04:38
  • @VivekKumar Sorry, your comment is not helpful. Can you provide a working example? – Hello World Jun 21 '18 at 02:18
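
For reference, a minimal sketch of the fix the last comments point toward: move the Gram-matrix computation into the pipeline as a transformer step, so that GridSearchCV can vary the kernel's own gamma. The class name RBFGram and the step name 'gram' are illustrative, following the pattern of the two answers linked above; the dummy data is reused from the question:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

class RBFGram(BaseEstimator, TransformerMixin):
    """Compute an RBF Gram matrix against the samples seen in fit()."""
    def __init__(self, gamma=1.0):
        self.gamma = gamma

    def fit(self, X, y=None):
        self.X_train_ = X  # remember the training samples
        return self

    def transform(self, X):
        # Gram matrix between X and the stored training samples; during
        # fitting X is the training set itself, so SVC receives a square matrix
        return np.exp(-self.gamma * np.square(cdist(X, self.X_train_, 'euclidean')))

# dummy data from the question
Train = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
Test = np.array([[4, 5, 6], [0, 1, 0], [1, 2, 1], [0, 4, 1]])
Train_label = [1, 1, 1, 0, 0]

pipe = Pipeline([('gram', RBFGram()), ('svm', SVC(kernel='precomputed'))])
p = {'gram__gamma': [0.01, 0.1, 1.0], 'svm__C': [1, 10]}
clf = GridSearchCV(pipe, p, cv=2)
clf.fit(Train, Train_label)  # raw samples go in; the pipeline builds the kernels
print(clf.best_params_)
print(clf.best_estimator_.predict(Test))

With this layout the G_Train and G_Test matrices are still computed separately under the hood: the pipeline calls transform() once on each training fold and once on each validation fold, as the question requires.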

0 Answers