I have a custom distance metric that I need to use for KNN, K Nearest Neighbors.

I tried following this, but I cannot get it to work for some reason.

I would assume that the distance metric is supposed to take two vectors/arrays of the same length, as I have written below:

import sklearn 
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

def d(a,b,L):
    # Inputs: a and b are rows from a data matrix   
    return a+b+2+L

knn=NearestNeighbors(n_neighbors=1,
                 algorithm='auto',
                 metric='pyfunc',
                 func=lambda a,b: d(a,b,L)
                 )


X=pd.DataFrame({'b':[0,3,2],'c':[1.0,4.3,2.2]})
knn.fit(X)

However, when I call: knn.kneighbors(), it doesn't seem to like the custom function. Here is the bottom of the error stack:

ValueError: Unknown metric pyfunc. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski'], or 'precomputed', or a callable

However, I see exactly the same usage in the question I cited. Any ideas on how to make this work on sklearn version 0.14? I'm not aware of any differences between the versions.

Thanks.

  • Also, your distance function is no good: it will return a vector, whereas it needs to return a single value. – maxymoo Dec 22 '15 at 03:51

1 Answer

The documentation is actually pretty clear on the use of the metric argument:

metric : string or callable, default ‘minkowski’

metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy’s metrics, but is less efficient than passing the metric name as a string.

Thus (as the error message also says), metric should be a callable, not a string. It should accept two arguments (arrays) and return a single value. Your lambda function already has the right signature for this.

Thus, your code can be simplified to:

import sklearn
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

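def d(a,b,L):
    # a and b are rows of the data matrix; L must be defined before the metric is called.
    # Note: a+b+2+L returns an array; a real metric has to return a single number.
    return a+b+2+L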

knn=NearestNeighbors(n_neighbors=1,
                 algorithm='auto',
                 metric=lambda a,b: d(a,b,L)
                 )
X=pd.DataFrame({'b':[0,3,2],'c':[1.0,4.3,2.2]})
knn.fit(X)
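
As a minimal runnable sketch of the same idea (not the asker's actual metric), the snippet below swaps the placeholder formula for a scaled Manhattan distance so that the callable returns one number per pair of rows. The scale factor `L = 2.0` and the body of `d` are assumptions chosen only for illustration, and the code assumes a reasonably recent scikit-learn rather than 0.14.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical extra parameter; the original question never defines L.
L = 2.0

def d(a, b, L):
    # a and b are 1-D arrays (two rows of the data matrix).
    # Illustrative metric: Manhattan distance scaled by L, returned as one scalar.
    return np.sum(np.abs(a - b)) * L

X = pd.DataFrame({'b': [0, 3, 2], 'c': [1.0, 4.3, 2.2]})

# Pass the callable itself as `metric`; no 'pyfunc' string and no `func` argument.
knn = NearestNeighbors(n_neighbors=1,
                       algorithm='auto',
                       metric=lambda a, b: d(a, b, L))
knn.fit(X.to_numpy())

# With no argument, kneighbors() returns the nearest neighbor of each
# training row, excluding the row itself.
distances, indices = knn.kneighbors()
print(indices.ravel())
print(distances.ravel())
```

A callable metric is evaluated in Python for every pair of rows, so it will be noticeably slower than passing a built-in metric name as a string.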
  • Thank you. The documentation I saw was [here](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) and [here](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html), neither of which is as detailed as what you cited. – makansij Dec 22 '15 at 05:20
  • I used the following code, and it is giving me a pickling error. Can you help me with this? My code: def dist2(a,b): return jaccard(a,b) knnobj = NearestNeighbors(n_neighbors=6, algorithm='auto',metric=lambda a,b: dist2(a,b)).fit(my_Data) PicklingError: Can't pickle : attribute lookup __builtin__.function failed – csalive Jan 15 '18 at 14:32