Because the dataset in which I wish to find clusters contains a mix of numerical and categorical variables, I'm using the k-prototypes algorithm to compute centroids. By default, the method defines a distance between two data points as a (weighted) sum of (i) the L2 distance between their numerical vector components and (ii) the number of categorical elements on which the two observations differ (matching distance).
This default definition of distance can be modified. In my case, I wish to modify the numerical component of the distance. Specifically, I wish to replace L2 with L1 (Manhattan distance). I do it as follows:
import numpy as np
from kmodes.kprototypes import KPrototypes

def L1(a, b):
    return np.sum(np.abs(a - b), axis=1)

model = KPrototypes(n_clusters=20, gamma=1, num_dissim=L1, init='Cao')
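To make the intended distance concrete, here is a minimal NumPy-only sketch of what I expect the modified distance to compute for one data point against two centroids. The helpers `matching` and `mixed_distance` and the toy arrays are made up for illustration; they are not part of kmodes, and the package's internals may combine the two parts differently.

```python
import numpy as np


def L1(a, b):
    # Manhattan distance between each row of `a` and the vector `b`.
    return np.sum(np.abs(a - b), axis=1)


def matching(a, b):
    # Hypothetical helper: number of categorical fields that differ, per row.
    return np.sum(a != b, axis=1)


def mixed_distance(a_num, a_cat, b_num, b_cat, gamma=1.0):
    # Hypothetical helper: numerical part plus gamma times the categorical part.
    return L1(a_num, b_num) + gamma * matching(a_cat, b_cat)


# Toy data (assumed): two centroids and one observation.
centroids_num = np.array([[0.0, 0.0], [1.0, 2.0]])
centroids_cat = np.array([["x", "y"], ["x", "z"]])
point_num = np.array([3.0, 4.0])
point_cat = np.array(["x", "z"])

distances = mixed_distance(centroids_num, centroids_cat, point_num, point_cat)
print(distances)  # [8. 4.]
```

The first centroid is at L1 distance 7 and differs in one categorical field, giving 8 with gamma=1; the second is at L1 distance 4 with no categorical mismatch.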
Things then work fine when I call the instance's fit method: centroids are found, and clusters are built. However, when I try to save the fitted model as a pickle file, I get the following error:
_pickle.PicklingError: Can't pickle <function L1 at 0x7f43d3306d08>: attribute lookup L1 on __main__ failed
Based on the thread "Python multiprocessing PicklingError: Can't pickle <type 'function'>" and on the fact that the model is saved without trouble when using the default distance function (L2 + matching), I suspect that the error is due to L1 being a custom function, i.e. one that is not defined in the package itself. I've looked into the source code of the package but couldn't find an implementation of the Manhattan distance. Am I missing something? How come such a commonly used distance isn't part of the module?
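For context on that suspicion, here is a small standard-library sketch of how pickle treats functions: it stores a reference (module and name) rather than the code, so a function it cannot look up again by name fails to pickle. This only illustrates the general mechanism; it does not establish that this is what happens with my `L1` function.

```python
import os.path
import pickle

# An importable function is pickled by reference: the payload holds the
# function's module and name, not its code.
payload = pickle.dumps(os.path.join)
restored = pickle.loads(payload)
print(restored is os.path.join)  # True

# A lambda has no name that pickle can look up again, so dumping it fails.
try:
    pickle.dumps(lambda a, b: abs(a - b))
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as exc:
    # The exact exception type can differ between Python versions.
    lambda_pickled = False
    print(type(exc).__name__)
```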