
I am trying to implement a custom distance metric for clustering. The code snippet looks like this:

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MeanShift

def distance(x, y):
    # print(x, y) -> these x and y aren't one-hot vectors, which is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(.,.) calculates count of elements when both xi and yi are True
    return distance(x, y)


vectorized_text = np.stack([[1, 0, 0, 1] * 100,
                            [1, 1, 1, 0] * 100,
                            [0, 1, 1, 0] * 100,
                            [0, 0, 0, 1] * 100] * 100)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(vectorized_text)

The vectorized_text is a one-hot encoded feature matrix of size n_samples x n_features. But when custom_metric is called, one of x or y turns out to be a real-valued vector while the other remains a one-hot vector; both should be one-hot vectors. This causes custom_metric to return wrong results at run time, so the clustering is not correct.

Example of x and y in the distance(x, y) method:

x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]

Both should have been one-hot vectors.

Does anyone have an idea of how to go about this situation?

user3480922

3 Answers


First of all, your distance is wrong.

Distances must return small values for similar vectors. You have defined a similarity, not a distance.

Secondly, naive Python code using constructs such as zip performs extremely poorly. Python just does not optimize such code well; it does all the work in the slow interpreter. Python speed is only okay if you vectorize everything. And in fact, this code can be vectorized trivially, and then it likely won't even matter whether your inputs are binary or float data. What you are computing, in a very complicated fashion, is nothing but the dot product of two vectors, isn't it?
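As a quick check of that claim (a sketch using the binary block vectors from the question):

import numpy as np

x = np.array([1, 0, 0, 1] * 100)
y = np.array([1, 1, 1, 0] * 100)

# the original loop: count positions where both entries are 1
match_count = sum(1. for xi, yi in zip(x, y) if float(xi) == 1. and xi == yi)

print(match_count, np.dot(x, y))  # prints 100.0 100 - the loop is just a dot product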

Thus, your distance should probably look like this:

def distance(x, y):
    # dimensionality minus the number of shared 1s (the dot product)
    return x.shape[0] - np.dot(x, y)

Or whatever distance transformation you intend to use.

Now for your actual problem: my guess is that sklearn tries to accelerate your distance with a ball tree. That won't help much, because of the poor performance of Python interpreter callbacks (in fact, you should probably precompute the entire distance matrix in one vectorized operation - something like dist = dim - X.dot(X.transpose())? Do the math yourself to figure out the exact equation). Other languages such as Java (e.g., the ELKI tool) are much better suited to being extended this way, because the HotSpot JIT compiler can optimize and inline such calls everywhere.

To test the hypothesis that the sklearn ball tree is the cause of the odd values you are observing, try setting algorithm="brute" (see the documentation) to disable the ball tree. But in the end, you'll want to either precompute the entire distance matrix (if you can afford the O(n²) cost) or switch to a different programming language (implementing your distance in Cython, for example, helps, but even there you will likely see the data suddenly arriving as numpy float arrays).
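A minimal sketch of the precomputed route, assuming the dim-minus-dot-product distance above (eps=3 is just the placeholder from the question; it has to be re-tuned for the new distance scale):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.stack([[1, 0, 0, 1] * 100,
              [1, 1, 1, 0] * 100,
              [0, 1, 1, 0] * 100,
              [0, 0, 0, 1] * 100] * 100)

# the entire distance matrix in one vectorized operation: dim minus shared 1s
dist = X.shape[1] - X.dot(X.T)

# DBSCAN accepts a precomputed n_samples x n_samples distance matrix
dbscan = DBSCAN(min_samples=2, eps=3, metric="precomputed").fit(dist)

# to test the ball-tree hypothesis instead, keep the callable metric and
# force brute force: DBSCAN(..., metric=custom_metric, algorithm="brute")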

Has QUIT--Anony-Mousse
  • Thanks @Anony-Mousse for the comment. One question: how do I know what is a `small` value and what is not? Is there any particular definition as per a standard? – user3480922 Sep 23 '19 at 05:11
  • Identical vectors must return a distance of 0, by the definition of a distance... And maximally different vectors should have the largest distance, much larger than 0 - for example 1, or 1000. The point is that your function is completely off: you return distance 0 for vectors with nothing in common. Hence it is not a distance. – Has QUIT--Anony-Mousse Sep 23 '19 at 05:45
  • You can use GeneralizedDBSCAN with such a similarity function, but that is not in sklearn. The only implementation of DBSCAN for similarities that I know of is in ELKI: [SimilarityNeighborPredicate](https://elki-project.github.io/releases/current/doc/de/lmu/ifi/dbs/elki/algorithm/clustering/gdbscan/SimilarityNeighborPredicate.html) – Has QUIT--Anony-Mousse Sep 23 '19 at 05:51
1

I don't get your question. If I have:

x = [1, 0, 1]
y = [0, 0, 1]

and I use:

def distance(x, y):
    # print(x, y) -> these x and y aren't one-hot vectors, which is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

print(distance(x, y))
 1.0

and if, on top of that, you print x and y now:

x
[1, 0, 1]
y
[0, 0, 1]

so it is working?

PV8
  • That is exactly the question: when using this custom metric, the vectors x and y are not as expected when distance is called in the clustering pipeline. A similar question: https://stackoverflow.com/questions/41863635/sklearn-kneighborsregressor-custom-distance-metrics – user3480922 Sep 20 '19 at 08:29
  • @user3480922 what do you mean "*the vectors x and y are not as per expectation*"? x and y are the **inputs** to the function, and they are not *defined* by it; your question is completely unclear – desertnaut Sep 21 '19 at 02:28
  • @desertnaut I understand your confusion, but if you try to reproduce the error, you might be able to see the discrepancy. – user3480922 Sep 23 '19 at 05:12

I reproduced your code and I do get your error. Let me explain it in more detail:

The asker has a vectorized_text variable (an np.stack) which simulates a one-hot encoded feature set (it contains only 0s and 1s). In the DBSCAN model, he uses a custom_metric function to calculate the distance. The expectation is that, when the model is run, the custom metric function takes pairs of observations as they are: one-hot encoded values. But instead, when printing those values inside the distance function, only one of them arrives as it is, while the other one appears to be a list of real values, as he described in the question:

x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]

Anyway, when I pass plain Python lists to the fit parameter, the function receives the values as they are:

from sklearn.cluster import KMeans, DBSCAN, MeanShift

x = [1, 0, 1]
y = [0, 0, 1]
feature_set = [x*5]*5
def distance(x, y):
    # Print the values here; they should be 0s and 1s
    print(x, y)
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(.,.) calculates count of elements when both xi and yi are True
    return distance(x, y)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(feature_set)

Result:

[1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.] ... [1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.]
[1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.] ... [1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1.]

I suggest you use a pandas DataFrame or some other input type and see if it works.
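A minimal sketch of that suggestion, reusing vectorized_text and custom_metric from the question (whether the wrapping changes anything is exactly what you would be testing):

import pandas as pd
from sklearn.cluster import DBSCAN

# hypothetical workaround: wrap the one-hot matrix in a DataFrame before fitting
feature_df = pd.DataFrame(vectorized_text)
dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3).fit(feature_df)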

GusSL