I am trying to implement a custom distance metric for clustering. The code snippet looks like this:
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MeanShift

def distance(x, y):
    # print(x, y) -> the x and y seen here aren't one-hot vectors, which is the source of this question
    match_count = 0.
    for xi, yi in zip(x, y):
        if float(xi) == 1. and xi == yi:
            match_count += 1
    return match_count

def custom_metric(x, y):
    # x, y are two vectors
    # distance(., .) counts the positions where both xi and yi are 1
    return distance(x, y)

vectorized_text = np.stack([[1, 0, 0, 1] * 100,
                            [1, 1, 1, 0] * 100,
                            [0, 1, 1, 0] * 100,
                            [0, 0, 0, 1] * 100] * 100)

dbscan = DBSCAN(min_samples=2, metric=custom_metric, eps=3, p=1).fit(vectorized_text)
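To show the intended behaviour of the metric, here is a quick sanity check outside of DBSCAN (the two toy vectors a and b below are hypothetical, only for illustration):

# Hypothetical toy vectors, only to illustrate what distance() is meant to count
a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])
print(distance(a, b))         # 2.0 -> positions 0 and 3 are 1 in both vectors
print(vectorized_text.shape)  # (400, 400), i.e. n_samples x n_features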
The vectorized_text is a one-hot encoded feature matrix of size n_samples x n_features. But when custom_metric is called, one of x or y turns out to be a real-valued vector while the other remains a one-hot vector. I expected both x and y to be one-hot vectors. This causes custom_metric to return wrong results at run time, and hence the clustering is not correct.
Example of x and y inside the distance(x, y) method:
x = [0.5 0.5 0.5 ... 0.5 0.5]
y = [0. 0. 0. 1. 0. 0. ... 1. 0.]
Both should have been one-hot vectors.
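To confirm that the problem happens inside the metric calls rather than in my data, I have been using a debugging variant like the sketch below (the binary check is my own addition, purely for inspection):

def custom_metric_debug(x, y):
    # Debugging sketch: report any call where an input is not strictly 0/1.
    if not np.isin(x, (0., 1.)).all() or not np.isin(y, (0., 1.)).all():
        print("non-binary input received:", x[:5], y[:5])
    return distance(x, y)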
Does anyone have an idea how to go about this situation?