sklearn actually does show an example of this using DBSCAN, just as Luke once answered here.
This is based on that example, after installing the library with !pip install python-Levenshtein.
If you have already pre-calculated all distances, you could swap the custom metric for a simple lookup, as shown further below.
from Levenshtein import distance
import numpy as np
from sklearn.cluster import dbscan
data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
def lev_metric(x, y):
    # x and y are rows of X, each holding a single index into data
    i, j = int(x[0]), int(y[0])  # extract indices
    return distance(data[i], data[j])
X = np.arange(len(data)).reshape(-1, 1)
dbscan(X, metric=lev_metric, eps=5, min_samples=2)
And if you have pre-calculated the distances, you could define pre_lev_metric(x, y) along the lines of:
def pre_lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return DISTANCES[i, j]
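DISTANCES is not defined anywhere above, so here is a minimal sketch of how such a matrix could be built. The helper names levenshtein and pairwise_distances are made up for this illustration, and the pure-Python levenshtein is only a stand-in for Levenshtein.distance so the snippet runs without extra packages.

```python
def levenshtein(a, b):
    """Edit distance via the classic row-by-row dynamic programme.

    Hypothetical stand-in for Levenshtein.distance.
    """
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def pairwise_distances(items, metric):
    """Return a symmetric n x n list of lists with metric(items[i], items[j])."""
    n = len(items)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The metric is symmetric, so compute each pair only once.
            matrix[i][j] = matrix[j][i] = metric(items[i], items[j])
    return matrix


data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
DISTANCES = pairwise_distances(data, levenshtein)
```

Wrapping the result as np.array(DISTANCES) makes the DISTANCES[i, j] indexing in pre_lev_metric work. The same array should also be usable directly, without any custom metric, as dbscan(np.array(DISTANCES), metric="precomputed", eps=5, min_samples=2).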
Alternative answer based on K-Medoids, using sklearn_extra.cluster.KMedoids. K-Medoids is not that well known yet, but like DBSCAN it only needs pairwise distances.
I had to install it like this:
!pip uninstall -y enum34
!pip install scikit-learn-extra
Then I was able to create clusters with:
from sklearn_extra.cluster import KMedoids
import numpy as np
from Levenshtein import distance
data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return distance(data[i], data[j])
X = np.arange(len(data)).reshape(-1, 1)
kmedoids = KMedoids(n_clusters=2, random_state=0, metric=lev_metric).fit(X)
The labels and centers are in:
kmedoids.labels_
kmedoids.cluster_centers_
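Because X only holds indices, both attributes refer to positions in data and not to the strings themselves. Here is a small sketch of mapping labels back to strings. The helper name group_by_label is hypothetical, and example_labels is made up for illustration; it is not actual output of the fit above.

```python
from collections import defaultdict

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]


def group_by_label(items, labels):
    """Collect items into a dict keyed by their cluster label."""
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[int(label)].append(item)
    return dict(groups)


# Illustrative labels only; in practice pass kmedoids.labels_ instead.
example_labels = [0, 0, 1]
clusters = group_by_label(data, example_labels)
```

With the fitted model this would be group_by_label(data, kmedoids.labels_). Assuming cluster_centers_ holds the selected rows of X, the medoid strings should be [data[int(c[0])] for c in kmedoids.cluster_centers_].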