
I have been trying to cluster multiple datasets of URLs (around 1 million each) to find, for each URL, the original and its typo variants. I decided to use Levenshtein distance as the similarity metric, along with DBSCAN as the clustering algorithm, since k-means won't work here: I do not know the number of clusters.

I am facing some problems with scikit-learn's implementation of DBSCAN.

The snippet below works on small datasets in the format I am using, but since it precomputes the entire distance matrix, it takes O(n^2) space and time, which is far too much for my large datasets (a 10^6 x 10^6 matrix of 8-byte floats alone would need about 8 TB). I ran it for many hours, but it just ends up consuming all the memory on my PC.

import numpy as np
import sklearn.cluster
import distance  # the "distance" package from PyPI provides distance.levenshtein

# words is a list of records whose first field is the URL string
lev_distance = np.array([[distance.levenshtein(w1[0], w2[0]) for w1 in words]
                         for w2 in words])
dbscan = sklearn.cluster.DBSCAN(eps=7, min_samples=1, metric="precomputed")
dbscan.fit(lev_distance)

So I figured I needed some way to compute the distances on the fly, and hence tried this method:

dbscan = sklearn.cluster.DBSCAN(eps=7, min_samples=1, metric=distance.levenshtein)
dbscan.fit(words)

But this method ends up giving me an error:

ValueError: could not convert string to float: URL

I realize this means that it is trying to convert the inputs to the similarity function into floats, but I don't want it to do that. As far as I understand, it just needs a function that takes two arguments and returns a float that it can compare to eps, which the Levenshtein distance does.

I am stuck at this point: I do not know the implementation details of sklearn's DBSCAN well enough to find out why it tries to convert the input to float, nor do I have any better idea of how to avoid the O(n^2) matrix computation.

Please let me know if there is any better or faster way to cluster this many strings that I may have overlooked.

KaziJehangir

2 Answers


From the scikit-learn FAQ, you can do this by making a custom metric:

from leven import levenshtein
import numpy as np
from sklearn.cluster import dbscan

data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return levenshtein(data[i], data[j])

# cluster the indices 0..n-1; the metric looks up the actual strings
X = np.arange(len(data)).reshape(-1, 1)
dbscan(X, metric=lev_metric, eps=5, min_samples=2)
Luke
    What does the dbscan method call return? More specifically, I ran this snippet in the Python shell and got a tuple of arrays (array([0, 1]), array([ 0, 0, -1])) and I am wondering what this represents. – Sticky Mar 29 '17 at 02:01
  • So DBSCAN and dbscan are actually two different imports, which threw me off for a while reading docs. The latter, used here, returns a tuple of `(core_samples, labels)`. `labels` is probably what we're interested in. `-1` indicates a "noisy" sample which I believe means "not part of a cluster". Any other number indicates the cluster it's in. So to take this example, the first two elements of `data` belong together. The last element belongs to nothing. – Moo Oct 21 '22 at 15:31
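Picking up on the comments above, here is a minimal sketch (reusing `data`, `lev_metric`, and `X` from the snippet) of what the two forms return: the lowercase `dbscan` function gives back a `(core_samples, labels)` tuple, while the `DBSCAN` estimator class exposes the same information as attributes after fitting.

from sklearn.cluster import DBSCAN, dbscan

# Functional form: returns (core_sample_indices, labels)
core_samples, labels = dbscan(X, metric=lev_metric, eps=5, min_samples=2)
# labels -> array([ 0,  0, -1]): the first two strings form cluster 0,
# and -1 marks the last string as noise (not part of any cluster)

# Estimator form: same clustering, accessed via attributes
db = DBSCAN(metric=lev_metric, eps=5, min_samples=2).fit(X)
print(db.labels_)                # array([ 0,  0, -1])
print(db.core_sample_indices_)   # array([0, 1])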

Try ELKI instead of sklearn.

It is the only tool I know of that allows index-accelerated DBSCAN with any metric.

It includes Levenshtein distance. You need to add an index to your database with -db.index; I always use the cover tree index (you need to choose the same distance for the index and for the algorithm, of course!).

You could use "pyfunc" distances and ball trees in sklearn, but performance was really bad because of the interpreter. Also, DBSCAN in sklearn is much more memory-intensive.
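For reference, the "pyfunc"-style route in sklearn that this answer alludes to can be sketched as follows, reusing the index-encoding trick from the first answer (the `leven` package and the toy `words` list here are illustrative assumptions). As noted above, expect it to be slow, since every distance computation calls back into the Python interpreter.

import numpy as np
from leven import levenshtein
from sklearn.neighbors import BallTree

words = ["http://example.com", "http://examp1e.com", "http://other.org"]

def lev_metric(x, y):
    # the tree stores indices; look up the actual strings
    i, j = int(x[0]), int(y[0])
    return levenshtein(words[i], words[j])

# build a ball tree over the indices; the Python callable is wrapped
# internally as a "pyfunc" distance
X = np.arange(len(words)).reshape(-1, 1)
tree = BallTree(X, metric=lev_metric)

# indices of all points within Levenshtein distance 3 of each point
neighbors = tree.query_radius(X, r=3)

The same callable can also be passed as the metric to sklearn's DBSCAN with algorithm="ball_tree", which should use the same tree machinery internally.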

Has QUIT--Anony-Mousse
  • I tried ELKI, but I got stuck on its input format. I can't find much information on its website. It would be great if you could point me in the right direction or give a link to a complete end-to-end tutorial about ELKI's DBSCAN. Thanks. – KaziJehangir Aug 02 '16 at 22:13
  • There are multiple parsers. Use the JavaDoc; the input formats are explained there. – Has QUIT--Anony-Mousse Aug 03 '16 at 08:11