I'm trying to cluster a column containing 40,000 rows of string data into an unknown number of clusters based on string similarity. DBSCAN seems like the appropriate algorithm, since I don't know the intended number of clusters in advance or which words to compare, and Levenshtein distance is a suitable metric for string similarity. My code is below.
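For reference, the metric itself is straightforward; pylev works directly on raw strings. As far as I understand, scikit-learn only passes numeric arrays to a custom metric, which is why I wrap it in the index-based lev_metric further down.

import pylev
# Plain Levenshtein edit distance between two strings:
print(pylev.levenshtein('kitten', 'sitting'))  # -> 3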
I get a memory error when running the algorithm: my RAM usage blows up. Scalability seems to be the problem with DBSCAN: running it on just 10 strings is already problematic for my laptop, which has 7-8 GB of RAM. Some suggest a BallTree index as a solution; as you can see in the code below, I tried that, but I hit the same memory problem.
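A rough back-of-the-envelope estimate of why the memory blows up (my own assumption: a dense pairwise distance matrix gets materialized somewhere along the way):

n = 40000
# n * n pairwise distances stored as float64, 8 bytes each:
print(n * n * 8 / 1e9)  # ~12.8 GB, far more RAM than my laptop has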
I've seen similar problems in other posts. There are variations of DBSCAN, such as NG-DBSCAN and DBSCAN-Multiplex, but I can't find a way to implement these methods. Another proposed solution is to use ELKI in Java, but I'd like to stay in Python.
I need help finding a workaround for the memory issue, so that I can run DBSCAN and cluster the 40k strings in my data file.
This is the code I wrote:
import numpy as np
import pylev
from sklearn.cluster import dbscan

df4 = df1[['Omschrijving2']].copy()
data = df4['Omschrijving2'].values  # 1-D array of strings

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices into the string array
    return pylev.levenshtein(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)
core_samples, labels = dbscan(X, eps=2, min_samples=50, metric=lev_metric,
                              metric_params=None, algorithm="auto",
                              leaf_size=100, sample_weight=None, n_jobs=1)
And this is the BallTree attempt:
from sklearn.neighbors import BallTree

eps = 2
tree = BallTree(X, leaf_size=30, metric=lev_metric)
neighbors = tree.query_radius(X, r=eps)  # arrays of indices within eps of each point
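One direction I'm considering as a workaround (a rough sketch of my own idea, untested at the full 40k scale): precompute only the pairs within eps into a sparse matrix and run DBSCAN with metric='precomputed', which accepts sparse distance matrices. Memory should stay bounded, but the pure-Python double loop is still quadratic in time, so I don't know whether this is viable:

import pylev
from scipy.sparse import lil_matrix
from sklearn.cluster import DBSCAN

eps = 2
n = len(data)  # data is the 1-D array of strings from above

# Store only distances <= eps; missing entries are treated as non-neighbours.
D = lil_matrix((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = pylev.levenshtein(data[i], data[j])
        if 0 < d <= eps:   # caveat: d == 0 (exact duplicates) can't be stored
            D[i, j] = d    # as a sparse entry, so I'd deduplicate the strings
            D[j, i] = d    # beforehand

labels = DBSCAN(eps=eps, min_samples=50, metric='precomputed').fit_predict(D.tocsr())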