
I'm trying to cluster a column containing 40,000 rows of string data into some number x of clusters based on string similarity. DBSCAN seems like the appropriate algorithm, since I know neither the intended number of clusters nor the words to compare, and Levenshtein is the most suitable distance metric for clustering on string similarity. My code is below.
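For reference, this is the pairwise distance I mean; pylev computes the edit distance between two raw strings:

import pylev

# number of single-character insertions, deletions, or substitutions
# needed to turn one string into the other
print(pylev.levenshtein("kitten", "sitting"))   # 3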

I get a memory error when running the algorithm: my RAM fills up completely. Scalability seems to be the problem with DBSCAN: running it on just 10 strings is already problematic for my laptop, which uses over 7 of its 8 GB of RAM for this calculation. Some suggest a BallTree index as a solution; as you can see in the code below, I tried that, but I ran into the same memory problem.
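A quick back-of-the-envelope check shows why the RAM fills up if a full pairwise distance matrix gets materialized (my assumption about what happens under the hood):

n = 40000
# a dense float64 pairwise distance matrix needs n * n * 8 bytes
print(n * n * 8 / 1e9)   # 12.8 GB, more than the 8 GB this laptop has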

I've seen similar problems in other posts. There are variations of DBSCAN, namely NG-DBSCAN and DBSCAN-Multiplex, but I can't find a way to implement those methods. Another proposed solution is to use ELKI in Java, but I hope to arrive at a solution in Python.

I need help finding a workaround for the memory issue, so that I can run DBSCAN and cluster the 40,000 strings in my data file.

This is the code I wrote:

import numpy as np
import pylev
from sklearn.cluster import dbscan

df4 = df1[['Omschrijving2']].copy()
data = df4['Omschrijving2'].values   # 1-D array of strings

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # x and y are 1-element index arrays
    return pylev.levenshtein(data[i], data[j])

# cluster on row indices so the metric can look the actual strings up
X = np.arange(len(data)).reshape(-1, 1)

core_samples, labels = dbscan(X, eps=2, min_samples=50, metric=lev_metric,
                              metric_params=None, algorithm='auto',
                              leaf_size=100, sample_weight=None, n_jobs=1)

Or, with a BallTree index:

from sklearn.neighbors import BallTree

eps = 2
tree = BallTree(X, leaf_size=30, metric=lev_metric)
neighborhoods = tree.query_radius(X, r=eps)   # neighbor indices per point, within eps
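One workaround I'm considering (a sketch, not tested at full scale; it assumes scikit-learn's DBSCAN accepts a sparse precomputed distance matrix, where entries that are absent count as "not neighbors"): compute the Levenshtein distances pair by pair myself, keep only the pairs within eps in a scipy sparse matrix, and run DBSCAN on that with metric='precomputed'. Memory then scales with the number of close pairs instead of with 40,000 squared, although the double loop still costs O(n^2) metric evaluations in time:

import pylev
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

eps = 2
strings = df4['Omschrijving2'].tolist()   # reuses df4 from above
n = len(strings)

rows, cols, vals = [], [], []
for i in range(n):
    for j in range(i + 1, n):
        # the edit distance is at least the length difference, so skip cheaply
        if abs(len(strings[i]) - len(strings[j])) > eps:
            continue
        d = pylev.levenshtein(strings[i], strings[j])
        if d <= eps:
            d = d or 1e-10  # keep an explicit entry even for identical strings
            rows += [i, j]; cols += [j, i]; vals += [d, d]

# sparse symmetric distance matrix holding only the within-eps pairs
dist = csr_matrix((vals, (rows, cols)), shape=(n, n))
labels = DBSCAN(eps=eps, min_samples=50, metric='precomputed').fit_predict(dist)

The pure-Python double loop over 40,000 strings would still be slow; blocking the strings on a cheap key (length buckets, first characters) before comparing would cut the pair count, but that is a speed problem rather than the memory problem I'm asking about.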