I'm trying to cluster a column containing 40,000 rows of string data into an unknown number of clusters based on string similarity. DBSCAN seems like the appropriate algorithm, since I don't know the intended number of clusters in advance or which words to compare, and Levenshtein distance is a suitable metric for string similarity. My code is below.
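For reference, the metric itself is straightforward; pylev works directly on raw strings. As far as I understand, scikit-learn only passes numeric arrays to a custom metric, which is why I wrap it in the index-based lev_metric further down.

import pylev
# Plain Levenshtein edit distance between two strings:
print(pylev.levenshtein('kitten', 'sitting'))  # -> 3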
I get a memory error when running the algorithm: my RAM usage blows up. Scalability seems to be the problem with DBSCAN: running it on just 10 strings is already problematic for my laptop, which has 7-8 GB of RAM. Some suggest a BallTree index as a solution; as you can see in the code below, I tried that, but I hit the same memory problem.
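A rough back-of-the-envelope estimate of why the memory blows up (my own assumption: a dense pairwise distance matrix gets materialized somewhere along the way):

n = 40000
# n * n pairwise distances stored as float64, 8 bytes each:
print(n * n * 8 / 1e9)  # ~12.8 GB, far more RAM than my laptop has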
I've seen similar problems in other posts. There are variations of DBSCAN, such as NG-DBSCAN and DBSCAN-Multiplex, but I can't find a way to implement these methods. Another proposed solution is to use ELKI in Java, but I'd like to stay in Python.
I need help finding a workaround for the memory issue, so that I can run DBSCAN and cluster the 40k strings in my data file.
This is the code I wrote:
import numpy as np
import pylev
from sklearn.cluster import dbscan

df4 = df1[['Omschrijving2']].copy()
data = df4['Omschrijving2'].values  # 1-D array of strings

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices into the string array
    return pylev.levenshtein(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)
core_samples, labels = dbscan(X, eps=2, min_samples=50, metric=lev_metric,
                              metric_params=None, algorithm="auto",
                              leaf_size=100, sample_weight=None, n_jobs=1)
And this is the BallTree attempt:
from sklearn.neighbors import BallTree

eps = 2
tree = BallTree(X, leaf_size=30, metric=lev_metric)
neighbors = tree.query_radius(X, r=eps)  # arrays of indices within eps of each point
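One direction I'm considering as a workaround (a rough sketch of my own idea, untested at the full 40k scale): precompute only the pairs within eps into a sparse matrix and run DBSCAN with metric='precomputed', which accepts sparse distance matrices. Memory should stay bounded, but the pure-Python double loop is still quadratic in time, so I don't know whether this is viable:

import pylev
from scipy.sparse import lil_matrix
from sklearn.cluster import DBSCAN

eps = 2
n = len(data)  # data is the 1-D array of strings from above

# Store only distances <= eps; missing entries are treated as non-neighbours.
D = lil_matrix((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = pylev.levenshtein(data[i], data[j])
        if 0 < d <= eps:   # caveat: d == 0 (exact duplicates) can't be stored
            D[i, j] = d    # as a sparse entry, so I'd deduplicate the strings
            D[j, i] = d    # beforehand

labels = DBSCAN(eps=eps, min_samples=50, metric='precomputed').fit_predict(D.tocsr())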