
I am currently trying to cluster a list of sequences based on their similarity using Python.

ex:

DFKLKSLFD

DLFKFKDLD

LDPELDKSL
...

The way I preprocess my data is by computing the pairwise distances, for example with the Levenshtein distance. After calculating all the pairwise distances and building the distance matrix, I want to use it as input for the clustering algorithm.
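To make this concrete, the preprocessing step looks roughly like this (a minimal sketch using the python-Levenshtein package; variable names are just illustrative):

import numpy as np
from Levenshtein import distance

seqs = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

# Square, symmetric matrix of pairwise Levenshtein distances
dist_matrix = np.array([[distance(a, b) for b in seqs] for a in seqs])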

I have already tried using Affinity Propagation, but its convergence is a bit unpredictable and I would like to get around this problem.

Does anyone have any suggestions regarding other suitable clustering algorithms for this case?

Thank you!!

mantunes
  • There's a whole list of algorithms at https://scikit-learn.org/stable/modules/clustering.html. Without stating your aim or intended outcome, I think it's really hard to provide a concrete answer here. – StupidWolf Mar 31 '21 at 08:43
  • These are some of the targets/conditions that I have: I do not know the number of clusters in advance; I want to get rid of outliers; and I want to cluster the sequences subject to a maximum distance (i.e. the distance between any pair within a cluster cannot exceed x). – mantunes Mar 31 '21 at 10:27

3 Answers


sklearn actually does show this use case with DBSCAN, just like Luke once answered here.

This is based on that example, using !pip install python-Levenshtein. But if you have pre-calculated all the distances, you can swap in a custom metric instead, as shown below.

from Levenshtein import distance

import numpy as np
from sklearn.cluster import dbscan

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return distance(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)

core_samples, labels = dbscan(X, metric=lev_metric, eps=5, min_samples=2)

And if you pre-calculated the distances, you could define pre_lev_metric(x, y) along the lines of:

def pre_lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return DISTANCES[i,j]
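If the full matrix fits in memory, you can also skip the index trick entirely and pass it to dbscan with metric="precomputed" (a sketch, assuming DISTANCES is the square numpy distance matrix):

core_samples, labels = dbscan(DISTANCES, metric="precomputed", eps=5, min_samples=2)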

An alternative answer, based on K-Medoids using sklearn_extra.cluster.KMedoids. K-Medoids is not that well known yet, but it also only needs distances.

I had to install it like this:

!pip uninstall -y enum34
!pip install scikit-learn-extra

Then I was able to create clusters with:

from sklearn_extra.cluster import KMedoids
import numpy as np
from Levenshtein import distance

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return distance(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)
kmedoids = KMedoids(n_clusters=2, random_state=0, metric=lev_metric).fit(X)

The labels and cluster centers are in:

kmedoids.labels_
kmedoids.cluster_centers_
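KMedoids also accepts a precomputed distance matrix, so if you already calculated all distances you can fit on it directly (a sketch, again assuming DISTANCES is the square distance matrix; with precomputed distances the medoids are exposed via medoid_indices_ rather than cluster_centers_):

kmedoids = KMedoids(n_clusters=2, random_state=0, metric="precomputed").fit(DISTANCES)
medoid_sequences = [data[i] for i in kmedoids.medoid_indices_]  # map medoid indices back to sequences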
Willem Hendriks
  • Thank you a lot for your answer. I managed to implement DBSCAN and the results so far are quite promising. I do have one doubt about the algorithm: given that it selects a random point to begin the clustering, and taking into account that I am passing it the pairwise distances as the basis for the clustering, how much can the final clustering outcome be influenced by the first random point it selects to start the process? – mantunes Apr 05 '21 at 07:55
  • Indeed, it can vary slightly. If that is a problem, search for deterministic variations of DBSCAN, but in practice it usually is not a problem. Clustering is usually done to get some summary of the data; one edge point belonging to cluster A or B doesn't impact the project too much. – Willem Hendriks Apr 06 '21 at 06:51

Try this.

import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
    
words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',') #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

Results:

 - *LDPELDKSL:* LDPELDKSL
 - *DFKLKSLFD:* DFKLKSLFD
 - *XYZ:* ABC, XYZ
 - *DLFKFKDLD:* DLFKFKDLD
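If the unpredictable convergence you mentioned is the issue, it may also help to increase damping and max_iter and to fix random_state (a sketch; the random_state parameter requires a reasonably recent scikit-learn version):

affprop = AffinityPropagation(affinity="precomputed", damping=0.9,
                              max_iter=1000, random_state=0)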
ASH
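This snippet assumes you have already vectorized the sequences (for example into a TF-IDF document-term matrix), fitted a KMeans model on it as kmeans, and have the vocabulary in words; it then prints the ten top terms per cluster centroid: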
common_words = kmeans.cluster_centers_.argsort()[:,-1:-11:-1]  # indices of the 10 largest features per centroid
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))