Kmeans clustering of text data with percentage match

Question

I have hundreds of large strings and want to cluster them into groups. I found k-means as one way to do this, but it takes only the number of clusters as an argument. My requirement is instead to take the percentage match between strings as an argument and to put strings in the same cluster only when they meet or exceed that threshold. For example, strings 1 and 2 should share a cluster only if they match by more than 90%. Strings that do not match any other string can go into single-element clusters. Is there a way to do this in R, Python, or any other language?

You may need to look at a different clustering technique, such as hierarchical clustering (hclust), and then use cutree at the desired threshold. – Dave2e, Apr 07 '16 at 12:11
Thanks, Dave. Any pointers to some good R libraries for this? – G Narayan, Apr 07 '16 at 16:37
Maybe this question will help: http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters. Also, search for "clustering with R" for some additional resources. – Dave2e, Apr 07 '16 at 17:30

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

Clustering algorithm

k-means

As its name suggests, k-means tries to build k clusters, using the mean of all values in a cluster as that cluster's center. You assign each element to its closest center, move each center to the mean of the elements assigned to it, and repeat until the assignments no longer change. As you can see, all you need to define is the number of centers (and their starting points, but these are often randomized and the whole procedure repeated many times).
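
To make the loop above concrete, here is a minimal sketch of k-means in plain Python. It assumes the elements are numeric tuples and uses squared Euclidean distance; the function name `kmeans` and its parameters are only illustrative and do not come from any library.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Illustrative k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random starting centers
    assignment = None
    for _ in range(iterations):
        # Assign each point to its closest center (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the centers are stable
        assignment = new_assignment
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ended up empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centers
```

Note that the only real input is `k`: there is no place to say "these two elements must be at least 90% similar", which is why k-means does not fit the requirement in the question.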

Your classification

What you want is to cluster words that are very similar to one another based on a threshold. You can always do that by computing the distance between elements (the distance reflecting how dissimilar two elements are, for example 1 minus the match fraction). The pseudo-code for that would be:

1) initialize cluster with first word
2) add all words to cluster that are "close enough" to this word
3) pick a word that has not been clustered yet, and initialize a new cluster with it
4) add all words "close enough" to this word
5) repeat 3 and 4 until all words are used
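
The steps above can be sketched in Python as follows. This is only one possible reading of the pseudo-code: it assumes that "percentage match" can be approximated by `difflib.SequenceMatcher.ratio()` from the standard library, and the names `similarity` and `threshold_clusters` are made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a match ratio in [0, 1]; 1.0 means identical strings."""
    # autojunk=False turns off a speed heuristic that can distort the
    # ratio when the strings are 200 characters or longer.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def threshold_clusters(strings, threshold=0.9):
    """Greedy clustering: each cluster is built around one seed string."""
    unassigned = list(strings)
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)   # steps 1 and 3: start from an unused string
        cluster = [seed]
        remaining = []
        for s in unassigned:       # steps 2 and 4: collect everything close to the seed
            if similarity(seed, s) >= threshold:
                cluster.append(s)
            else:
                remaining.append(s)
        unassigned = remaining     # step 5: repeat with what is left
        clusters.append(cluster)
    return clusters

texts = ["the quick brown fox", "the quick brown fax", "completely different text"]
print(threshold_clusters(texts, threshold=0.9))
# [['the quick brown fox', 'the quick brown fax'], ['completely different text']]
```

Two caveats with this sketch: members are compared only with the seed of their cluster, so the result depends on the order of the input, and `ratio()` can be slow on long strings. `SequenceMatcher` also offers `quick_ratio()` and `real_quick_ratio()`, which return upper bounds on `ratio()` and could be used to skip pairs that cannot reach the threshold.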

answered Apr 07 '16 at 12:08

DeveauP

Thanks a lot, this really helps. Are there any good built-in libraries or functions that detect this "close enough" match given a percentage? – G Narayan Apr 07 '16 at 16:36
I do not know of a package that does exactly this, but it is very similar to hierarchical clustering (the hclust function in R). The function for which I wrote the pseudo-code should not take long to implement. – DeveauP Apr 08 '16 at 11:41
I have implemented this in Python, but it is taking a very long time. I will post that in a separate question. – G Narayan Apr 17 '16 at 14:19
