
I have a question about clustering words into a number of groups/clusters based on their similarity. The similarity here is semantic similarity, gauged using the WordNet lexical database. After extracting the words and calculating their pairwise semantic similarities, I get an n*n symmetric matrix of the form:

    A     B     C     D
A   1.0   0.2   0.5   0.0
B   0.2   1.0   0.0   0.3
C   0.5   0.0   1.0   0.8
D   0.0   0.3   0.8   1.0

The matrix is constructed from about 10,000 words extracted from a large-scale dataset.
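(For concreteness, a matrix like this can be built with NLTK's WordNet interface; the sketch below is only an illustration, using Wu-Palmer similarity as a stand-in for whatever measure was actually applied:)

import numpy
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def wordnet_similarity(w1, w2):
    '''Max Wu-Palmer similarity over the words' synset pairs, 0.0 if none.'''
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

words = ['car', 'automobile', 'banana', 'fruit']  # hypothetical sample words
matrix = numpy.array([[wordnet_similarity(a, b) for b in words] for a in words])
print(matrix)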

My question (a simple one, owing to my poor knowledge of clustering) is: what is an appropriate clustering technique for this purpose? Is there any Java tool to do this, or even a toolkit within Excel that I can use directly? A tutorial? Any help is sincerely appreciated.

– sam_salem

2 Answers


There are a variety of hierarchical clustering techniques, most of which would probably be appropriate for your problem. The two main types are bottom-up (agglomerative) and top-down (divisive). In general, bottom-up is easier to implement and so is used more frequently, while top-down can be implemented in a more efficient way. Unless speed is a major issue for you, you're probably better off going with bottom-up, especially since you're new to this. There are certain circumstances where one approach may outperform the other, but that tends to be pretty dependent on your exact data.

The primary other feature that distinguishes different algorithms is how the distance between clusters is calculated. Note that most clustering algorithms use distance, whereas you have similarity values. Distance and similarity are essentially inverses of each other, so you can get distance with 1/(similarity+1), or, because your similarities are all less than or equal to 1, you could compute distance as 1 - similarity. The most commonly used techniques for calculating distances between clusters are:

  • "Average link" - The distance between the centers of the two clusters. This can be calculated by taking the average of the distances between every pair of members of the two groups.
  • "Single link" - The distance between the two closest members of the two clusters
  • "Complete link" - The distance between the two most distant members of the two clusters.

UPGMA (which uses average linkage) is a classic example of such an algorithm, and there are a variety of Java implementations floating around if you search Google (I can't comment on which are better or worse, as I don't really know Java). Multidendrograms and HAC are both Java implementations of more general hierarchical agglomerative clustering.
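For concreteness, here is a minimal sketch of the approach in Python with SciPy (not one of the Java tools above, just to make the steps runnable): it converts similarities to distances with 1 - similarity and applies average-link (UPGMA) agglomerative clustering, with the question's 4x4 matrix standing in for the real 10,000-word matrix.

import numpy
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Stand-in data: the 4x4 similarity matrix from the question.
words = ['A', 'B', 'C', 'D']
similarity = numpy.array([
    [1.0, 0.2, 0.5, 0.0],
    [0.2, 1.0, 0.0, 0.3],
    [0.5, 0.0, 1.0, 0.8],
    [0.0, 0.3, 0.8, 1.0],
])

# All similarities are <= 1, so 1 - similarity is a valid distance.
distance = 1.0 - similarity

# SciPy expects the condensed (upper-triangle) form of a square distance matrix.
condensed = squareform(distance, checks=False)

# method='average' is the UPGMA criterion; 'single' and 'complete' match the other two above.
Z = linkage(condensed, method='average')

# Cut the tree wherever suits the data; here, merge clusters closer than 0.5.
print(dict(zip(words, fcluster(Z, t=0.5, criterion='distance'))))

Changing the method argument to 'single' or 'complete' gives the other two linkage criteria. Note that for 10,000 words the condensed distance array holds roughly 50 million entries, so memory is worth keeping in mind.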

– seaotternerd

K-means produces really good clusters for co-occurrence matrices such as the one you have there. I put together a quick blog post on clustering semantically similar words using k-means, but here's the quick code:

from sklearn.cluster import KMeans
from numbers import Number
import sys, codecs, numpy

class autovivify_list(dict):
    '''Pickleable class to replicate the functionality of collections.defaultdict'''
    def __missing__(self, key):
        value = self[key] = []
        return value

    def __add__(self, x):
        '''Override addition for numeric types when self is empty'''
        if not self and isinstance(x, Number):
            return x
        raise ValueError

    def __sub__(self, x):
        '''Also provide subtraction for numeric types when self is empty'''
        if not self and isinstance(x, Number):
            return -1 * x
        raise ValueError

def build_word_vector_matrix(vector_file, n_words):
    '''Read the first n_words rows of the GloVe file named in sys.argv[1] and return the vectors and labels as arrays'''
    numpy_arrays = []
    labels_array = []
    with codecs.open(vector_file, 'r', 'utf-8') as f:
        for c, r in enumerate(f):
            sr = r.split()
            labels_array.append(sr[0])      # first field on each line is the word
            numpy_arrays.append(numpy.array([float(i) for i in sr[1:]]))
            if c == n_words - 1:            # stop once n_words rows have been read
                break
    return numpy.array(numpy_arrays), labels_array

def find_word_clusters(labels_array, cluster_labels):
    '''Map each cluster label to the list of words assigned to that cluster'''
    cluster_to_words = autovivify_list()
    for c, i in enumerate(cluster_labels):
        cluster_to_words[i].append(labels_array[c])
    return cluster_to_words

if __name__ == "__main__":
    input_vector_file = sys.argv[1]                       # path to the GloVe vector file
    n_words           = int(sys.argv[2])                  # number of words to cluster
    reduction_factor  = float(sys.argv[3])                # cluster count as a fraction of n_words
    clusters_to_make  = int(n_words * reduction_factor)
    df, labels_array  = build_word_vector_matrix(input_vector_file, n_words)
    kmeans_model      = KMeans(init='k-means++', n_clusters=clusters_to_make, n_init=10)
    kmeans_model.fit(df)

    cluster_labels    = kmeans_model.labels_
    cluster_inertia   = kmeans_model.inertia_
    cluster_to_words  = find_word_clusters(labels_array, cluster_labels)

    for c in cluster_to_words:
        print(cluster_to_words[c])

If you save this script as cluster_vectors.py, you can run:

wget http://www-nlp.stanford.edu/data/glove.6B.300d.txt.gz
gunzip glove.6B.300d.txt.gz
python cluster_vectors.py glove.6B.300d.txt 10000 .1

This reads the first 10,000 lines of the GloVe word vectors (semantic word vectors inferred from term co-occurrence) and clusters those words into 10000 * .1 = 1000 clusters. The clusters will look something like this:

['Chicago', 'Boston', 'Houston', 'Atlanta', 'Dallas', 'Denver', 'Philadelphia', 'Baltimore', 'Cleveland', 'Pittsburgh', 'Buffalo', 'Cincinnati', 'Louisville', 'Milwaukee', 'Memphis', 'Indianapolis', 'Auburn', 'Dame']

['Product', 'Products', 'Shipping', 'Brand', 'Customer', 'Items', 'Retail', 'Manufacturer', 'Supply', 'Cart', 'SKU', 'Hardware', 'OEM', 'Warranty', 'Brands']

['home', 'house', 'homes', 'houses', 'housing', 'offices', 'household', 'acres', 'residence']

[...]

['Night', 'Disney', 'Magic', 'Dream', 'Ultimate', 'Fantasy', 'Theme', 'Adventure', 'Cruise', 'Potter', 'Angels', 'Adventures', 'Dreams', 'Wonder', 'Romance', 'Mystery', 'Quest', 'Sonic', 'Nights']
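If you want to apply the same idea to a precomputed similarity matrix like the one in the question, rather than to GloVe vectors, one common workaround is to treat each row of the matrix as that word's feature vector. A minimal sketch, using the question's 4x4 matrix as stand-in data (the choice of 2 clusters is arbitrary):

import numpy
from sklearn.cluster import KMeans

# Stand-in data: the 4x4 similarity matrix from the question.
words = ['A', 'B', 'C', 'D']
similarity = numpy.array([
    [1.0, 0.2, 0.5, 0.0],
    [0.2, 1.0, 0.0, 0.3],
    [0.5, 0.0, 1.0, 0.8],
    [0.0, 0.3, 0.8, 1.0],
])

# Each row is a word's similarity profile over the whole vocabulary;
# k-means can cluster those profiles like any other feature vectors.
kmeans = KMeans(init='k-means++', n_clusters=2, n_init=10)
labels = kmeans.fit_predict(similarity)

for word, label in zip(words, labels):
    print(word, label)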

I hope this helps!

– duhaime
    *"K nearest neighbors produces really good clusters for co-occurrence matrices such as the one you have there."* I think you meant K-means, not K-nearest neighbours? – Stef Apr 25 '22 at 15:28