
Possible Duplicate: [Python k-means algorithm](http://stackoverflow.com/questions/1545606/python-k-means-algorithm)

I want to cluster 10000 indexed points based on their feature vectors and get their IDs back after clustering, i.e. cluster1: [p1, p3, p100, ...], cluster2: [...], ...

Is there any way to do this in Python? Thanks!

P.S. The indexed points are stored in a 10000×10 matrix, where each row represents a point's feature vector.

  • Can you please add an example for people like us who don't know about clustering but know Python a bit :-) – Abhijit Dec 22 '11 at 17:40
  • Uh -- have you seen [this](http://stackoverflow.com/q/1545606/21475)? – Cameron Dec 22 '11 at 17:40
  • @Abhijit: See the [K-means algorithm](http://en.wikipedia.org/wiki/K-means_clustering). Basically, you have a bunch of points in n-dimensional space and you want to automatically find "clusters" of points (i.e. group them based on some similarity). K-means chooses K random starting points (seeds), then partitions all the points based on which seed is closest to each point. The centroid of each cluster is then calculated, and iterations ensue until the centroids stop moving. – Cameron Dec 22 '11 at 17:44
  • @Cameron My problem is not quite the same as the ["Python k-means algorithm"](http://stackoverflow.com/questions/1545606/python-k-means-algorithm) post: I also need to **get the point IDs** after clustering. So I would like some suggestions on how to associate the point IDs with the features. Thx~ – zhongqi Dec 23 '11 at 05:00
  • Thanks all. My solution would be: 1. run K-means and get the cluster centers; 2. recompute the distances between each point and the cluster centers, so that I can record the IDs associated with each cluster. – zhongqi Dec 23 '11 at 05:24
  • In fact, [scipy.cluster.vq.kmeans2](http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans2.html#scipy.cluster.vq.kmeans2) is exactly what I need (see the sketch just after these comments). – zhongqi Dec 23 '11 at 08:06
  • this is obviously *not* an exact duplicate of http://stackoverflow.com/questions/1545606/python-k-means-algorithm – K.-Michael Aye Dec 05 '12 at 19:48
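
For reference, a minimal sketch of the scipy.cluster.vq.kmeans2 approach mentioned in the comments above. The data here is random placeholder data in the question's 10000×10 shape, and k=70 follows the sqrt(n/2) rule of thumb used in the answer below but is otherwise arbitrary. kmeans2 returns one cluster label per row, which is exactly the point-ID bookkeeping the question asks for:

import numpy as np
from scipy.cluster.vq import kmeans2

# Placeholder data: 10000 points with 10 features each, as in the question.
data = np.random.rand(10000, 10)

# kmeans2 returns the k centroids and one cluster label per input row.
centroids, labels = kmeans2(data, 70)

# Group row indices (the point IDs) by cluster label.
clusters = {}
for point_id, label in enumerate(labels):
    clusters.setdefault(label, []).append(point_id)

print(clusters[0][:10])  # the first ten point IDs assigned to cluster 0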

1 Answer


Use some clustering algorithm - I've included an implementation of the K-means algorithm that @Cameron linked to in his second comment, but you might also want to look at the question linked in his first comment. I'm not sure what you mean by "get their IDs"; could you elaborate?

from math import sqrt
from random import shuffle # needed below for picking random initial cluster centers

def k_means(data_pts, k=None):
    """ Return k (x,y) pairs where:
            k = number of clusters
        and each
            (x,y) pair = centroid of cluster

        data_pts should be a list of (x,y) tuples, e.g.,
            data_pts=[ (0,0), (0,5), (1,3) ]
    """

    """ Helper functions """
    def lists_are_same(la, lb): # True if the two lists contain the same elements, ignoring order
        return set(la) == set(lb)
    def distance(a, b): # Euclidean distance between (x,y) points a and b
        return sqrt((a[0]-b[0])**2 + (a[1]-b[1])**2)
    def average(a): # return the average of a one-dimensional list (e.g., [1, 2, 3])
        return sum(a)/float(len(a))

    """ Set up some initial values """
    if k is None: # if the user didn't supply a number of means to look for, try to estimate how many there are
        n = len(data_pts)# number of points in the dataset
        k = int(sqrt(n/2))  # number of clusters - see
                        #   http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#Rule_of_thumb
    if k < 1: # make sure there's at least one cluster
        k = 1



    """ Randomly generate k clusters and determine the cluster centers,
        or directly generate k random points as cluster centers. """

    init_clusters = data_pts[:]         # put all of the data points into clusters
    shuffle(init_clusters)          # put the data points in random order
    init_clusters = init_clusters[0:k]  # only keep the first k random clusters

    old_clusters, new_clusters = {}, {} 
    for item in init_clusters:
        old_clusters[item] = [] # every cluster has a list of points associated with it; initially, it's empty

    while 1: # just keep going forever, until our break condition is met
        tmp = {}
        for center in old_clusters: # create an editable copy of the old center -> points mapping
            tmp[center] = []

        """ Associate each point with the closest cluster center. """
        for point in data_pts: # for each (x,y) data point
            min_clust = None
            min_dist = float('inf') # so the first center examined always becomes the current closest
            for pc in tmp: # for every possible closest cluster
                pc_dist = distance(point, pc)
                if pc_dist < min_dist: # if this cluster is the closest, have it be the closest (duh)
                    min_dist = pc_dist
                    min_clust = pc
            tmp[min_clust].append(point) # add each point to its closest cluster's list of associated points

        """ Recompute the new cluster centers. """
        for center in tmp:
            associated = tmp[center]
            if not associated: # no points were assigned to this center, so leave it where it is
                new_clusters[center] = []
                continue
            xs = [pt[0] for pt in associated] # build up a list of x's
            ys = [pt[1] for pt in associated] # build up a list of y's
            x = average(xs) # x coordinate of the new cluster center
            y = average(ys) # y coordinate of the new cluster center
            new_clusters[(x,y)] = associated # these are the points the center was built off of, they're *probably* still associated

        if lists_are_same(old_clusters.keys(), new_clusters.keys()): # if we've reached equilibrium, return the centers
            return list(old_clusters.keys())
        else: # otherwise, we'll go another round. let old_clusters = new_clusters, and clear new_clusters.
            old_clusters = new_clusters
            new_clusters = {}
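
A quick usage sketch of the function above; the sample points are invented for illustration:

if __name__ == '__main__':
    # Three loose groups of 2-D points.
    pts = [(0, 0), (1, 0), (0, 1),
           (10, 10), (11, 10), (10, 11),
           (0, 10), (1, 10), (0, 11)]
    print(k_means(pts, k=3))  # three centroids; with a lucky seed, roughly (0.33, 0.33), (10.33, 10.33), (0.33, 10.33)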
  • I have the points stored in a 10000×10 matrix, where 10000 is the number of points and 10 is the number of features. First we can do the K-means clustering using either your code or the implementation in the SciPy package; then I would like to find out, for each cluster, what the associated points are. Furthermore, I want to know not only those points' features, but also **which rows the points come from**. – zhongqi Dec 23 '11 at 05:10
  • OK, I would have to recompute the distances between all the points and the cluster centers after running K-means, and record the IDs (a sketch of this step follows these comments). – zhongqi Dec 23 '11 at 05:21
  • I think that you could also store your 'features' as extra data in the point tuple, i.e., an example data point tuple would be `(x_cord, y_cord, feature1, feature2, feature3, feature4, feature5, feature6, feature7, feature8, feature9, feature10)`. The only requirement for the point tuple is that the x-y coordinates be in indices 0 and 1, respectively. – Peter Downs Jan 07 '12 at 21:46
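
To make the ID bookkeeping from the comments concrete, here is a sketch of the recompute-distances step zhongqi describes: after k_means returns the centers, assign each point (identified by its row index) to its nearest center. The function and variable names here are invented for illustration:

from math import sqrt

def assign_ids_to_clusters(data_pts, centers):
    """ Map each cluster center to the row indices (point IDs) of its nearest points. """
    def distance(a, b):
        return sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)
    clusters = dict((center, []) for center in centers)
    for point_id, point in enumerate(data_pts):
        nearest = min(centers, key=lambda c: distance(point, c))
        clusters[nearest].append(point_id)
    return clusters

# e.g., continuing from the usage sketch above:
# centers = k_means(pts, k=3)
# assign_ids_to_clusters(pts, centers) -> {(0.33..., 0.33...): [0, 1, 2], ...}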