
I'm trying to do a K-means clustering of some dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0 to 23, so the distance algorithm thinks that 0 is very far from 23, because in absolute terms it is. In reality and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so it computes the 'real' time difference? I'm doing something simple, similar to the following:

from numpy import vstack
from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=2)
# data is a list of rows like [hour, feature2, feature3]; stack into a 2-D array
data = vstack(data)
fit = clusters.fit(data)
classes = fit.predict(data)

Data elements look something like [22, 418, 192], where the first element is the hour.

Any ideas?

leonsas

3 Answers


Even though @elyase's answer is accepted, I think it is not the correct approach.

Yes, to use such a distance you have to redefine your distance measure, and for that you need a different library. But what is more important, the concept of the mean used in k-means does not suit a cyclic dimension. Consider the following example:

# current cluster X, based on centroid position Xc=24
x1 = 1
x2 = 24

# current cluster Y, based on centroid position Yc=10
y1 = 12
y2 = 13

Computing the simple arithmetic mean places the centroids at Xc=12.5, Yc=12.5, which from the point of view of the cyclic measure is incorrect: it should be Xc=0.5, Yc=12.5. As you can see, assignment based on the cyclic distance measure is not "compatible" with the simple mean operation, and leads to bizarre results:

  • Simple k-means will result in clusters {x1,y1}, {x2,y2}
  • Simple k-means + a cyclic distance measure results in the degenerate super-cluster {x1,x2,y1,y2}
  • Correct clustering would be {x1,x2}, {y1,y2}

Solving this problem requires a per-point check: whether it is better to use the point directly in the "simple average" or to represent it as x'=x-24. Unfortunately, given n points this makes 2^n possibilities.
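
As a quick numeric sketch of that check, assuming hours on a 1-24 scale as in the example above:

import numpy as np

x = np.array([1, 24])  # cluster X from the example

# simple arithmetic mean: lands in the middle of the range
print(x.mean())               # 12.5

# cyclic handling: represent the wrapped point as x' = x - 24,
# then map the mean back into the [0, 24) range
x_wrapped = np.array([1, 24 - 24])
print(x_wrapped.mean() % 24)  # 0.5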

This looks like a use case for kernelized k-means, where you actually cluster in an abstract feature space (in your case, a "tube" rolled around the time dimension) induced by a kernel (a "similarity measure" that is the inner product of some vector space).
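
For intuition, here is a minimal sketch of such a cyclic kernel (my own illustration; the helper name hour_kernel is made up). The inner product of the embedding phi(h) = (sin(2*pi*h/24), cos(2*pi*h/24)) collapses to a cosine of the hour difference:

import numpy as np

# <phi(h1), phi(h2)> = cos(a1)cos(a2) + sin(a1)sin(a2) = cos(a1 - a2)
def hour_kernel(h1, h2):
    return np.cos(2 * np.pi * (h1 - h2) / 24)

print(hour_kernel(0, 23))  # ~0.966: hours 0 and 23 are very similar
print(hour_kernel(0, 12))  # -1.0: opposite ends of the cycle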

Details of kernel k-means are given here.

lejlot

Why k-means doesn't work with arbitrary distances

K-means is not a distance-based algorithm.

K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).

In order for Lloyd's algorithm to converge, you need both steps to optimize the same function:

  • the reassignment step
  • the centroid update step

Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.

If you plug in a random other distance function, as suggested by @elyase, k-means might no longer converge.

Proper solutions

There are various solutions to this:

  • Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
  • Transform the data into a kernel space where you are happy with minimizing Sum-of-Squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi), which may be okay for SSQ (see the sketch after this list).
  • Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which actually is just as old as k-means), and then try DBSCAN and the variants of it.
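
A minimal sketch of that kernel-space option (the sample rows are made up, shaped like the question's [22, 418, 192]):

import numpy as np
from sklearn.cluster import KMeans

# made-up rows of [hour, feature2, feature3]
data = np.array([[22, 418, 192],
                 [23, 400, 180],
                 [ 1, 410, 185],
                 [11, 100,  50]], dtype=float)

# replace the cyclic hour column with its sin/cos embedding so that
# Euclidean distance respects the wrap-around (hour 0 is near hour 23)
angles = data[:, 0] / 24 * 2 * np.pi
features = np.column_stack([np.sin(angles), np.cos(angles), data[:, 1:]])

labels = KMeans(n_clusters=2).fit_predict(features)
print(labels)

The sin/cos pair lives on [-1, 1], so in practice you would also scale the remaining columns to a comparable range before clustering.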
Has QUIT--Anony-Mousse
  • This all makes sense. I had the naive thought that using the cylindrical distance instead of the Euclidean could yield appropriate results. I will read more about your suggestions before accepting as the answer. Thanks! – leonsas Sep 09 '13 at 15:41

The easiest approach, to me, is to adapt the K-means algorithm to the wrap-around dimension by computing the "circular mean" for that dimension. Of course, you will also need to change the distance-to-centroid calculation accordingly (see the sketch after the code).

# compute the circular mean of hour 23 and hour 0
import numpy as np

hours = np.arange(24)
# map hours to angles on the unit circle
angles = hours / 24 * (2 * np.pi)

sin = np.sin(angles)
cos = np.cos(angles)

# mean direction of the two unit vectors for hours 23 and 0
a = np.arctan2(sin[23] + sin[0], cos[23] + cos[0])
if a < 0:
    a += 2 * np.pi

# angle back to hour
hour = a * 24 / (2 * np.pi)
print(hour)  # 23.5
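
The matching change to the distance-to-centroid calculation could look like this (a sketch of my own; the helper name circular_hour_distance is made up):

def circular_hour_distance(a, b, period=24):
    # wrap-around distance between two hour values
    d = abs(a - b) % period
    return min(d, period - d)

print(circular_hour_distance(23, 0))  # 1, not 23
print(circular_hour_distance(22, 2))  # 4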
Peacher Wu