
I have a list of dates I'd like to cluster into 3 clusters. Now, I can see hints that I should be looking at k-means, but all the examples I've found so far are related to coordinates, in other words, pairs of list items.

I want to take this list of dates and append them to three separate lists indicating whether they were before, during or after a certain event. I don't have the time for this event, but that's why I'm guessing it by breaking the date/times into three groups.

Can anyone please help with a simple example on how to use something like numpy or scipy to do this?

MierMoto
  • Why not just sort the list and choose two points in time as the splits? If you're thinking in terms of clustering, do you expect the times to be "bunched up" near the event you're looking for? In that case, why not take the densest times as your event time? – chthonicdaemon Mar 29 '14 at 13:53
  • @chthonicdaemon Cheers for the question. I suspect the times should be "bunched up" in three groups. These are from photos taken during operations: they're supposed to take some photos before they start, then during the procedure, and then after, so there should be two natural gaps between the times. I'm trying to split these times off into three groups. – MierMoto May 14 '14 at 07:12
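
The idea in these two comments (sort the times, then cut at the two natural gaps) can be sketched without any clustering library. This is only a minimal sketch: the function name `split_by_largest_gaps` and the sample `photo_times` are made up for illustration, and it assumes the two longest pauses really are the boundaries between before, during and after.

```python
from datetime import datetime

def split_by_largest_gaps(times, n_groups=3):
    """Split datetimes into n_groups by cutting at the (n_groups - 1) largest gaps."""
    ordered = sorted(times)
    if len(ordered) < n_groups:
        raise ValueError("need at least as many timestamps as groups")
    # Gap i is the pause between ordered[i] and ordered[i + 1].
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    # Positions of the largest gaps, put back into chronological order.
    cut_after = sorted(i for _, i in sorted(gaps, reverse=True)[:n_groups - 1])
    groups, start = [], 0
    for i in cut_after:
        groups.append(ordered[start:i + 1])
        start = i + 1
    groups.append(ordered[start:])
    return groups

# Hypothetical photo timestamps (illustration only)
photo_times = [
    datetime(2014, 3, 28, 9, 0), datetime(2014, 3, 28, 9, 2),
    datetime(2014, 3, 28, 9, 5), datetime(2014, 3, 28, 10, 30),
    datetime(2014, 3, 28, 10, 41), datetime(2014, 3, 28, 10, 55),
    datetime(2014, 3, 28, 12, 10), datetime(2014, 3, 28, 12, 12),
]
before, during, after = split_by_largest_gaps(photo_times)
```

If a long pause can also occur in the middle of a procedure, the largest gaps would no longer be reliable boundaries and this sketch would split in the wrong place.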

2 Answers

k-means is exclusively for coordinates; more precisely, for continuous values on a linear scale.

The reason is the mean function. Many people overlook the role of the mean in k-means (despite it being in the name...)

On non-numerical data, how do you compute the mean?

There exist some variants for binary or categorical data. IIRC there is k-modes, for example, and there is k-medoids (PAM, partitioning around medoids).

It's unclear to me what you want to achieve overall... your data seems to be 1-dimensional, so you may want to look at the many questions here about 1-dimensional data (as the data can be sorted, it can be processed much more efficiently than multidimensional data).

In general, even if you projected your data onto Unix time (seconds since 1 January 1970), k-means will likely only return mediocre results for you. The reason is that it tends to make the three intervals have similar length.

Do you have any reason to suspect that "before", "during" and "after" have the same duration? If not, don't use k-means.

You may, however, want to have a look at kernel density estimation (KDE) and plot the estimated density. Once you have understood the role of density for your task, you can start looking at appropriate algorithms (e.g. take the derivative of your density estimate and look for the largest increase / decrease, or estimate an "average" level and look for the longest above-average interval).
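
As a rough illustration of the density idea, here is a minimal sketch using `scipy.stats.gaussian_kde`. It takes a simpler route than the derivative or above-average approaches mentioned above: it just looks for the low-density valleys between the bunches and uses them as split points. The sample `seconds` array and the bandwidth factor `0.2` are assumptions chosen for illustration; the bandwidth in particular needs tuning, since too wide a kernel smooths the valleys away.

```python
import numpy as np
from scipy.signal import argrelmin
from scipy.stats import gaussian_kde

# Hypothetical timestamps, in seconds since the first photo (illustration only)
seconds = np.array([0, 120, 300, 5400, 6060, 6900, 11400, 11520], dtype=float)

# A scalar bw_method is used directly as the KDE bandwidth factor.
# 0.2 is an assumed value; plot the density and adjust it for real data.
kde = gaussian_kde(seconds, bw_method=0.2)

# Evaluate the estimated density on a regular grid over the observed range.
grid = np.linspace(seconds.min(), seconds.max(), 1000)
density = kde(grid)

# Interior local minima of the density are candidate boundaries between groups.
split_points = grid[argrelmin(density)[0]]

# Label each timestamp by how many split points lie before it.
labels = np.searchsorted(split_points, seconds)
groups = [seconds[labels == k] for k in range(len(split_points) + 1)]
```

Plotting `grid` against `density` (for example with matplotlib) is the quickest way to check whether the chosen bandwidth yields exactly two valleys for your data.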

Has QUIT--Anony-Mousse

Here are some workaround methods that may not be the best answer but should help.

You can convert each date to a duration measured from a starting date (for example, one week earlier), so that every date becomes a number of minutes or hours from that starting point, and then plot those numbers.

These would all fall along the x-axis, but k-means should still be possible, and the clustering should still be visible on a graph.
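
A minimal sketch of that conversion followed by a one-dimensional k-means, using `scipy.cluster.vq.kmeans2`. The sample `photo_times` and the choice of initial centroids (earliest time, midpoint, latest time) are assumptions made for illustration. A single column of numbers is enough here, so no artificial second coordinate should be needed.

```python
from datetime import datetime

import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical photo timestamps (illustration only)
photo_times = [
    datetime(2014, 3, 28, 9, 0), datetime(2014, 3, 28, 9, 2),
    datetime(2014, 3, 28, 9, 5), datetime(2014, 3, 28, 10, 30),
    datetime(2014, 3, 28, 10, 41), datetime(2014, 3, 28, 10, 55),
    datetime(2014, 3, 28, 12, 10), datetime(2014, 3, 28, 12, 12),
]

# Convert each datetime to minutes elapsed since the earliest one.
start = min(photo_times)
minutes = np.array([(t - start).total_seconds() / 60.0 for t in photo_times])

# kmeans2 expects observations in rows, so 1-D data becomes an (n, 1) array.
data = minutes.reshape(-1, 1)

# Assumed starting centroids: earliest, midpoint and latest time.
# Passing them explicitly (minit="matrix") keeps the run deterministic.
init = np.array([[minutes.min()],
                 [(minutes.min() + minutes.max()) / 2.0],
                 [minutes.max()]])
centroids, labels = kmeans2(data, init, minit="matrix")

# Collect the original datetimes per cluster label.
groups = [[t for t, lab in zip(photo_times, labels) if lab == k] for k in range(3)]
```

Because the initial centroids are ordered in time, label 0 ends up as the earliest group and label 2 as the latest for this sample; with random initialization the label order is arbitrary and the groups would need sorting by centroid.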

Here are more examples using numpy: Python k-means algorithm

amanda fouts
  • I don't think I understand entirely, but I've fleshed out the question a bit more. All I can see from the examples is the use of pairs, but I'm not sure how to only use single list items. It almost seems that k-means is exclusively for coordinates. – MierMoto Mar 28 '14 at 03:01
  • I think you would define K=3 for 3 centroids or clusters. You will most likely have to come up with a way to convert your dates/times into a numeric format, like: double date = DateTime.Now.ToOADate(); (your y axis could be time of day or number of entries), and then use a k-means algorithm like the one here: http://codeding.com/articles/k-means-algorithm. Does that make sense? I think you would need to use x,y coordinates for your dates for k-means to work, as it generally clusters points in a group and would take a large overhaul to operate differently. – amanda fouts Mar 28 '14 at 16:19
  • For x and y: double x = DateTime.Now.ToOADate(); double y = numberofdateentries; – amanda fouts Mar 28 '14 at 16:25
  • Cheers for the info @amanda fouts. I tweaked your suggestion a bit and it worked for my purposes. I just took the dates to seconds, then doubled them up as if they're coordinates and used k-means. Then, using the result I just pulled the first value out of the list of two values which represent my fake coordinates and used that to process further. It's a bit of a cheat, but it worked and I'll use that. K-means seemed to be the only one I could use, which allowed me to specify the number of clusters I wanted. – MierMoto May 16 '14 at 06:22