I have logs of the number of arrivals at a bank, recorded every half hour for one month.

I am trying to find cluster groups according to the number of "arrivals". I have tried clustering by day, and I have tried clustering by hour (not tied to a specific day). Now I would like to cluster by the hour of a specific day.

An example:

  • Thursdays at 14:00 and Sundays at 15:00 are one cluster with an average of 10000 arrivals
  • Mondays at 13:00, Mondays at 10:00 and Tuesdays at 16:00 are one cluster with an average of 15000 arrivals.
  • All the rest form another cluster with an average of 2000 arrivals.

I have a csv file with the columns: Date, Day(1-7), Time, Arrivals

Until now I used this:

km <- kmeans(table, 3, 15)
plot(km)

(I tried 3 clusters.) This code clusters pairs of columns: the plot is a 3x3 matrix showing each pair of the 3 columns.

Is there a way to cluster by the hour of a specific day instead?

Jaap
user3649137
  • How is your data formatted? – ThatGuy May 25 '14 at 06:58
  • Welcome to SO. Please read [how to make a great R reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It’s always best to at least post some sample data (input) and maybe give an example of what you think the output should be. Also share any code that you’ve tried so far. This will make it much easier for others to help you. – MrFlick May 25 '14 at 07:17
  • I re-edited the question with further details – user3649137 May 25 '14 at 07:42
  • You could use `ddply` to regroup the data with a key that is the concatenation of 'time/hour' and 'specific day' – Reuben L. May 25 '14 at 08:32
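
The regrouping suggested in the last comment could look roughly like the sketch below. It uses base R's `aggregate` in place of `ddply` to avoid a package dependency. The data frame `arrivals_df`, the `Week` and `Slot` columns, and all values are made up for illustration; only the `Day`, `Time` and `Arrivals` column names follow the question.

```r
# Hypothetical sample data; Day, Time and Arrivals mirror the question's CSV columns.
set.seed(1)
arrivals_df <- expand.grid(
  Day  = 1:7,
  Time = c("10:00", "10:30", "11:00"),
  Week = 1:4,
  stringsAsFactors = FALSE
)
arrivals_df$Arrivals <- rpois(nrow(arrivals_df), lambda = 2000)

# One key per (day of week, time of day) slot, e.g. "4_10:30".
arrivals_df$Slot <- paste(arrivals_df$Day, arrivals_df$Time, sep = "_")

# Average arrivals per slot across the month.
slot_means <- aggregate(Arrivals ~ Slot, data = arrivals_df, FUN = mean)

# Cluster the slots on their mean arrival count alone (one-dimensional k-means).
km <- kmeans(slot_means$Arrivals, centers = 3, nstart = 20)
slot_means$Cluster <- km$cluster
head(slot_means[order(slot_means$Cluster), ])
```

Clustering only the per-slot mean keeps `Day` and `Time` out of the distance computation, so each cluster is simply a set of day/time slots with similar average load, which is the shape of result described in the question.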

1 Answer

k-means and similar algorithms will yield meaningless results on this kind of data.

The problem is you are using the wrong tool for the wrong problem on the wrong data.

Your data is: Date, Day(1-7), Time, Arrivals

K-means tries to minimize the within-cluster variance. But does variance make any sense on this data set? How do you know which k makes the most sense? Since Arrivals likely has by far the largest variance of these attributes, it will completely dominate your result.
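
A small sketch of the variance point, using invented numbers on roughly the scales mentioned in the question (the data frame `toy` and the `Hour` column are assumptions, not the asker's data):

```r
# Invented values: days 1-7, hours 8-17, arrivals in the thousands.
set.seed(42)
toy <- data.frame(
  Day      = sample(1:7, 200, replace = TRUE),
  Hour     = sample(8:17, 200, replace = TRUE),
  Arrivals = sample(c(2000, 10000, 15000), 200, replace = TRUE)
)

# Variance of each column; Arrivals is larger by several orders of magnitude.
col_var <- sapply(toy, var)
print(col_var)

# Share of the total variance carried by each column.
print(round(col_var / sum(col_var), 6))
```

With columns on such different scales, the squared distances that k-means sums are almost entirely differences in `Arrivals`, so `Day` and `Hour` barely influence the clusters.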

The question you should first try to answer is: what is a good result? Then consider ways of visualizing the results to verify that you are onto something. Once you have visualized the data, consider manually marking the desired result on the visualization; this may well be good enough for you. That is better than praying for k-means to yield a somewhat meaningful result, because on this kind of mixed-type data it usually does not work very well.
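
One possible way to visualize this kind of data is a day-by-time heat map of mean arrivals, sketched below in base R. The data frame `obs`, the `Week` column and the values are hypothetical; only the `Day`, `Time` and `Arrivals` names come from the question.

```r
# Hypothetical data: half-hour slots from 09:00 to 16:30, 7 days, 4 weeks.
set.seed(7)
times <- sprintf("%02d:%s", rep(9:16, each = 2), c("00", "30"))
obs <- expand.grid(Day = 1:7, Time = times, Week = 1:4, stringsAsFactors = FALSE)
obs$Arrivals <- rpois(nrow(obs), lambda = 2000)

# Mean arrivals for every (day, time) cell: rows are days, columns are times.
cell_means <- tapply(obs$Arrivals, list(obs$Day, obs$Time), mean)

# Heat map of the grid; unusually busy slots show up as differently coloured cells.
image(
  x = seq_len(nrow(cell_means)), y = seq_len(ncol(cell_means)), z = cell_means,
  axes = FALSE, xlab = "Day (1-7)", ylab = "Time"
)
axis(1, at = seq_len(nrow(cell_means)), labels = rownames(cell_means))
axis(2, at = seq_len(ncol(cell_means)), labels = colnames(cell_means), las = 1)
```

On real data, groups of slots with similar load may already be visible by eye in such a plot, which also gives a reference against which any clustering output can be checked.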

Has QUIT--Anony-Mousse
  • Thanks. What I am trying to do is find different groups of congestion types (as written above). What method would you recommend instead of k-means? If I normalize both day and arrivals, would it improve the outcome? – user3649137 May 26 '14 at 12:30
  • If you want less random results, it is a good idea to z-standardize your data beforehand. But don't expect the result to be meaningful if you treat clustering as a black box. – Has QUIT--Anony-Mousse May 26 '14 at 13:54
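
A minimal sketch of the z-standardization mentioned in the last comment, using base R's `scale`. The data frame `dat` and the numeric `Hour` column (time of day as hours since midnight) are assumptions made for illustration.

```r
# Hypothetical data in the question's column layout, with Time stored as numeric hours.
set.seed(3)
dat <- data.frame(
  Day      = sample(1:7, 150, replace = TRUE),
  Hour     = sample(seq(9, 16.5, by = 0.5), 150, replace = TRUE),
  Arrivals = rpois(150, lambda = 2000)
)

# z-standardize: subtract each column's mean and divide by its standard deviation.
dat_z <- scale(dat)

# Each column now has mean 0 and standard deviation 1.
print(round(colMeans(dat_z), 10))
print(apply(dat_z, 2, sd))

# nstart > 1 reruns k-means from several random starts and keeps the best fit.
km_z <- kmeans(dat_z, centers = 3, nstart = 25)
print(table(km_z$cluster))
```

Standardizing only puts the columns on a comparable scale. It does not make `Day` a meaningful numeric distance (day 7 and day 1 are adjacent in the week but far apart as numbers), so the caveat about treating clustering as a black box still applies.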