0

I am going through a list of algorithm that I found and try to implement them for learning purpose. Right now I am coding K mean and is confused in the following.

  1. How do you know how many cluster there is in the original data set

  2. Is there any particular format that I have follow in choosing the initial cluster centroid besides all centroid have to be different? For example does the algorithm converge if I choose cluster centroids that are different but close together?

Any advice would be appreciated

Thanks

Kevin
  • 2,187
  • 4
  • 26
  • 35
  • I think it would be constructive to this site if people can explain why the post is down voted, that way I can try my best to improve current and future questions – Kevin Dec 20 '14 at 04:43
  • Probably because the question [how to determine the number of clusters](http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters) has been [asked](http://stackoverflow.com/questions/11360403/how-to-determine-the-k-value-for-k-means-algorithm) [several](http://stackoverflow.com/questions/1793532/how-do-i-determine-k-when-using-k-means-clustering) [times](http://stackoverflow.com/questions/6212690/how-to-optimal-k-in-k-means-algorithm?lq=1) both here and in literature. I would have voted to close as duplicate, because of this. – Has QUIT--Anony-Mousse Dec 20 '14 at 13:05
  • @Anony-Mousse Afaik. the close -> duplicated is the proper way to deal with it then. – inf3rno Mar 23 '18 at 03:21
  • The question referenced by @Anony-Mousse mentions R specifically. The accepted answer is generic and summarizes common methods, also giving R code. The OP's second question, however, does not seem to be asked or answered in that link. – Karthick Aug 16 '18 at 14:39
  • As it was closed as "too broad", I'd recommend to make it more focused on the second topic, and then ask for it to be reopened. As for the second part, it is easy to see that k-means always converges (because there is a finite number or possible states, so no infinite improvement is possible). It may just take longer with a bad starting point, and converge to a worse solution. – Has QUIT--Anony-Mousse Aug 17 '18 at 10:09

2 Answers2

4

With k-means you are minimizing a sum of squared distances. One approach is to try all plausible values of k. As k increases the sum of squared distances should decrease, but if you plot the result you may see that the sum of squared distances decreases quite sharply up to some value of k, and then much more slowly after that. The last value that gave you a sharp decrease is then the most plausible value of k.

k-means isn't guaranteed to find the best possible answer each run, and it is sensitive to the starting values you give it. One way to reduce problems from this is to start it many times, with different starting values, and pick the best answer. It looks a bit odd if an answer for larger k is actually larger than an answer for smaller k. One way to avoid this is to use the best answer found for k clusters as the basis (with slight modifications) for one of the starting points for k+1 clusters.

mcdowella
  • 19,301
  • 2
  • 19
  • 25
0

In the standard K-Means the K value is chosen by you, sometimes based on the problem itself ( when you know how many classes exists OR how many classes you want to exists) other times a "more or less" random value. Typically the first iteration consists of randomly selecting K points from the dataset to serve as centroids. In the following iterations the centroids are adjusted.

After check the K-Means algorithm, I suggest you also see the K-means++, which is an improvement of the first version, as it tries to find the best K for each problem, avoiding the sometimes poor clusterings found by the standard k-means algorithm.

If you need more specific details on implementation of some machine learning algorithm, please let me know.