8

I'd like to understand the parameter max_iter from the class sklearn.cluster.KMeans.

According to the documentation:

max_iter : int, default: 300
Maximum number of iterations of the k-means algorithm for a single run.

But in my opinion, if I have 100 objects the code must run 100 times, and if I have 10,000 objects it must run 10,000 times, to classify every object. On the other hand, it makes no sense to run over all objects several times.

What is my misconception and how do I have to interpret this parameter?

C-Jay

2 Answers

5

Take a look here:

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Each time you click update centroids, a new iteration is performed. It makes sense, because when centroids are moved, distances to those centroids also change and some points may change cluster.
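To make "one iteration" concrete, here is a minimal NumPy sketch of the textbook Lloyd-style loop: reassign every point, then move the centroids, and repeat until nothing changes or `max_iter` is reached. The function name `lloyd_kmeans`, the random initialization, and the stopping rule are assumptions chosen for illustration; this is not sklearn's actual implementation.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=300, seed=0):
    """Illustrative k-means loop (hypothetical helper, not sklearn's code)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for n_iter in range(1, max_iter + 1):
        # Assignment step: one pass over ALL points, each goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no point changed cluster, so further iterations change nothing
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids, n_iter

# Usage: 100 points, yet the number of iterations is unrelated to 100.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
labels, centroids, n_iter = lloyd_kmeans(X, k=2)
print("iterations used:", n_iter)
```

The loop counter counts passes over the whole data set, not individual points, which is why `max_iter` does not need to grow with the number of objects.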

DataMan
mbednarski
  • Thanks! It appears that this is a difference between k-means in sklearn and MacQueen's version (page 283 of his publication: http://projecteuclid.org/download/pdf_1/euclid.bsmsp/1200512992) – C-Jay Dec 01 '16 at 11:03
  • Is it safe to say that the larger the value of max_iter, the better results you will get? – DataMan Jul 16 '17 at 02:16
3

Yes, you are misinterpreting the parameter.

One iteration is one pass over the entire data set. If you have 100 objects, one iteration assigns 100 points. If you have 10,000 objects, one iteration processes 10,000 objects.

There are more clever algorithms, but sklearn k-means processes every object in every iteration.
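As a quick check, you can compare `max_iter` with the fitted estimator's `n_iter_` attribute, which reports how many iterations the best run actually used. The sketch below assumes scikit-learn is installed; the synthetic data from `make_blobs` and the specific parameter values are only illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 10,000 points in 3 well-separated groups.
X, _ = make_blobs(n_samples=10_000, centers=3, random_state=0)

# max_iter is an upper bound per run; a run stops earlier once it converges.
km = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=0).fit(X)

print("points clustered:", km.labels_.shape[0])  # every point gets a label
print("iterations used:", km.n_iter_)            # never more than max_iter
```

On data like this the run typically stops well short of the limit, even though all 10,000 points were assigned in every iteration.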

Has QUIT--Anony-Mousse