
In Python's sklearn KMeans (see documentation), I was wondering what happens internally when an ndarray of shape (n, n_features) is passed to the init parameter and n < n_clusters.

  1. Does it drop the given centroids and just start a kmeans++ initialization, which is the default choice for the init parameter? (PDF paper kmeans++) (How does Kmeans++ work)
  2. Does it keep the given centroids and fill in the remaining centroids using kmeans++?
  3. Does it keep the given centroids and fill in the remaining centroids using random values?

I didn't expect this method to raise no warning in this case. That's why I need to know how it handles it.

belas

1 Answer


If you give it an init whose number of rows does not match n_clusters, it will adjust the number of clusters to the number of centroids you passed, as you can see from the source. This is not documented, and I would consider it a bug. I'll propose to fix it.
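One way to check what an installed version does is to probe it directly. The snippet below is only a sketch: the helper name `probe_short_init` and the toy data are made up for illustration. It assumes that the version discussed here silently ends up with as many clusters as rows in init, and that newer scikit-learn releases instead reject the mismatch with a ValueError, so it reports whichever happens.

```python
import numpy as np
from sklearn.cluster import KMeans


def probe_short_init(n_clusters=4, n_given=2, seed=0):
    """Fit KMeans with fewer init rows than n_clusters and report the outcome.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(100, 2))
    init = X[:n_given].copy()  # only n_given < n_clusters centroids
    try:
        km = KMeans(n_clusters=n_clusters, init=init, n_init=1).fit(X)
    except ValueError as exc:
        # The mismatch was rejected outright
        return "error", str(exc)
    # The fit went through; report how many centroids were actually used
    return "fitted", km.cluster_centers_.shape[0]


print(probe_short_init())
```

If the result is "fitted", the second element shows how many clusters were really used, which answers the question for that particular version.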

Andreas Mueller
  • It would be interesting if it filled the remaining centroids according to the kmeans++ initialization method, taking the given centroids into account. – belas May 12 '15 at 09:11
  • We could add an option to do that, but it seems very specific. In general, this is probably a sign of an error in user code. We could add a "fill_clusters='kmeans++'" option that by default raises an error. But I'm not sure it is worth adding this code. You can easily implement it yourself, though. – Andreas Mueller May 12 '15 at 20:14
  • How might you implement this? [Link to relevant question and background](https://stackoverflow.com/questions/64921503/define-k-1-cluster-centroids-sklearn-kmeans) – Sean Carter Nov 20 '20 at 16:44
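A minimal sketch of the "implement it yourself" route from the comments above, assuming plain D² sampling (the core idea of kmeans++) rather than scikit-learn's own seeding code, which also runs several greedy local trials. The function name `fill_centroids_kmeanspp` and its parameters are hypothetical, and it assumes at least one centroid is given.

```python
import numpy as np


def fill_centroids_kmeanspp(X, given, n_clusters, seed=None):
    """Keep the given centroids and add the rest by D^2 sampling from X.

    Hypothetical helper; not a scikit-learn API.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    given = np.atleast_2d(np.asarray(given, dtype=float))
    if given.size == 0 or given.shape[1] != X.shape[1]:
        raise ValueError("need at least one centroid with X's feature count")
    if given.shape[0] > n_clusters:
        raise ValueError("more centroids given than n_clusters")

    centers = [c for c in given]
    # Squared distance from every point to its nearest centroid so far
    diffs = X[:, None, :] - given[None, :, :]
    d2 = (diffs ** 2).sum(axis=2).min(axis=1)

    while len(centers) < n_clusters:
        total = d2.sum()
        if total == 0.0:
            # Every point coincides with a centroid: fall back to uniform
            idx = rng.integers(len(X))
        else:
            # Sample a point with probability proportional to d2
            idx = rng.choice(len(X), p=d2 / total)
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))

    return np.vstack(centers)
```

The returned array has shape (n_clusters, n_features), so it can be passed as init to KMeans together with the same n_clusters and n_init=1. Note that the given centroids are only starting points: the subsequent k-means iterations are free to move them.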