
Can I use the cluster_centers_ coordinates from a previous KMeans fit as the init argument, to sequentially update the cluster centers as new data arrives? Are there any drawbacks to this method?

UPDATED: Online version of scikit-learn's K-means (initial_batch is a placeholder for some starting data):

import numpy as np
from sklearn.cluster import KMeans

KM = KMeans(n_clusters=3, random_state=200, n_init=1)
X_seen = initial_batch                     # data seen so far, shape (n_samples, n_features)
KM.fit(X_seen)

# until interrupted:
for x in data:                             # x: one new sample, shape (n_features,)
    i = KM.predict(x.reshape(1, -1))[0]    # cluster whose centroid is closest to x
    n_i = np.sum(KM.labels_ == i)          # current size of cluster i
    updated_centroids = KM.cluster_centers_.copy()
    updated_centroids[i] += (x - updated_centroids[i]) / (n_i + 1)

    # re-seed KMeans with the shifted centroids and refit on all data seen so far
    X_seen = np.vstack([X_seen, x])
    KM = KMeans(n_clusters=3, random_state=200, init=updated_centroids, n_init=1)
    KM.fit(X_seen)
endorphinus

1 Answer


Yes, it is a possible solution. However, you could further improve your implementation by following this pseudo-code (for more info, take a look at this post: Online k-means clustering):

Make initial guesses for the means m1, m2, ..., mk
Set the counts n1, n2, ..., nk to zero
Until interrupted
    Acquire the next example, x
    If mi is closest to x
        Increment ni
        Replace mi by mi + (1/ni)*(x - mi)
    end_if
end_until

Following this version of the online algorithm, you only need to remember the mean of each cluster and the count of data points assigned to it. Once you update those two variables, you can forget the data point itself.
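For concreteness, here is a minimal NumPy sketch of that update rule. It is only a sketch: initial_means is a placeholder for your k starting centroids (e.g. taken from a first KMeans fit on an initial batch).

import numpy as np

means = initial_means.astype(float).copy()  # shape (k, n_features), assumed given
counts = np.zeros(len(means))               # n1, ..., nk start at zero

def update(x):
    # assign x to its closest mean, then nudge that mean toward x
    i = np.argmin(np.linalg.norm(means - x, axis=1))
    counts[i] += 1
    means[i] += (x - means[i]) / counts[i]  # running-mean update
    # x can now be forgotten; only means and counts persist

Each incoming sample triggers one assignment and one in-place mean update, so memory use stays constant no matter how long the stream runs.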

Compared to yours, this solution doesn't need to keep the past data around, so it is more efficient in both computation and memory.

This exact implementation is not available in scikit-learn. The closest alternative is probably the MiniBatchKMeans estimator with its partial_fit method.
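As a rough sketch of that pattern (stream_of_batches is just a stand-in for whatever yields your incoming data as 2-D arrays):

from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=3, random_state=200)
for batch in stream_of_batches:   # each batch: shape (n_samples, n_features)
    mbk.partial_fit(batch)        # incrementally updates mbk.cluster_centers_

Each partial_fit call refines the current centroids using only that batch, so past data doesn't need to be retained between calls.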

Molessia
  • Thanks! What are the drawbacks of using the method I mentioned as opposed to the method you mention? – endorphinus Aug 05 '20 at 16:19
  • Also, is there a neat way to use the scikit-learn KMeans API to implement the method you mentioned? Right now I just fit the new data point with KMeans initialized at the previously calculated centroids. – endorphinus Aug 05 '20 at 17:10
  • This exact implementation is not available in Scikit Learn. The closest implementation is probably the MiniBatchKMeans (https://scikit-learn.org/stable/modules/clustering.html#mini-batch-kmeans) estimator with the partial_fit method. – Molessia Aug 06 '20 at 09:31
  • The difference between the proposed solution and yours is that with the online k-means you won't need to keep the past data, so it is computationally more efficient. – Molessia Aug 06 '20 at 09:33
  • Well, the problem with partial_fit and feeding one instance at a time with MiniBatchKMeans is that it sometimes does not converge and gives poor results; see for example here: https://github.com/scikit-learn/scikit-learn/issues/13476. So I am left with writing the online algorithm from scratch. Have you seen any examples around? – endorphinus Aug 06 '20 at 13:35
  • One way might be to 'hack' the sklearn KMeans, although not entirely correct; see the updated online version above. – endorphinus Aug 06 '20 at 13:55
  • I found a few example implementations of online k-means in Python on GitHub (you can find others implemented in, for example, C++ or Go): https://github.com/dieggoluis/online-kmeans/blob/master/online_kmeans.ipynb https://gist.github.com/yjzhang/aaf460849a4398422785c0e85932688d – Molessia Aug 06 '20 at 13:59
  • Thank you AlessiaM! – endorphinus Aug 07 '20 at 04:58
  • You're welcome! If you found my answer helpful, please consider selecting it as the accepted answer. This will help other users who might run into a similar problem! – Molessia Aug 07 '20 at 08:45