
In scikit-learn, KMeans has an n_jobs parameter but MiniBatchKMeans lacks it. MiniBatchKMeans is faster than KMeans, but on large sample sets we would like to distribute the processing across multiple processes (with multiprocessing or another parallel processing library).

Is MiniBatchKMeans's partial_fit the answer?
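
A minimal sketch of what partial_fit usage looks like (random data as a stand-in for a large sample set); note the chunks are still consumed sequentially, so by itself this does not parallelize anything:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Random data as a stand-in for a large sample set.
big_X = np.random.rand(1000000, 20)

mbk = MiniBatchKMeans(n_clusters=8)
for chunk in np.array_split(big_X, 100):
    mbk.partial_fit(chunk)  # chunks are consumed one after another
```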

Phyo Arkar Lwin

1 Answer


I don't think this is possible. You could implement something with OpenMP inside the minibatch processing, but I'm not aware of any parallel minibatch k-means procedures. Parallelizing stochastic gradient descent procedures is somewhat hairy.

By the way, the n_jobs parameter in KMeans only distributes the different random initializations, as far as I know.
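
To illustrate that point (assuming a scikit-learn version from this era; n_jobs was later deprecated and removed from KMeans), the parallelism covers the n_init independent restarts, not the iterations within a single run:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=8, random_state=0)

# The 10 random restarts run in up to 4 processes; each restart's
# Lloyd iterations are still sequential.
km = KMeans(n_clusters=8, n_init=10, n_jobs=4).fit(X)
```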

Andreas Mueller
  • It might be possible to warm up a model to reach a local minimum basin, and then fine-tune clones of the original model on partitions of the dataset, with averaging from time to time. I have never tried it though. – ogrisel Jun 12 '13 at 15:58
  • Is there a particular reason why you would warm up and not start with partitions? Also, how do you average? Try to find correspondences between the clusters and then just average the centers? Or do you warm start to have some good initialization and expect the correspondence to be stable? – Andreas Mueller Jun 12 '13 at 16:01
  • It's unlikely that centroid #2 of model #0 will be close to centroid #2 of model #1... The warm-up is there to make a stable match of centroids possible (a rough sketch of this scheme appears after these comments). – ogrisel Jun 12 '13 at 16:35
  • Oh well, I'm not familiar with OpenMP, so I am out of luck. It is easy to do distributed supervised learning like LinearSVC, so I thought it would be possible with MiniBatchKMeans too. So how do clouds with multiple nodes use KMeans? Do they not at all? – Phyo Arkar Lwin Jun 12 '13 at 21:23
  • Hey guys, how about this one? Looks interesting: https://code.google.com/p/ddk-means-clustering-system/ and I found this too: https://code.google.com/p/dynamic-distributed-kmeans-clustering-python/source/list – Phyo Arkar Lwin Jun 12 '13 at 21:29
  • It is easy to do LinearSVC in parallel? Really? How? You can parallelize the OvR but I don't think sklearn supports that. – Andreas Mueller Jun 17 '13 at 07:06
  • And yes, there are ways to parallelize KMeans somehow. But that is highly non-trivial and possibly uses different algorithms. I don't think there is a well-established method. Maybe look at Mahout. – Andreas Mueller Jun 17 '13 at 07:08
  • @AndreasMueller Seems you forgot your own answer to me :D http://stackoverflow.com/questions/13068257/multiprocessing-scikit-learn/13082746#13082746 – Phyo Arkar Lwin Jun 19 '13 at 16:12
  • The DDKMeans algorithm on Google Code is inefficient: it is implemented in pure Python (they do not even use numpy, so it is going to be very slow), and it uses built-in sockets with peer-to-peer distribution. Maybe you guys can implement something similar in sklearn? I also found some papers on parallel k-means and DDKMeans; I will read more into them and see if I can contribute. – Phyo Arkar Lwin Jun 19 '13 at 16:17
  • @V3ss0n my other answer was about SGDClassifier. That is in a way a much simpler problem (convex for example). – Andreas Mueller Jun 22 '13 at 16:30
  • Yeah, my mistake. Here you guys had a discussion about it back at the end of 2011: http://comments.gmane.org/gmane.comp.python.scikit-learn/1287 , I am going to check with the original poster. But OpenMP is a single-machine implementation only; it cannot distribute processing across a cluster, right? We use ZeroMQ (with ZeroRPC) for parallelizing operations across multiple servers, so it would be interesting if we could implement one using ZeroMQ. – Phyo Arkar Lwin Jul 02 '13 at 14:56
  • Also, there is a C implementation using MPI that seems to work very well: http://users.eecs.northwestern.edu/~wkliao/Kmeans/ . It would be nice to have that inside sklearn. – Phyo Arkar Lwin Jul 02 '13 at 15:01
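
For reference, here is a rough sketch of the warm-up / partition / average scheme discussed in the comments above. Nobody in this thread has tried it; the stable centroid correspondence after warm-up and the plain-averaging merge step are exactly the untested assumptions:

```python
import numpy as np
from copy import deepcopy
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100000, centers=5, random_state=0)

# Warm up one model so every clone starts in the same local minimum basin.
warm = MiniBatchKMeans(n_clusters=5, random_state=0).fit(X[:10000])

# Fine-tune clones on disjoint partitions; these iterations are
# independent and could run in separate processes or on separate nodes.
clones = []
for part in np.array_split(X[10000:], 4):
    clone = deepcopy(warm)
    clone.partial_fit(part)
    clones.append(clone)

# Merge step: average corresponding centroids, relying on the warm start
# to keep centroid #k of every clone near the same cluster.
warm.cluster_centers_ = np.mean(
    [c.cluster_centers_ for c in clones], axis=0)
```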