
I'm having the same problem as in this post, but I don't have enough points to add a comment there. My dataset has 1 million rows and 100 columns. I'm also using MLlib KMeans and it is extremely slow. In fact the job never finishes and I have to kill it. I am running this on Google Cloud (Dataproc). It runs if I ask for a smaller number of clusters (k=1000), but still takes more than 35 minutes. I need it to run for k~5000. I have no idea why it is so slow. The data is properly partitioned given the number of workers/nodes, and SVD on a 1 million × ~300,000 column matrix takes ~3 minutes, but when it comes to KMeans it just goes into a black hole. I am now trying a lower number of iterations (2 instead of 100), but I feel something is wrong somewhere.

KMeansModel Cs = KMeans.train(datamatrix, k, 100); // 100 iterations (now changed to 2); number of clusters k = 1000 or 5000
Kai
  • Changing the # of iterations to 2 made NO difference at all. – Kai Feb 19 '16 at 20:27
  • Kai, I have a [similar problem](http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata). However, in my case the job simply *hangs*; it's not just that it's slow. Did you see any progress when running your job (i.e. it was merely slow), or did it do nothing at all, like in my case? – gsamaras Sep 02 '16 at 16:53

2 Answers


It looks like the reason is relatively simple: you use quite a large k and combine it with an expensive initialization algorithm.

By default Spark uses a distributed variant of K-means++ called K-means|| (see What exactly is the initializationSteps parameter in Kmeans++ in Spark MLLib?). The distributed version is roughly O(k), so with a larger k you can expect a slower start. This explains why you see no improvement when you reduce the number of iterations: the time is dominated by initialization, not by the iterations themselves.

Using a large k is also expensive when the model is trained. Spark uses a variant of Lloyd's algorithm, which is roughly O(nkdi) (points × clusters × dimensions × iterations); with n = 10^6, k = 5000, d = 100 and i = 100, that is on the order of 5×10^13 operations.

If you expect a complex structure in the data, there are most likely better algorithms out there to handle it than K-Means, but if you really want to stick with it, start by using random initialization.
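A minimal sketch of what that could look like with the MLlib builder API (assuming `datamatrix` and `k` are the same variables as in your snippet; this is an illustration, not a drop-in replacement):

    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;

    // Same training as KMeans.train(datamatrix, k, 100), but with random
    // initialization instead of the default (expensive) k-means|| seeding.
    // Assumes datamatrix is an RDD<Vector>, as in the question's snippet.
    KMeansModel Cs = new KMeans()
        .setK(k)                          // k = 1000 or 5000
        .setMaxIterations(100)            // initialization dominates, so this matters less
        .setInitializationMode("random")  // default is "k-means||"
        .run(datamatrix);

If you would rather keep k-means||, you can also try lowering `setInitializationSteps(...)` to make the seeding cheaper.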

zero323
  • are you saying that most of the time is consumed by this "initialization"? – Kai Feb 20 '16 at 01:36
  • I am saying this is an expensive step and it accounts for the behavior you see. But more importantly, training K-means with thousands of clusters cannot perform well. – zero323 Feb 20 '16 at 02:02
  • Just ran the Spark job with 5000 clusters and random initialization, and it finished in 7 min!! Awesome!! Now I'll go read the papers to see the impact on accuracy. Thank you yet again, zero. As for the number of clusters, I think the dimensionality of the problem is a lot more critical: in very high dims every point is "far" from every other point. The number of points isn't really important for anything more than execution speed. – Kai Feb 20 '16 at 02:55
  • Dimensionality can affect the output, but for different reasons (both algorithmically and in terms of available optimizations) it doesn't affect running time as much. But I am glad it was helpful. – zero323 Feb 20 '16 at 12:12

Please try other implementations of k-means. Some, like the variants in ELKI, are way better than Spark, even on only a single CPU. You will be surprised how much performance you can get out of a single node, without going to a cluster! From my experiments, you would unfortunately need at least a 100-node cluster to beat good local implementations.

I have read that these C++ versions are multi-core (but single-node) and probably the fastest k-means implementations you can find right now, but I have not tried them myself yet (for all my needs, the ELKI versions were blazingly fast, finishing in a few seconds on my largest data sets).
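If you want to try ELKI, a single-node run looks roughly like the sketch below. This is an assumption-heavy illustration recalled from the ELKI 0.7 command line, not a verified invocation: the bundle jar name, input file, algorithm class, and parameter names (-dbc.in, -kmeans.k, -kmeans.maxiter) should all be checked against the -help output of your ELKI version.

    # Hypothetical ELKI CLI invocation; verify parameter names with -help first.
    java -cp elki-bundle-0.7.1.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication \
      -dbc.in data.csv \
      -algorithm clustering.kmeans.KMeansLloyd \
      -kmeans.k 5000 \
      -kmeans.maxiter 100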

Has QUIT--Anony-Mousse