
I have vectors grouped by id in an RDD of type RDD[(Int, Array[Vector])]. I want to cluster each group of vectors separately, per id.

MLlib's k-means implementation requires an RDD[Vector] as its argument:

val kmean = new KMeans().setK(3)
  .setEpsilon(100)
  .setMaxIterations(10)
  .setInitializationMode("k-means||")
  .setSeed(System.currentTimeMillis())    

But obviously, when I map over my RDD, I get a plain Array[Vector], not one wrapped in an RDD:

// does not work, since e._2 is an Array[Vector], not an RDD[Vector]!
rdd.map(e => kmean.run(e._2))
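One common workaround for this nested-RDD problem (a sketch, not an official MLlib feature) is to collect the distinct ids to the driver and run one KMeans job per id, filtering the original RDD each time. This assumes the `rdd` and `kmean` values defined above, and is only reasonable when the number of groups is small, since `rdd` is scanned once per id:

```scala
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector

// Sketch: one KMeans run per id. Collects the distinct ids to the driver,
// then filters the original RDD per id -- acceptable for a small number of
// groups, but note that `rdd` is scanned once for each id.
val ids: Array[Int] = rdd.keys.distinct().collect()

val models: Map[Int, KMeansModel] = ids.map { id =>
  // flatMap unwraps Array[Vector] into an RDD[Vector] for this id only
  val vecs = rdd.filter(_._1 == id).flatMap(_._2).cache()
  val model = kmean.run(vecs)
  vecs.unpersist()
  id -> model
}.toMap
```

If each group is small enough to fit in driver memory, it may be faster to skip Spark for the clustering step entirely and run a local k-means per group instead, as the comments below suggest.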

So the question is: how can I perform such clustering?

Thanks for help in advance!

Ziemo
  • If the data fits into a local array, using MLlib doesn't make sense at all. It will be orders of magnitude (literally) slower than optimized local libraries. And you have no choice other than splitting the RDD by group and flattening anyway. – zero323 Jun 11 '16 at 16:43
  • Good point! But what kind of local clustering library can you suggest in Scala? I tried ELKI, but in my opinion API usage without well-defined examples in the documentation is slow going. And it's Java. – Ziemo Jun 11 '16 at 16:59
  • Truth be told, I am not aware of any pure Scala implementation. There is Weka, but it is Javaish and the documentation is not so great either. Mahout had some in-core implementations, but they have been deprecated without replacement. Incanter has k-means, which should be accessible with some effort, but I guess it could be too extreme. I am tempted to suggest PySpark with scikit-learn, but I am just a sad Pythonista ;) – zero323 Jun 11 '16 at 17:15
  • Thinking outside the box, you could try [OpenCPU executor](https://github.com/onetapbeyond/opencpu-spark-executor). If you decide to use Spark after all, I would reduce the number of partitions to the minimum (1) and try to compensate with async submission. I would also be careful with K-Means|| which [can be slow](http://stackoverflow.com/q/35512139/1560062). – zero323 Jun 11 '16 at 17:17
  • I'd dump the data into CSV and run ELKI on that. There is little reason to do everything in the same JVM. At shutdown, it does not need to do garbage collection. So for large enough tasks, running a new JVM may be worth it. – Has QUIT--Anony-Mousse Jun 12 '16 at 07:22
  • For calling ELKI from Java there is a very easy example in the documentation: http://elki.dbs.ifi.lmu.de/wiki/HowTo/InvokingELKIFromJava#PureJavaAPI - should work the same from Scala. – Has QUIT--Anony-Mousse Jun 12 '16 at 12:53
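Following the suggestion in the comments to cluster each group locally rather than through MLlib, here is a minimal Lloyd's-algorithm sketch in plain Scala. `LocalKMeans` is a hypothetical name, and it uses `Array[Double]` rather than MLlib's `Vector` to stay dependency-free; it is an illustration of the approach, not a tuned implementation (no k-means|| seeding, no convergence check):

```scala
// Minimal local Lloyd's k-means, so each id's Array of points can be
// clustered in-memory (e.g. inside mapValues) without a nested RDD.
object LocalKMeans {
  type Vec = Array[Double]

  // squared Euclidean distance
  def dist2(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def kmeans(points: Array[Vec], k: Int, maxIter: Int = 10, seed: Long = 42L): Array[Vec] = {
    val rnd = new scala.util.Random(seed)
    // random initialization: k distinct points as starting centers
    var centers = rnd.shuffle(points.toSeq).take(k).toArray
    for (_ <- 1 to maxIter) {
      // assign each point to its nearest center
      val clusters = points.groupBy(p => centers.indices.minBy(i => dist2(p, centers(i))))
      // recompute each center as the mean of its cluster; keep old center if empty
      centers = centers.indices.map { i =>
        clusters.get(i)
          .map(ps => ps.transpose.map(col => col.sum / col.length))
          .getOrElse(centers(i))
      }.toArray
    }
    centers
  }
}
```

With something like this, the per-group clustering becomes an ordinary map over the grouped data, e.g. `rdd.mapValues(vs => LocalKMeans.kmeans(vs.map(_.toArray), 3))` if the per-id arrays fit in executor memory.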

0 Answers