
I have vectors grouped by id in an RDD of type RDD[(Int, Array[Vector])]. I want to cluster each group of vectors separately, per id.

MLlib's k-means implementation requires an RDD[Vector] as its argument:

val kmean = new KMeans().setK(3)
  .setEpsilon(100)
  .setMaxIterations(10)
  .setInitializationMode("k-means||")
  .setSeed(System.currentTimeMillis())    

But obviously, when I map over my RDD, I get a plain Array[Vector], not one wrapped in an RDD:

// does not work, since e._2 is an Array[Vector], not an RDD[Vector]!
rdd.map(e => kmean.run(e._2))
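One common workaround for this nested-RDD problem (a sketch, not an official MLlib feature) is to collect the distinct ids to the driver and run one KMeans job per id, filtering the original RDD each time. This assumes the `rdd` and `kmean` values defined above, and is only reasonable when the number of groups is small, since `rdd` is scanned once per id:

```scala
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector

// Sketch: one KMeans run per id. Collects the distinct ids to the driver,
// then filters the original RDD per id -- acceptable for a small number of
// groups, but note that `rdd` is scanned once for each id.
val ids: Array[Int] = rdd.keys.distinct().collect()

val models: Map[Int, KMeansModel] = ids.map { id =>
  // flatMap unwraps Array[Vector] into an RDD[Vector] for this id only
  val vecs = rdd.filter(_._1 == id).flatMap(_._2).cache()
  val model = kmean.run(vecs)
  vecs.unpersist()
  id -> model
}.toMap
```

If each group is small enough to fit in driver memory, it may be faster to skip Spark for the clustering step entirely and run a local k-means per group instead, as the comments below suggest.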

So the question is: how can I perform such clustering?

Thanks for help in advance!

Ziemo
  • If the data fits into a local array, using MLlib doesn't make sense at all. It will be orders of magnitude (literally) slower than optimized local libraries. And you have no choice other than splitting the RDD by group and flattening anyway. – zero323 Jun 11 '16 at 16:43
  • Good point! But what kind of local clustering library can you suggest in Scala? I tried ELKI, but in my opinion API usage without well-defined examples in the documentation is slow going. And it's Java. – Ziemo Jun 11 '16 at 16:59
  • Truth be told, I am not aware of any pure Scala implementation. There is Weka, but it is Javaish and the documentation is not so great either. Mahout had some in-core implementations, but they have been deprecated without replacement. Incanter has k-means, which should be accessible with some effort, but I guess it could be too extreme. I am tempted to suggest PySpark with scikit-learn, but I am just a sad Pythonista ;) – zero323 Jun 11 '16 at 17:15
  • Thinking outside the box, you could try [OpenCPU executor](https://github.com/onetapbeyond/opencpu-spark-executor). If you decide to use Spark after all, I would reduce the number of partitions to the minimum (1) and try to compensate with async submission. I would also be careful with K-Means|| which [can be slow](http://stackoverflow.com/q/35512139/1560062). – zero323 Jun 11 '16 at 17:17
  • I'd dump the data into CSV and run ELKI on that. There is little reason to do everything in the same JVM. At shutdown, it does not need to do garbage collection. So for large enough tasks, running a new JVM may be worth it. – Has QUIT--Anony-Mousse Jun 12 '16 at 07:22
  • For calling ELKI from Java there is a very easy example in the documentation: http://elki.dbs.ifi.lmu.de/wiki/HowTo/InvokingELKIFromJava#PureJavaAPI - should work the same from Scala. – Has QUIT--Anony-Mousse Jun 12 '16 at 12:53
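Following the suggestion in the comments to cluster each group locally rather than through MLlib, here is a minimal Lloyd's-algorithm sketch in plain Scala. `LocalKMeans` is a hypothetical name, and it uses `Array[Double]` rather than MLlib's `Vector` to stay dependency-free; it is an illustration of the approach, not a tuned implementation (no k-means|| seeding, no convergence check):

```scala
// Minimal local Lloyd's k-means, so each id's Array of points can be
// clustered in-memory (e.g. inside mapValues) without a nested RDD.
object LocalKMeans {
  type Vec = Array[Double]

  // squared Euclidean distance
  def dist2(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def kmeans(points: Array[Vec], k: Int, maxIter: Int = 10, seed: Long = 42L): Array[Vec] = {
    val rnd = new scala.util.Random(seed)
    // random initialization: k distinct points as starting centers
    var centers = rnd.shuffle(points.toSeq).take(k).toArray
    for (_ <- 1 to maxIter) {
      // assign each point to its nearest center
      val clusters = points.groupBy(p => centers.indices.minBy(i => dist2(p, centers(i))))
      // recompute each center as the mean of its cluster; keep old center if empty
      centers = centers.indices.map { i =>
        clusters.get(i)
          .map(ps => ps.transpose.map(col => col.sum / col.length))
          .getOrElse(centers(i))
      }.toArray
    }
    centers
  }
}
```

With something like this, the per-group clustering becomes an ordinary map over the grouped data, e.g. `rdd.mapValues(vs => LocalKMeans.kmeans(vs.map(_.toArray), 3))` if the per-id arrays fit in executor memory.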

0 Answers