I am fetching data from Alluxio in Mahout using sc.textFile(), but that returns a Spark RDD. My program then uses this data as a Mahout DRM, so I need to convert the RDD into a DRM while keeping the rest of my code unchanged.
2 Answers
An Apache Mahout DRM can be created from an Apache Spark RDD in the following steps:
- Convert each row of the RDD into a Mahout Vector
- Zip the RDD with an index (and swap so each tuple has the form (Long, Vector))
- Wrap the RDD with a DRM
Consider the following example code:
import org.apache.mahout.math.DenseVector
import org.apache.mahout.sparkbindings._  // provides DrmRdd and drmWrap

// Each row must be an Array[Double] so it can back a DenseVector
val rddA = sc.parallelize(Array(
  Array(1.0, 2.0, 3.0),
  Array(2.0, 3.0, 4.0),
  Array(4.0, 5.0, 6.0)))

val drmRddA: DrmRdd[Long] = rddA
  .map(a => new DenseVector(a))
  .zipWithIndex()
  .map(t => (t._2, t._1))   // swap to (index, vector)

val drmA = drmWrap(rdd = drmRddA)
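Once wrapped, the DRM works with Mahout's Samsara R-like DSL. A minimal sketch of a follow-up step, assuming drmA from the snippet above and the standard Mahout Spark bindings on the classpath:

```scala
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.scalabindings.RLikeOps._

// Compute A' * A distributedly, then bring the small
// result into core memory for inspection.
val drmAtA = drmA.t %*% drmA
val inCore = drmAtA.collect   // in-core Mahout Matrix
println(inCore)
```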
Source / more info (shameless self-promotion, toward the bottom): my blog

The main issue with converting data is often that Mahout uses integers to reference the row and column numbers of a general matrix, while real data usually has its own row and column keys, typically string IDs of some sort.
Mahout has a class called IndexedDatasetSpark which preserves the IDs in BiMaps (actually BiDictionaries) while also creating a Mahout DRM. The benefit is that the dictionaries will convert the integer row and column indices back into your IDs after the math is done.
If you have an RDD[(String, String)] of elements for a matrix, this will do the conversion. If you have an array of rows, you may be able to start from this to code your own conversion.
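A minimal sketch of that conversion, assuming the Mahout Spark bindings are on the classpath and each element of the RDD is a (rowID, columnID) pair of strings (the sample IDs below are made up for illustration):

```scala
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark

// Element pairs keyed by application-level string IDs
val elements = sc.parallelize(Seq(
  ("user1", "itemA"),
  ("user1", "itemB"),
  ("user2", "itemA")))

// Builds the DRM and the row/column dictionaries in one step
val ids = IndexedDatasetSpark(elements)(sc)

val drm    = ids.matrix     // the Mahout DRM
val rowIDs = ids.rowIDs     // dictionary mapping String row keys <-> Int indices
val colIDs = ids.columnIDs  // dictionary mapping String column keys <-> Int indices
```

After any math, the dictionaries let you translate integer indices back into the original string IDs.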

- For an example of how to transform an RDD into an IDS please see [this gist](https://gist.github.com/rawkintrevo/c1bb00896263bdc067ddcd8299f4794c) – rawkintrevo Apr 07 '17 at 16:00