8

I have read somewhere that MLlib local vectors/matrices currently wrap the Breeze implementation, but the methods converting MLlib to Breeze vectors/matrices are private to the org.apache.spark.mllib scope. The suggested workaround is to write your code in an org.apache.spark.mllib.something package.

Is there a better way to do this? Can you cite some relevant examples?

Thanks and regards,

zero323
learning_spark

6 Answers

5

I used the same solution that @dlwh suggested. Here is the code that does it:

package org.apache.spark.mllib.linalg

object VectorPub {

  implicit class VectorPublications(val vector: Vector) extends AnyVal {
    def toBreeze: breeze.linalg.Vector[scala.Double] = vector.toBreeze
  }

  implicit class BreezeVectorPublications(val breezeVector: breeze.linalg.Vector[Double]) extends AnyVal {
    def fromBreeze: Vector = Vectors.fromBreeze(breezeVector)
  }
}

Notice that the implicit classes extend AnyVal to prevent allocation of a new object when calling those methods.
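With the object above on the classpath and its implicits imported, client code can round-trip through Breeze without touching Spark internals directly. A sketch (untested; assumes the `VectorPub` object from this answer is compiled into your assembly):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.VectorPub._

val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.dense(4.0, 5.0, 6.0)

// add in Breeze, then convert the result back to an MLlib Vector
val sum = (v1.toBreeze + v2.toBreeze).fromBreeze
```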

Till Rohrmann
lev
  • This code is placed inside the Spark mllib.linalg package. That is not a viable general solution for clients of the MLlib framework: they should not be touching the framework's classes and packages. – WestCoastProjects Feb 02 '15 at 18:48
  • It's in the spark.mllib.linalg package, but Spark doesn't need to be recompiled for this. Just create a new assembly that wraps the existing Spark assembly and add this class there. It's kinda hacky, but it's the best I found. – lev Feb 03 '15 at 03:30
  • Stuff like this is a bit dangerous. For instance, if you take a slice of your Breeze vector and attempt to wrap it with `fromBreeze`, it will fail. – VF1 Aug 28 '16 at 23:09
3

My solution is kind of a hybrid of those of @barclar and @lev, above. You don't need to put your code in the org.apache.spark.mllib.linalg package if you don't make use of the spark-ml implicit conversions. You can define your own implicit conversions in your own package, like:

package your.package

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.ml.linalg.Vector
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}

object BreezeConverters
{
    implicit def toBreeze( dv: DenseVector ): BDV[Double] =
        new BDV[Double](dv.values)

    implicit def toBreeze( sv: SparseVector ): BSV[Double] =
        new BSV[Double](sv.indices, sv.values, sv.size)

    implicit def toBreeze( v: Vector ): BV[Double] =
        v match {
            case dv: DenseVector => toBreeze(dv)
            case sv: SparseVector => toBreeze(sv)
        }

    implicit def fromBreeze( dv: BDV[Double] ): DenseVector =
        new DenseVector(dv.toArray)

    implicit def fromBreeze( sv: BSV[Double] ): SparseVector =
        new SparseVector(sv.length, sv.index, sv.data)

    implicit def fromBreeze( bv: BV[Double] ): Vector =
        bv match {
            case dv: BDV[Double] => fromBreeze(dv)
            case sv: BSV[Double] => fromBreeze(sv)
        }
}

Then you can import these implicits into your code with:

import your.package.BreezeConverters._
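Once imported, the conversions apply wherever the expected type makes the direction unambiguous; the explicit calls also work when you want to be clear. A sketch (untested; assumes spark-ml and Breeze on the classpath, and `your.package` as defined above):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import breeze.linalg.{Vector => BV}
import your.package.BreezeConverters._

val v = Vectors.dense(1.0, 2.0, 3.0)

// implicit conversion kicks in because the target type is a Breeze vector
val bv: BV[Double] = v

// do the arithmetic in Breeze, then convert back explicitly
val doubled: Vector = fromBreeze(bv + bv)
```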
corvi42
2

As I understand it, the Spark people do not want to expose third party APIs (including Breeze) so that it's easier to change if they decide to move away from them.

You could always put just a simple implicit conversion class in that package and write the rest of your code in your own package. Not much better than just putting everything in there, but it makes it a little more obvious why you're doing it.

dlwh
  • Putting code in the mllib.linalg package is not a viable solution for clients of the MLlib framework. – WestCoastProjects Feb 01 '15 at 01:35
  • I agree it's dumb, but you only have to put one little class (as witnessed by @lev), and it's the best workaround that doesn't involve needless creation of extra arrays, like your solution below. – dlwh Feb 01 '15 at 02:44
  • (I of course think they should just expose Breeze as "experimental" if they want to reserve the right to change it, but it's out of my hands.) – dlwh Feb 01 '15 at 02:45
  • But adding to mllib.linalg is an "out of bounds" solution for a general client (which shall not modify that package): it is a non-starter. Nor do I prefer my solution in terms of convenience, but at least it is "legal". If you have an idea for a generally *permissible* solution that is better, I am all for it. – WestCoastProjects Feb 01 '15 at 05:43
  • Anything that's a better solution requires politicking, I'm afraid. – dlwh Feb 01 '15 at 18:08
1

Here is the best I have so far. Note to @dlwh: please do provide any improvements you might have to this.

The solution I could come up with, one that does not put code inside the mllib.linalg package, is to convert each Vector to a new Breeze DenseVector.

import breeze.linalg.DenseVector
import org.apache.spark.mllib.linalg.Vectors

val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.dense(4.0, 5.0, 6.0)
val bv1 = new DenseVector(v1.toArray)
val bv2 = new DenseVector(v2.toArray)
val vectout = Vectors.dense((bv1 + bv2).toArray)
// vectout: org.apache.spark.mllib.linalg.Vector = [5.0,7.0,9.0]
WestCoastProjects
  • This seems to be a good solution, at least it worked for my purpose, but when we do `v1.toArray` we are collecting all the elements of `v1`, which could potentially cause problems when, for example, `v1` is huge and cannot fit in RAM! – Ehsan M. Kermani Jul 02 '15 at 05:35
1

This solution avoids putting code into Spark's packages and avoids converting sparse to dense vectors:

import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

def toBreeze(vector: Vector): breeze.linalg.Vector[scala.Double] = vector match {
  case sv: SparseVector => new breeze.linalg.SparseVector[Double](sv.indices, sv.values, sv.size)
  case dv: DenseVector => new breeze.linalg.DenseVector[Double](dv.values)
}
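The inverse direction can be sketched the same way, again without touching Spark's packages. The `fromBreeze` helper below is my own illustration, not part of the answer above (untested; assumes spark-mllib and Breeze on the classpath):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def fromBreeze(vector: breeze.linalg.Vector[Double]): Vector = vector match {
  // keep sparse vectors sparse: copy only the active (index, value) pairs
  case sv: breeze.linalg.SparseVector[Double] =>
    Vectors.sparse(sv.length, sv.activeIterator.toSeq)
  case dv: breeze.linalg.DenseVector[Double] =>
    Vectors.dense(dv.toArray)
}
```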
barclar
0

This is a method I wrote to convert an MLlib DenseMatrix to a Breeze matrix; maybe it helps!

import breeze.linalg.DenseMatrix
import org.apache.spark.mllib.linalg.Matrix

def toBreez(X: Matrix): DenseMatrix[Double] = {
  val m = DenseMatrix.zeros[Double](X.numRows, X.numCols)
  // copy element by element; both matrix types expose (row, col) access
  for (i <- 0 until X.numRows; j <- 0 until X.numCols) {
    m(i, j) = X(i, j)
  }
  m
}
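A quick sanity check of the converter might look like this (untested sketch; assumes spark-mllib and Breeze on the classpath):

```scala
import org.apache.spark.mllib.linalg.Matrices

// values are column-major: column 0 is (1.0, 2.0), column 1 is (3.0, 4.0)
val m = Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))
val bm = toBreez(m)
// bm(0, 1) should equal m(0, 1), i.e. 3.0, since both layouts are column-major
```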