Issues Aggregating Spark Datasets in Scala

Question

I am computing a series of Dataset aggregations using Scala's /: operator. The code for the aggregations is listed below:

def execute1(
    xy: DATASET,
    f: Double => Double): Double = {

  println("PRINTING: The data points being evaluated: " + xy)
  println("PRINTING: Running execute1")

  // Drop points whose y is too close to zero to divide by
  val z = xy.filter { case (x, y) => abs(y) > EPS }

  // Fold the remaining points into a single sum, then negate it
  val ret = - z./:(0.0) { case (s, (x, y)) =>
    val px = f(x)
    s + px * log(px / y)
  }

  ret
}

My issue occurs when I try running the block for a list of separate functions which are passed in as the f parameter. The list of functions is:

lazy val pdfs = Map[Int, Double => Double](
  1 -> betaScaled,
  2 -> gammaScaled,
  3 -> logNormal,
  4 -> uniform,
  5 -> chiSquaredScaled
)

The executor function that runs the aggregations through the list is:

def execute2(
    xy: DATASET,
    fs: Iterable[Double => Double]): Iterable[Double] = {
  fs.map(execute1(xy, _))
}

With the final execution block:

val kl_rdd = master_ds.mapPartitions((it: DATASET) => {
  val pdfsList = pdfs_broadcast.value.map(
    n => pdfs.get(n).get
  )

  execute2(it, pdfsList).iterator
})

The problem is, while the aggregations do occur, they seem to all aggregate in the first slot of the output array, when I would like the aggregation for each function to be displayed separately. I ran tests to confirm that all five functions are actually being run, and that they are being summed in the first slot.

The pre-divergence value: -4.999635700491883
The pre-divergence value: -0.0
The pre-divergence value: -0.0
The pre-divergence value: -0.0
The pre-divergence value: -0.0

This is one of the hardest problems I've ever run into, so any direction would be GREATLY appreciated. Will give credit where it's due. Thanks!

dk14 · Accepted Answer · 2017-05-27T05:32:27.290

Spark's Dataset doesn't have foldLeft (aka /:): https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.Dataset. It also requires a type parameter (Dataset[T]), and its name is not written in all capitals.

So I suppose your DATASET type is an alias for an iterator. An iterator is drained by the first run of execute1, so every subsequent execute1 call receives an empty iterator. In other words, the results are not being summed into the first slot: only the first function sees any data, and the others fold over nothing (you get -0.0 because the fold returns its initial value 0.0, which execute1 then negates).

As you can see from mapPartitions signature:

def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]

it gives you an iterator (a mutable structure that can be traversed only once), so you should call it.toList to get an immutable structure (List) that can be traversed repeatedly. Keep in mind that this holds the whole partition in memory, so the partition has to fit.
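A minimal sketch of that fix in plain Scala, without Spark, might look like the following. It assumes the partition elements are (Double, Double) pairs; the names DrainedIteratorSketch, divergence, divergences and the value of Eps are made up for illustration, since the question does not show EPS or the DATASET alias.

```scala
import scala.math.{abs, log}

object DrainedIteratorSketch {
  // Assumed threshold; the question does not show how EPS is defined.
  val Eps = 1e-12

  // Same computation as execute1, but over a Seq that can be traversed repeatedly.
  def divergence(xy: Seq[(Double, Double)], f: Double => Double): Double = {
    val z = xy.filter { case (_, y) => abs(y) > Eps }
    val sum = z.foldLeft(0.0) { case (s, (x, y)) =>
      val px = f(x)
      s + px * log(px / y)
    }
    -sum
  }

  // Materialize the iterator once, then evaluate every function against the same list.
  def divergences(
      it: Iterator[(Double, Double)],
      fs: Iterable[Double => Double]): Iterable[Double] = {
    val data = it.toList
    fs.map(f => divergence(data, f))
  }
}
```

Inside mapPartitions, the body would then end with something like DrainedIteratorSketch.divergences(it, pdfsList).iterator, so each function gets its own non-empty pass over the data.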

P.S. If you really want to work with Spark's Dataset/RDD, use aggregate (RDD) or agg (Dataset). See also: foldLeft or foldRight equivalent in Spark?
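RDD.aggregate takes a zero value, a seqOp that folds one element into an accumulator, and a combOp that merges two accumulators. As a rough illustration of that shape in plain Scala (no Spark involved), the sketch below keeps one running sum per function and updates all of them in a single pass, so the iterator never has to be materialized. SinglePassSketch, sums and combine are hypothetical names, and the per-element term is simplified to f(x) rather than the full expression from the question.

```scala
object SinglePassSketch {
  // Plays the role of seqOp applied across one partition:
  // one accumulator per function, all updated in a single traversal.
  def sums(it: Iterator[Double], fs: Vector[Double => Double]): Vector[Double] =
    it.foldLeft(Vector.fill(fs.size)(0.0)) { (acc, x) =>
      acc.zip(fs).map { case (s, f) => s + f(x) }
    }

  // Plays the role of combOp: merges the accumulators of two partitions.
  def combine(a: Vector[Double], b: Vector[Double]): Vector[Double] =
    a.zip(b).map { case (l, r) => l + r }
}
```

Because the iterator is consumed exactly once, this variant avoids both the drained-iterator problem and the memory cost of toList.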

Explanation about iterators:

scala> val it = List(1,2,3).toIterator
it: Iterator[Int] = non-empty iterator

scala> it.toList //traverse iterator and accumulate its data into List
res0: List[Int] = List(1, 2, 3)

scala> it.toList //iterator is drained, so second call doesn't traverse anything
res1: List[Int] = List()
