In Spark, it is possible to compose multiple RDDs into one, using zip, union, join, and so on.

Is it possible to decompose an RDD efficiently? Namely, without performing multiple passes on the original RDD? What I am looking for is something similar to:

val rdd: RDD[T] = ...     
val grouped: Map[K, RDD[T]] = rdd.specialGroupBy(...) 

One of the strengths of RDDs is that they enable performing iterative computations efficiently. In some (machine learning) use cases I encountered, we need to perform iterative algorithms on each of the groups separately.

The current possibilities I am aware of are:

  1. GroupBy: groupBy returns an RDD[(K, Iterable[T])], which does not give you the benefits of an RDD for the group itself (the Iterable).

  2. Aggregations: reduceByKey, foldByKey, etc. perform only one "pass" over the data and do not have the expressive power to implement iterative algorithms.

  3. Creating separate RDDs using the filter method, with multiple passes over the data (one pass per key, as sketched below), which is not feasible unless the number of keys is very small.
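
For concreteness, a minimal sketch of this filter-based decomposition, assuming the data is already keyed as an RDD[(K, V)] and the set of keys is known up front (the helper name is just for illustration):

import org.apache.spark.rdd.RDD

// One full scan of `rdd` per key, which is exactly why this breaks down as
// the number of keys grows. Cache the source RDD before calling this.
def decomposeByFilter[K, V](rdd: RDD[(K, V)], keys: Seq[K]): Map[K, RDD[(K, V)]] =
  keys.map(k => k -> rdd.filter { case (key, _) => key == k }).toMap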

Some of the use cases I am considering are, given a very large (tabular) dataset:

  1. We wish to execute some iterative algorithm on each of the columns separately, for example some automated feature extraction. A natural way to do so would be to decompose the dataset such that each column is represented by a separate RDD.

  2. We wish to decompose the dataset into disjoint datasets (for example a dataset per day) and execute some machine learning modeling on each of them.


1 Answer

I think the best option is to write out the data in a single pass to one file per key (see Write to multiple outputs by key Spark - one Spark job) and then load each per-key file into its own RDD.
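
A rough sketch of how that might look, assuming the data is keyed as an RDD[(String, String)] and using the MultipleTextOutputFormat technique described in the linked answer (the class name and output path are placeholders):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.rdd.RDD

// Routes each record to a directory named after its key, in a single pass.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/" + name
  // Write only the value; the key is already encoded in the path.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

val keyedRdd: RDD[(String, String)] = ??? // placeholder for your keyed data
keyedRdd.saveAsHadoopFile("/tmp/by-key", classOf[String], classOf[String],
  classOf[KeyBasedOutput])

// Each group can then be loaded back as its own RDD:
// val perKey = keys.map(k => k -> sc.textFile("/tmp/by-key/" + k)).toMap

Note that this writes everything as text, so typed records have to be serialized and parsed back on load; saveAsObjectFile / sc.objectFile is one way to keep the types instead.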

Daniel Darabos
  • Daniel, I tried the above approach. The problem with what you suggested is that the objects are written as strings, namely you lose the types. I wrote the following code (for local mode) which overcomes this problem: https://gist.github.com/MishaelRosenthal/108ebbbb7590c7d3104b But for some reason it is extremely slow. What I suspect is that for some reason it iterates over the whole data numerous times. – Mishael Rosenthal May 13 '15 at 13:17
  • No idea, sorry. Your code looks good to me. I haven't tried to do this in practice myself, so I don't know what performance to expect. Perhaps you can get an idea of what it's doing through the Spark UI (stages). – Daniel Darabos May 13 '15 at 14:21