
This is probably a basic question; I'm pretty new to Spark / Scala.

So I have a variable of type `Map[String, RDD[Int]]`. I can't iterate through this variable with a `for` loop and do anything with the RDDs inside it: any action or transformation I invoke within the loop throws an error.

I thought that the variable itself wasn't an RDD, so iterating over the Map with a simple `for` loop wouldn't count as a transformation, which is why I'm confused. Here is what the code looks like:

def trendingSets(pairRDD: RDD[(String, Int)]): Map[String, RDD[Int]] = {
  pairRDD
    .groupByKey()
    .mapValues(v => { this.sc.parallelize(v.toList) })
    .take(20)
    .toMap
}

def main(args: Array[String]) {

    val sets = this.trendingSets(pairRDD)

    // inside this loop, no transformations or actions work.
    for((tag, rdd) <- sets) {
       // For instance this fails:
       // val x = rdd.collect()
    }
}

The error is

Exception in thread "main" org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases: 
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.

Any help would be appreciated. Thanks!

OguzGelal
  • @user8371915 Yes I did but I don't understand why it is the case. I didn't think the outer loop was a RDD, so not sure why its saying it is "nested rdds" – OguzGelal Jan 21 '18 at 16:27
  • Because `pairRDD.groupByKey().mapValues(v => { this.sc.parallelize(v.toList) })` would be of type `RDD[(String, RDD[Int])]`, which is not allowed [SPARK-718](https://issues.apache.org/jira/browse/SPARK-718). Does it make sense now? – Alper t. Turker Jan 21 '18 at 16:28
  • @user8371915 Oh, that's where the problem is? – OguzGelal Jan 21 '18 at 16:32
  • It is. And I guess the real question is [How do I split an RDD into two or more RDDs?](https://stackoverflow.com/q/32970709/8371915), but in general if `.groupByKey()` is feasible then `parallelize` doesn't make that much sense. – Alper t. Turker Jan 21 '18 at 16:32
  • @user8371915 So I guess I came up with another question that actually solves this problem for me: https://stackoverflow.com/questions/48402479/how-can-i-group-pairrdd-by-keys-and-turn-the-values-into-rdd – OguzGelal Jan 23 '18 at 13:09
  • That one will throw the same exception as this code. – Alper t. Turker Jan 23 '18 at 14:16
  • @user8371915 Yeah it did – OguzGelal Jan 23 '18 at 14:30
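For reference, the approach the comments point toward — group once and keep plain Scala collections as the values, instead of re-wrapping each group in an RDD with `parallelize` — can be modeled locally. The sketch below uses ordinary Scala collections in place of Spark (assuming, as `take(20)` already implies, that the grouped result fits on the driver); the `GroupLocally` object and its shape are illustrative, not from the question:

```scala
// Local model of the nested-RDD-free approach: after grouping,
// keep the values as plain Seq[Int] (as collecting to the driver
// would give you) instead of calling sc.parallelize per group,
// which is what produces the forbidden RDD[(String, RDD[Int])].
object GroupLocally {
  def trendingSets(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs
      .groupBy(_._1)                               // analogous to groupByKey()
      .map { case (k, kvs) => k -> kvs.map(_._2) } // keep only the Int values
      .take(20)                                    // first 20 groups, as in the question

  def main(args: Array[String]): Unit = {
    val sets = trendingSets(Seq(("scala", 1), ("scala", 2), ("spark", 3)))
    // Iterating and operating on the values now works: nothing is nested.
    for ((tag, values) <- sets)
      println(s"$tag -> ${values.sum}")
  }
}
```

In actual Spark, the equivalent would be to collect the grouped result to the driver (so the map's values are plain collections), or to build one RDD per key with `filter` as the linked question discusses — rather than calling `parallelize` inside `mapValues`.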

0 Answers