
When I use `++` to combine a lot of RDDs, I get a stack overflow error.

Spark version: 1.3.1. Environment: yarn-client, --driver-memory 8G.

There are more than 4,000 RDDs; each one is read from a text file of about 1 GB.

The combined RDD is generated this way:

// files: a sequence of more than 4,000 input paths
val collection = (for (
  path <- files
) yield sc.textFile(path)).reduce(_ union _) // reduce unions the RDDs pairwise, one at a time

It works fine when `files` contains only a few paths. Here is the error:

The error repeats itself. I guess it is a recursive function that is called too many times?

 Exception at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
  .....
worldterminator

2 Answers


Use `SparkContext.union(...)` instead to union many RDDs at once.

You don't want to do it one at a time like that, since `RDD.union()` creates a new step in the lineage (an extra set of stack frames on any computation) for each RDD, whereas `SparkContext.union()` does it all at once. This ensures you won't get a stack-overflow error.
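
For example, a minimal sketch of the same job using `SparkContext.union`, assuming `sc` and `files` are the same as in the question:

// Read each input file into its own RDD, then union them all in one step.
// SparkContext.union builds a single UnionRDD over the whole list, so the
// lineage depth stays constant instead of growing by one level per file.
val rdds = files.map(path => sc.textFile(path))
val collection = sc.union(rdds)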

eliasah
Sean Owen
  • I totally agree, but I was just wondering if it ensures not getting a stack-overflow error? – eliasah May 31 '15 at 09:15
  • 1
    Yes, since `RDD.union()` creates a new step in the lineage (an extra set of stack frames on any computation) for each RDD, whereas `SparkContext.union()` makes it all at once. – Sean Owen May 31 '15 at 12:23
  • Thanks! I'll edit your answer adding the information you mentioned in your comment. I believe it completes the answer. – eliasah May 31 '15 at 12:26
  • Is there a similar method to join multiple data frames at once? I am facing a similar issue where I am joining one dataframe at a time instead of joining all at once. https://stackoverflow.com/questions/57952740/joining-huge-list-of-data-frames-causes-stack-overflow-error – Hitesh Sep 16 '19 at 09:54

It seems that unioning RDDs one by one can lead to a series of very long recursive function calls. In this case we need to increase the JVM stack size. In Spark, the option --driver-java-options "-Xss100M" configures the driver JVM stack size to 100 MB.
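
For example, a rough sketch of a submit command with that flag (the application jar name is a placeholder; the other options mirror the question's setup):

spark-submit \
  --master yarn-client \
  --driver-memory 8G \
  --driver-java-options "-Xss100M" \
  your-app.jar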

Sean Owen's solution also solves the problem in a more elegant way.

worldterminator