
I have an input file of size 260 GB and my Spark cluster's memory capacity is 140 GB. When I run my Spark job, will the excess 120 GB of data be spilled to disk by default, or should I use some storage level to specify it?
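
For reference, this is what I understand specifying a storage level explicitly would look like (a minimal sketch only, using the RDD `persist` API on `myRDD_1` from the code below; `MEMORY_AND_DISK` lets partitions that do not fit in memory spill to local disk):

    import org.apache.spark.storage.StorageLevel

    // Sketch only: keep partitions in memory where possible and spill the rest
    // to local disk instead of recomputing them.
    myRDD_1.persist(StorageLevel.MEMORY_AND_DISK)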

I have not tried any solutions to solve this issue.

    import org.apache.spark.{SparkConf, SparkContext}

    def main(args: Array[String]): Unit = {
      val conf: SparkConf = new SparkConf().setAppName("optimize_1").setMaster("local")
      val sc: SparkContext = new SparkContext(conf)

      val myRDD = sc.parallelize(List(
        ("1", "abc", "Request"), ("1", "cba", "Response"),
        ("2", "def", "Request"), ("2", "fed", "Response"),
        ("3", "ghi", "Request"), ("3", "ihg", "Response")))

      val myRDD_1 = sc.parallelize(List(
        ("1", "abc"), ("1", "cba"), ("2", "def"),
        ("2", "fed"), ("3", "ghi"), ("3", "ihg")))

      // group by the first tuple element via groupBy
      myRDD_1.map(x => x).groupBy(_._1).take(10).foreach(println)

      // group the (key, value) pairs via groupByKey
      myRDD_1.groupByKey().foreach(println)
    }

Below is the expected and working output for small data:

    (2,CompactBuffer(def, fed))
    (3,CompactBuffer(ghi, ihg))
    (1,CompactBuffer(abc, cba))

But when I apply it to the large input, I receive the following error:

"Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@DOSSPOCVM1:33303 --executor-id 8 --hostname DOSSPOCVM1 --cores 1 --app-id application_1555417914353_0069 --user-class-path file:$PWD/app.jar 1>/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009/stdout 2>/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009/stderr""

ERROR YarnClusterScheduler: Lost executor 17 on DOSSPOCVM2: Container marked as failed: container_e05_1555417914353_0069_02_000019 on host: DOSSPOCVM2. Exit status: -100. Diagnostics: Container released on a lost node

Please suggest a way to resolve this issue.

  • [Avoid GroupByKey](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html) – user10938362 Jun 11 '19 at 12:17
  • @user10938362 Could you please guide me with sample code on how to perform the above with reduceByKey or aggregateByKey? I don't know how to code the same logic using reduceByKey and am failing at that part (a sketch is added after these comments). – BalaKumar Jun 11 '19 at 13:08
  • That's more [about the overall logic, than the specific method you use](https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey). – user10938362 Jun 11 '19 at 13:31
  • Try to repartition your data to ensure that Spark works with smaller data chunks. In your case that should be `260GB / 500MB ≈ 520` partitions, where 500MB is the ideal partition size for Spark (see the repartition sketch after these comments). – abiratsis Jun 11 '19 at 17:01
  • @AlexandrosBiratsis I tried the repartition command but I am facing OOM issues again: `Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009 -XX:OnOutOfMemoryError='kill %p'` – BalaKumar Jun 13 '19 at 09:46
  • Often an RDD will be larger than memory capacity. Did you solve it? – thebluephantom Jun 30 '19 at 08:25
  • @thebluephantom - yes, this problem is solved. Thank you. – BalaKumar Jul 02 '19 at 09:39
  • It may be an idea to answer your own question. I gave an answer only to get a -1; I am interested in the lost-node aspect. – thebluephantom Jul 02 '19 at 10:07
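
Following up on the reduceByKey / aggregateByKey comments above, here is a minimal sketch of how the same grouping could be written with `aggregateByKey` (the object name, app name and `List[String]` buffer are my own choices, not from the original post; note that a plain list buffer still carries every value per key, so the real benefit comes only when the combine step actually shrinks the data, e.g. a count or a reduced value via `reduceByKey`):

    import org.apache.spark.{SparkConf, SparkContext}

    object AggregateByKeySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("aggregate_sketch").setMaster("local[*]"))

        val pairs = sc.parallelize(List(
          ("1", "abc"), ("1", "cba"),
          ("2", "def"), ("2", "fed"),
          ("3", "ghi"), ("3", "ihg")))

        // aggregateByKey builds a per-partition buffer first (seqOp)
        // and then merges the buffers across partitions (combOp).
        val grouped = pairs.aggregateByKey(List.empty[String])(
          (buf, value) => value :: buf,   // seqOp: add a value to the partition-local buffer
          (b1, b2) => b1 ::: b2)          // combOp: merge buffers from different partitions

        grouped.take(10).foreach(println)
        sc.stop()
      }
    }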
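
And a sketch of the repartition suggestion, assuming the real 260 GB input is read from a file with the same `SparkContext` as above; the path is a placeholder and 520 simply follows the `260GB / 500MB ≈ 520` arithmetic from the comment:

    // Hypothetical input path; 520 partitions ≈ 260 GB / ~500 MB per the comment above.
    val bigInput = sc.textFile("hdfs:///path/to/input")
    val repartitioned = bigInput.repartition(520)
    println(s"partitions: ${repartitioned.getNumPartitions}")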

0 Answers