
I have an input file of size 260 GB and my Spark cluster's memory capacity is 140 GB. When I run my Spark job, will the excess 120 GB of data be spilled to disk by default, or should I use some storage level to specify it?
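
For reference, this is what I understand specifying a storage level explicitly would look like (a minimal sketch only, using the RDD `persist` API on `myRDD_1` from the code below; `MEMORY_AND_DISK` lets partitions that do not fit in memory spill to local disk):

    import org.apache.spark.storage.StorageLevel

    // Sketch only: keep partitions in memory where possible and spill the rest
    // to local disk instead of recomputing them.
    myRDD_1.persist(StorageLevel.MEMORY_AND_DISK)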

I have not tried any solutions to solve this issue.

    import org.apache.spark.{SparkConf, SparkContext}

    def main(args: Array[String]): Unit = {
      val conf: SparkConf = new SparkConf().setAppName("optimize_1").setMaster("local")
      val sc: SparkContext = new SparkContext(conf)

      val myRDD = sc.parallelize(List(
        ("1", "abc", "Request"), ("1", "cba", "Response"),
        ("2", "def", "Request"), ("2", "fed", "Response"),
        ("3", "ghi", "Request"), ("3", "ihg", "Response")))

      val myRDD_1 = sc.parallelize(List(
        ("1", "abc"), ("1", "cba"), ("2", "def"),
        ("2", "fed"), ("3", "ghi"), ("3", "ihg")))

      // group by the first tuple element via groupBy
      myRDD_1.map(x => x).groupBy(_._1).take(10).foreach(println)

      // group the (key, value) pairs via groupByKey
      myRDD_1.groupByKey().foreach(println)
    }

Below is the expected and working output for small data:

    (2,CompactBuffer(def, fed))
    (3,CompactBuffer(ghi, ihg))
    (1,CompactBuffer(abc, cba))

But when I apply it to the large input, I receive the following error:

"Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@DOSSPOCVM1:33303 --executor-id 8 --hostname DOSSPOCVM1 --cores 1 --app-id application_1555417914353_0069 --user-class-path file:$PWD/app.jar 1>/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009/stdout 2>/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009/stderr""

ERROR YarnClusterScheduler: Lost executor 17 on DOSSPOCVM2: Container marked as failed: container_e05_1555417914353_0069_02_000019 on host: DOSSPOCVM2. Exit status: -100. Diagnostics: Container released on a lost node

Please suggest a way to resolve this issue.

  • [Avoid GroupByKey](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html) – user10938362 Jun 11 '19 at 12:17
  • @user10938362 Could you please guide me with sample code on how to perform the above with reduceByKey or aggregateByKey? I don't know how to code the same logic using reduceByKey and am failing at that part (a sketch is added after these comments). – BalaKumar Jun 11 '19 at 13:08
  • That's more [about the overall logic, than the specific method you use](https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey). – user10938362 Jun 11 '19 at 13:31
  • Try to repartition your data to ensure that Spark works with smaller data chunks. In your case that should be `260GB / 500MB ≈ 520` partitions, where 500MB is the ideal partition size for Spark (see the repartition sketch after these comments). – abiratsis Jun 11 '19 at 17:01
  • @AlexandrosBiratsis I tried the repartition command but I am facing OOM issues again: `Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009 -XX:OnOutOfMemoryError='kill %p'` – BalaKumar Jun 13 '19 at 09:46
  • Often an RDD will be larger than memory capacity. Did you solve it? – thebluephantom Jun 30 '19 at 08:25
  • @thebluephantom - yes, this problem is solved. Thank you. – BalaKumar Jul 02 '19 at 09:39
  • It may be an idea to answer your own question. I gave an answer only to get a -1; I am interested in the lost-node aspect. – thebluephantom Jul 02 '19 at 10:07
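
Following up on the reduceByKey / aggregateByKey comments above, here is a minimal sketch of how the same grouping could be written with `aggregateByKey` (the object name, app name and `List[String]` buffer are my own choices, not from the original post; note that a plain list buffer still carries every value per key, so the real benefit comes only when the combine step actually shrinks the data, e.g. a count or a reduced value via `reduceByKey`):

    import org.apache.spark.{SparkConf, SparkContext}

    object AggregateByKeySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("aggregate_sketch").setMaster("local[*]"))

        val pairs = sc.parallelize(List(
          ("1", "abc"), ("1", "cba"),
          ("2", "def"), ("2", "fed"),
          ("3", "ghi"), ("3", "ihg")))

        // aggregateByKey builds a per-partition buffer first (seqOp)
        // and then merges the buffers across partitions (combOp).
        val grouped = pairs.aggregateByKey(List.empty[String])(
          (buf, value) => value :: buf,   // seqOp: add a value to the partition-local buffer
          (b1, b2) => b1 ::: b2)          // combOp: merge buffers from different partitions

        grouped.take(10).foreach(println)
        sc.stop()
      }
    }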
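
And a sketch of the repartition suggestion, assuming the real 260 GB input is read from a file with the same `SparkContext` as above; the path is a placeholder and 520 simply follows the `260GB / 500MB ≈ 520` arithmetic from the comment:

    // Hypothetical input path; 520 partitions ≈ 260 GB / ~500 MB per the comment above.
    val bigInput = sc.textFile("hdfs:///path/to/input")
    val repartitioned = bigInput.repartition(520)
    println(s"partitions: ${repartitioned.getNumPartitions}")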

0 Answers