I have an input file of 260 GB and my Spark cluster has a memory capacity of 140 GB. When I run my Spark job, will the excess ~120 GB of data be spilled to disk by default, or do I need to specify a storage level explicitly?
I have not tried any solution for this yet.
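To make the question concrete, this is the kind of storage-level call I am asking about. It is only a sketch (the HDFS path and the bigInput name are placeholders for my real 260 GB input, and I have not run it); MEMORY_AND_DISK is documented to keep the partitions that fit in memory and spill the remainder to disk.

import org.apache.spark.storage.StorageLevel

// Placeholder path standing in for the actual 260 GB input file
val bigInput = sc.textFile("hdfs:///data/input_260gb")
// Explicit storage level: keep what fits in memory, spill the rest to disk
bigInput.persist(StorageLevel.MEMORY_AND_DISK)
println(bigInput.count())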
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val conf: SparkConf = new SparkConf().setAppName("optimize_1").setMaster("local")
  val sc: SparkContext = new SparkContext(conf)
  // Sample (id, payload, type) triples; not used by the grouping below
  val myRDD = sc.parallelize(List(("1", "abc", "Request"), ("1", "cba", "Response"), ("2", "def", "Request"), ("2", "fed", "Response"), ("3", "ghi", "Request"), ("3", "ihg", "Response")))
  // Sample (key, value) pairs to be grouped by key
  val myRDD_1 = sc.parallelize(List(("1", "abc"), ("1", "cba"), ("2", "def"), ("2", "fed"), ("3", "ghi"), ("3", "ihg")))
  // Group by the first tuple element and print the first 10 groups
  myRDD_1.map(x => x).groupBy(_._1).take(10).foreach(println)
  // Same grouping via the pair-RDD API
  myRDD_1.groupByKey().foreach(println)
}
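For reference, the same grouping can also be written with aggregateByKey, which combines values per partition before the shuffle instead of sending every value across the network the way groupByKey does. This is only a sketch based on the toy example above; I have not run it on the large input.

// Sketch: same grouping as above via aggregateByKey (map-side combining);
// not something I have run on the 260 GB input
val grouped = myRDD_1.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,   // add a value to the per-partition list
  (a, b) => a ::: b       // merge per-partition lists after the shuffle
)
grouped.foreach(println)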
Below is the expected output, which works correctly for the small data:
(2,CompactBuffer(def, fed))
(3,CompactBuffer(ghi, ihg))
(1,CompactBuffer(abc, cba))
But when I run it on the large input, I get the following in the YARN container logs:
"Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@DOSSPOCVM1:33303 --executor-id 8 --hostname DOSSPOCVM1 --cores 1 --app-id application_1555417914353_0069 --user-class-path file:$PWD/app.jar 1>/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009/stdout 2>/hadoop/yarn/log/application_1555417914353_0069/container_e05_1555417914353_0069_02_000009/stderr""
ERROR YarnClusterScheduler: Lost executor 17 on DOSSPOCVM2: Container marked as failed: container_e05_1555417914353_0069_02_000019 on host: DOSSPOCVM2. Exit status: -100. Diagnostics: Container released on a lost node
Please suggest a way to resolve this issue.