I have been working on a Spark project for the last 3-4 months.
I am doing some calculations with a huge history file (800 GB) and a small incremental file (3 GB).
The calculation itself runs very fast in Spark using a HiveContext (hqlContext in the code below) and DataFrames, but when I try to write the calculated result as a Hive table in ORC format (almost 20 billion records, roughly 800 GB of data), the write takes too long (more than 2 hours) and finally fails.
My cluster details: 19 nodes, 1.41 TB total memory, 361 total VCores.
For tuning, I am using the following options at run time:
--num-executors 67
--executor-cores 6
--executor-memory 60g
--driver-memory 50g
--driver-cores 6
--master yarn-cluster
--total-executor-cores 100
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"
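For reference, the full submit command looks roughly like this (the class and jar names below are placeholders, not my real ones):

spark-submit \
  --master yarn-cluster \
  --num-executors 67 \
  --executor-cores 6 \
  --executor-memory 60g \
  --driver-memory 50g \
  --driver-cores 6 \
  --total-executor-cores 100 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  --class com.example.HistIncrJob \
  hist-incr-job.jar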
If I take a count of the result, it completes within 15 minutes, but when I try to write that same result to HDFS as a Hive table, I run into the failure described above.
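For clarity, these are the two actions I am comparing (same DataFrame, only the action differs):

UPDATED_RECORDS.count()                                             // completes in ~15 minutes
UPDATED_RECORDS.write.format("orc").saveAsTable("HIST_ORC_TARGET")  // runs 2+ hours, then fails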
Please give me any suggestions regarding this, as I have been stuck on it for the last couple of days.
Code outline:

// read the history (800 GB) and incremental (3 GB) ORC tables
val BASE_RDD_HIST = hqlContext.sql("select * from hist_orc")
val BASE_RDD_INCR = hqlContext.sql("select * from incr_orc")

// ... some Spark calculation using DataFrames, Hive queries & UDFs ...

// finally, write the result (~20 billion records) as an ORC Hive table
result.write.format("orc").saveAsTable("HIST_ORC_TARGET_TABLE")
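For completeness, the parallelism of the final write follows the partition count of result; a quick check (a minimal sketch against the same DataFrame) looks like this:

// each partition of the result becomes one write task / output file
val numParts = result.rdd.partitions.length
println(s"result has $numParts partitions")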