I have been working on a Spark project for the last 3-4 months.
I am doing some calculations with a huge history file (800 GB) and a small incremental file (3 GB).
The calculation itself runs very fast in Spark using a HiveContext (hqlContext in the code below) and DataFrames, but when I try to write the calculated result as a Hive table in ORC format (almost 20 billion records, roughly 800 GB of data), the write takes too long (more than 2 hours) and finally fails.
My cluster details: 19 nodes, 1.41 TB total memory, 361 total VCores.
For tuning, I am using the following options at run time:
--num-executors 67
--executor-cores 6
--executor-memory 60g
--driver-memory 50g
--driver-cores 6
--master yarn-cluster
--total-executor-cores 100
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"
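For reference, the full submit command looks roughly like this (the class and jar names below are placeholders, not my real ones):

spark-submit \
  --master yarn-cluster \
  --num-executors 67 \
  --executor-cores 6 \
  --executor-memory 60g \
  --driver-memory 50g \
  --driver-cores 6 \
  --total-executor-cores 100 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  --class com.example.HistIncrJob \
  hist-incr-job.jar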
If I take a count of the result, it completes within 15 minutes, but when I try to write that same result to HDFS as a Hive table, I run into the failure described above.
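For clarity, these are the two actions I am comparing (same DataFrame, only the action differs):

UPDATED_RECORDS.count()                                             // completes in ~15 minutes
UPDATED_RECORDS.write.format("orc").saveAsTable("HIST_ORC_TARGET")  // runs 2+ hours, then fails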
Please give me any suggestions regarding this, as I have been stuck on it for the last couple of days.
Code outline:

// read the history (800 GB) and incremental (3 GB) ORC tables
val BASE_RDD_HIST = hqlContext.sql("select * from hist_orc")
val BASE_RDD_INCR = hqlContext.sql("select * from incr_orc")

// ... some Spark calculation using DataFrames, Hive queries & UDFs ...

// finally, write the result (~20 billion records) as an ORC Hive table
result.write.format("orc").saveAsTable("HIST_ORC_TARGET_TABLE")
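For completeness, the parallelism of the final write follows the partition count of result; a quick check (a minimal sketch against the same DataFrame) looks like this:

// each partition of the result becomes one write task / output file
val numParts = result.rdd.partitions.length
println(s"result has $numParts partitions")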