
I am running parser jobs that parse JSON files and load their data into Hive tables. I am using Python (PySpark) to first create DataFrames, collect the data from the JSON files, and then bulk load it into Hive tables. There is no issue when I process 300 to 500 JSON files, which loads roughly 2 to 4 million records into Hive tables in about 24 to 34 minutes. When we increase the count to 1000 JSON files, the job starts taking 3 hours to process the files and load the data (~9 million records) into Hive tables. As we increase the number of JSON files further, the system slows down dramatically, up to 22 hours for 8000 to 9000 files (maybe 84 million records in volume) ... but the job fails with this error:

: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 60561 for dhpxxxx) can't be found in cache

Here are the different parameter values during execution.

--deploy-mode client --driver-memory 50g --conf spark.driver.maxResultSize=12g --executor-cores 4 --executor-memory 25g --num-executors 100

This is how I am submitting my code.

/xxx/xxx/current_loaction/spark2-client/bin/spark-submit --master yarn --deploy-mode client --driver-memory 50g --conf spark.driver.maxResultSize=12g --files /xxx/xxx/current_location/spark2-client/yyyy/hive-site.xml --executor-cores 4 --executor-memory 25g --num-executors 100 process_multi_files.py
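
For reference, the core of process_multi_files.py follows roughly this pattern (a simplified sketch only; the input path and table name below are placeholders, not the actual job code):

```python
# Simplified sketch of the load pattern (placeholder paths and table names)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("process_multi_files")
         .enableHiveSupport()   # needed to write into Hive tables
         .getOrCreate())

# Read all JSON files under the input directory into one DataFrame
df = spark.read.json("/xxx/input/json_dir/")

# Bulk load the parsed records into the target Hive table
df.write.mode("append").insertInto("target_db.target_table")
```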

Is there a way to tune these parameters to improve performance, keeping in mind that other users are also running their jobs on the Hadoop cluster? The total number of active nodes is 27, and total memory is ~4.50 TB.

S M

1 Answer


Since the performance seems directly correlated with the number of files, I would try to reduce the number of JSON files by introducing a pre-processing step that merges them before they are read and transformed into Hive tables. Compressing the files to gz might also help (How to read gz compressed file by pyspark). A rough sketch of this idea is shown below.
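
For example, a pre-processing job along these lines could coalesce the many small JSON files into a handful of larger, gzipped files before the main Hive load (a minimal sketch; the paths and the target file count are illustrative, not taken from the question):

```python
# Sketch: merge many small JSON files into fewer, larger gzip-compressed files
# before the main Hive load. Paths and the coalesce count are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("merge_small_json")
         .getOrCreate())

# Spark reads .gz-compressed JSON transparently, so gzipped input needs no extra options
df = spark.read.json("/xxx/input/json_dir/")

# Rewrite the data as a small number of larger, gzip-compressed files
(df.coalesce(50)
   .write.mode("overwrite")
   .option("compression", "gzip")
   .json("/xxx/staging/merged_json/"))
```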

Dannie