I am running parser jobs that parse JSON files and load their data into Hive tables. I am using Python (PySpark) to first create DataFrames, collect the data from the JSON files, and then bulk-load it into the Hive tables. There is no issue when I process 300 to 500 JSON files, which loads roughly 2 to 4 million records into the Hive tables in about 24 to 34 minutes. When we increase the count to 1000 JSON files, it starts taking 3 hours to process the files and load the data (~9 million records). As we increase the number of JSON files further, the system slows down dramatically, up to 22 hours for 8000 to 9000 files (maybe 84 million records in volume), but then the job fails with the following error:
: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 60561 for dhpxxxx) can't be found in cache
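For reference, this is roughly the shape of the job; it is a minimal sketch, and the file path, database/table names, and write mode are placeholders rather than my actual values:

    from pyspark.sql import SparkSession

    # Sketch only: paths and table names below are placeholders,
    # not the real ones used in process_multi_files.py.
    spark = (SparkSession.builder
             .appName("json_to_hive_loader")
             .enableHiveSupport()
             .getOrCreate())

    # Read the whole batch of JSON files into one DataFrame
    # (spark.read.json accepts a directory or a glob of files).
    df = spark.read.json("/data/incoming/json_batch/*.json")

    # Parsing / flattening of the JSON structure happens here.

    # Bulk-load the result into the target Hive table in one write.
    df.write.mode("append").insertInto("target_db.target_table")
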
Here are the different parameter values during execution:
--deploy-mode client --driver-memory 50g --conf spark.driver.maxResultSize=12g --executor-cores 4 --executor-memory 25g --num-executors 100
This is how I am submitting my code.
/xxx/xxx/current_loaction/spark2-client/bin/spark-submit --master yarn --deploy-mode client --driver-memory 50g --conf spark.driver.maxResultSize=12g --files /xxx/xxx/current_location/spark2-client/yyyy/hive-site.xml --executor-cores 4 --executor-memory 25g --num-executors 100 process_multi_files.py
Is there a way to increase performance by tuning the current parameters, keeping in mind that other users are also running their jobs on the Hadoop cluster? The total number of active nodes is 27, and the total memory is ~4.50 TB.