
I am running parser jobs that parse JSON files and load their data into Hive tables. I am using Python (PySpark) to first create DataFrames, collect the data from the JSON files, and then bulk load it into Hive tables. There is no issue when I process 300 to 500 JSON files, which loads roughly 2 to 4 million records into Hive tables in about 24 to 34 minutes. When we increase the count to 1000 JSON files, the job starts taking 3 hours to process the files and load the data (~9 million records) into Hive tables. As we increase the number of JSON files further, the system slows down dramatically, up to 22 hours for 8000 to 9000 files (maybe 84 million records in volume) ... but the job fails with this error:

: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 60561 for dhpxxxx) can't be found in cache

Here are the different parameter values during execution.

--deploy-mode client --driver-memory 50g --conf spark.driver.maxResultSize=12g --executor-cores 4 --executor-memory 25g --num-executors 100

This is how I am submitting my code.

/xxx/xxx/current_loaction/spark2-client/bin/spark-submit --master yarn --deploy-mode client --driver-memory 50g --conf spark.driver.maxResultSize=12g --files /xxx/xxx/current_location/spark2-client/yyyy/hive-site.xml --executor-cores 4 --executor-memory 25g --num-executors 100 process_multi_files.py
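
For reference, the core of process_multi_files.py follows roughly this pattern (a simplified sketch only; the input path and table name below are placeholders, not the actual job code):

```python
# Simplified sketch of the load pattern (placeholder paths and table names)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("process_multi_files")
         .enableHiveSupport()   # needed to write into Hive tables
         .getOrCreate())

# Read all JSON files under the input directory into one DataFrame
df = spark.read.json("/xxx/input/json_dir/")

# Bulk load the parsed records into the target Hive table
df.write.mode("append").insertInto("target_db.target_table")
```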

Is there a way to tune these parameters to improve performance, keeping in mind that other users are also running their jobs on the Hadoop cluster? The total number of active nodes is 27, and total memory is ~4.50 TB.

S M

1 Answer


Since the performance seems directly correlated with the number of files, I would try to reduce the number of JSON files by introducing a pre-processing step that merges them before they are read and transformed into Hive tables. Compressing the files to gz might also help (How to read gz compressed file by pyspark). A rough sketch of this idea is shown below.
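
For example, a pre-processing job along these lines could coalesce the many small JSON files into a handful of larger, gzipped files before the main Hive load (a minimal sketch; the paths and the target file count are illustrative, not taken from the question):

```python
# Sketch: merge many small JSON files into fewer, larger gzip-compressed files
# before the main Hive load. Paths and the coalesce count are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("merge_small_json")
         .getOrCreate())

# Spark reads .gz-compressed JSON transparently, so gzipped input needs no extra options
df = spark.read.json("/xxx/input/json_dir/")

# Rewrite the data as a small number of larger, gzip-compressed files
(df.coalesce(50)
   .write.mode("overwrite")
   .option("compression", "gzip")
   .json("/xxx/staging/merged_json/"))
```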

Dannie