We are running Spark jobs in Dataproc. Sometimes a few jobs get stuck and never complete, and we have to kill them manually from Dataproc.
We suspect that repartition(50)
could be the problem, as we are using dynamic resource allocation for the jobs.
We have defined the following configuration for Spark:
'spark.dynamicAllocation.minExecutors': '5',
'spark.dynamicAllocation.enabled': 'true',
'spark.cleaner.referenceTracking.cleanCheckpoints': 'true',
'spark.rdd.compress': 'true',
'spark.locality.wait': '0s',
'spark.executor.cores': '1',
'spark.executor.memory': '3g',
'spark.eventLog.compress': 'true',
'spark.history.fs.cleaner.enabled': 'true',
'spark.history.fs.cleaner.maxAge': '1h',
'spark.history.fs.cleaner.interval': '5m',
'spark.driver.extraClassPath': '$SPARK_HOME/jars/kafka-clients-1.1.0.jar'
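For reference, here is a minimal sketch of how the same properties could be applied on a SparkSession builder. In our pipeline they are actually passed as Dataproc job properties; the loop and the app name below are only illustrative.

from pyspark.sql import SparkSession

# Illustrative only: applying the same key/value properties directly on a
# SparkSession builder instead of passing them as Dataproc job properties.
properties = {
    'spark.dynamicAllocation.enabled': 'true',
    'spark.dynamicAllocation.minExecutors': '5',
    'spark.executor.cores': '1',
    'spark.executor.memory': '3g',
    'spark.locality.wait': '0s',
    # ...remaining properties from the list above
}

builder = SparkSession.builder.appName('example-job')
for key, value in properties.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()
sc = spark.sparkContext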
self.create_rdd("/path/to/file.txt.gz") \
    .repartition(50) \
    .mapPartitions(partial()) \
    .groupBy(itemgetter('id'), numPartitions=NUM_PARTITIONS) \
    .mapPartitions(partial()) \
    .count()
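For context, below is a self-contained sketch of the same chain. It assumes an existing SparkContext sc, treats create_rdd as a thin wrapper around SparkContext.textFile, and uses hypothetical placeholder callables (parse_partition, aggregate_partition) in place of the real partial(...) arguments, which are omitted above.

from operator import itemgetter

NUM_PARTITIONS = 50  # assumption: mirrors the repartition factor

def parse_partition(lines):
    # Hypothetical per-partition parser: one dict per input line.
    for line in lines:
        yield {'id': line.split(',', 1)[0], 'raw': line}

def aggregate_partition(groups):
    # Hypothetical per-partition aggregation over each (id, records) group.
    for key, records in groups:
        yield key, sum(1 for _ in records)

rdd = sc.textFile("/path/to/file.txt.gz")  # assumed equivalent of create_rdd
result = (rdd
          .repartition(50)
          .mapPartitions(parse_partition)
          .groupBy(itemgetter('id'), numPartitions=NUM_PARTITIONS)
          .mapPartitions(aggregate_partition)
          .count())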