We are running Spark jobs in Dataproc. Sometimes a few jobs get stuck and never complete, and we have to kill them manually from Dataproc.
We suspect that repartition(50)
could be the problem, as we are using dynamic resource allocation for the jobs.
We have defined the following configuration for Spark:
'spark.dynamicAllocation.minExecutors': '5',
'spark.dynamicAllocation.enabled': 'true',
'spark.cleaner.referenceTracking.cleanCheckpoints': 'true',
'spark.rdd.compress': 'true',
'spark.locality.wait': '0s',
'spark.executor.cores': '1',
'spark.executor.memory': '3g',
'spark.eventLog.compress': 'true',
'spark.history.fs.cleaner.enabled': 'true',
'spark.history.fs.cleaner.maxAge': '1h',
'spark.history.fs.cleaner.interval': '5m',
'spark.driver.extraClassPath': '$SPARK_HOME/jars/kafka-clients-1.1.0.jar'
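For reference, here is a minimal sketch of how the same properties could be applied on a SparkSession builder. In our pipeline they are actually passed as Dataproc job properties; the loop and the app name below are only illustrative.

from pyspark.sql import SparkSession

# Illustrative only: applying the same key/value properties directly on a
# SparkSession builder instead of passing them as Dataproc job properties.
properties = {
    'spark.dynamicAllocation.enabled': 'true',
    'spark.dynamicAllocation.minExecutors': '5',
    'spark.executor.cores': '1',
    'spark.executor.memory': '3g',
    'spark.locality.wait': '0s',
    # ...remaining properties from the list above
}

builder = SparkSession.builder.appName('example-job')
for key, value in properties.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()
sc = spark.sparkContext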
self.create_rdd("/path/to/file.txt.gz") \
    .repartition(50) \
    .mapPartitions(partial()) \
    .groupBy(itemgetter('id'), numPartitions=NUM_PARTITIONS) \
    .mapPartitions(partial()) \
    .count()
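For context, below is a self-contained sketch of the same chain. It assumes an existing SparkContext sc, treats create_rdd as a thin wrapper around SparkContext.textFile, and uses hypothetical placeholder callables (parse_partition, aggregate_partition) in place of the real partial(...) arguments, which are omitted above.

from operator import itemgetter

NUM_PARTITIONS = 50  # assumption: mirrors the repartition factor

def parse_partition(lines):
    # Hypothetical per-partition parser: one dict per input line.
    for line in lines:
        yield {'id': line.split(',', 1)[0], 'raw': line}

def aggregate_partition(groups):
    # Hypothetical per-partition aggregation over each (id, records) group.
    for key, records in groups:
        yield key, sum(1 for _ in records)

rdd = sc.textFile("/path/to/file.txt.gz")  # assumed equivalent of create_rdd
result = (rdd
          .repartition(50)
          .mapPartitions(parse_partition)
          .groupBy(itemgetter('id'), numPartitions=NUM_PARTITIONS)
          .mapPartitions(aggregate_partition)
          .count())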