I have run the following PySpark code:
from pyspark import SparkContext
sc = SparkContext()
data = sc.textFile('gs://bucket-name/input_blob_path')
sorted_data = data.sortBy(lambda x: sort_criteria(x))
sorted_data.saveAsTextFile(
    'gs://bucket-name/output_blob_path',
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)
The job finished successfully. However, during execution Spark created many temporary blobs under gs://bucket-name/output_blob_path/_temporary/0/. I realised that deleting all these temporary blobs at the end took half of the job's execution time, and CPU utilisation sat at 1% during that period (a huge waste of resources).
Is there a way to store the temporary files on local disk (or HDFS) instead of Google Cloud Storage? I would still like to persist the final result (the sorted dataset) to GCS.
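For what it's worth, the kind of two-stage workaround I have in mind would look roughly like the sketch below. The hdfs:///tmp/sorted_output path and the final distcp copy step are just illustrative placeholders, not something I have tested:

from pyspark import SparkContext

sc = SparkContext()
data = sc.textFile('gs://bucket-name/input_blob_path')
sorted_data = data.sortBy(lambda x: sort_criteria(x))

# Stage 1: write to cluster-local HDFS, so the _temporary/... files are
# created, renamed and cleaned up on HDFS rather than as GCS objects.
sorted_data.saveAsTextFile(
    'hdfs:///tmp/sorted_output',
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)

# Stage 2: copy the committed output files from HDFS to GCS afterwards,
# e.g. from a shell on the cluster:
#   hadoop distcp hdfs:///tmp/sorted_output gs://bucket-name/output_blob_path

Is that a reasonable approach, or is there a cleaner way to keep the intermediate files off GCS?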
We were using a Dataproc Spark cluster (16-core, 60 GB VMs) with 10 worker nodes. The input data volume was 10 TB.