df.write.mode("append").parquet(path)

I'm using this to write parquet files to an S3 location. It seems that, in order to write the files, Spark also creates a /_temporary directory and deletes it after use, so I got an access-denied error. The admin on our AWS account doesn't want to grant the job delete permission on that folder.

I proposed writing the files to another folder where delete permission can be granted and then copying them over, but the admin still wants me to write the files directly to the destination folder.

Is there a configuration I can set so that PySpark doesn't delete the temporary directory?

Dylan

1 Answer


I don't think there is such an option for the _temporary folder.

But if you're running your Spark job on an EMR cluster, you can write first to your cluster's local HDFS and then copy the data to S3 using Hadoop's FileUtil.copy function.

In PySpark, you can access this function via the JVM gateway like this:

sc._gateway.jvm.org.apache.hadoop.fs.FileUtil 
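
For example, here is a minimal sketch of that approach. It assumes the job runs on an EMR cluster with HDFS available; the staging and destination paths are placeholders, and df is the DataFrame from the question:

# Sketch: stage the parquet files on HDFS, then copy them to S3 with
# Hadoop's FileUtil.copy. Paths below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-then-copy").getOrCreate()
sc = spark.sparkContext

hdfs_path = "hdfs:///tmp/staging/my_table"       # hypothetical staging location
s3_path = "s3://my-bucket/destination/my_table"  # hypothetical destination

# 1. Write to HDFS, where Spark is free to create and delete its _temporary dir.
#    df is the DataFrame from the question.
df.write.mode("append").parquet(hdfs_path)

# 2. Copy the committed files from HDFS to S3 through the JVM gateway.
jvm = sc._gateway.jvm
hadoop_conf = sc._jsc.hadoopConfiguration()
src = jvm.org.apache.hadoop.fs.Path(hdfs_path)
dst = jvm.org.apache.hadoop.fs.Path(s3_path)

jvm.org.apache.hadoop.fs.FileUtil.copy(
    src.getFileSystem(hadoop_conf),  # source filesystem (HDFS)
    src,
    dst.getFileSystem(hadoop_conf),  # destination filesystem (S3)
    dst,
    False,                           # deleteSource: keep the HDFS copy
    hadoop_conf
)

With this layout, Spark's _temporary directory only ever lives on HDFS, so the job doesn't need delete permission on the S3 destination folder for the commit step.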
blackbishop