
I am currently trying to save a Spark DataFrame to Azure Data Lake Storage (ADLS) Gen1. While doing so I receive the following throttling error:

org.apache.spark.SparkException: Job aborted. Caused by: com.microsoft.azure.datalake.store.ADLException: Error creating file /user/DEGI/CLCPM_DATA/fraud_project/policy_risk_motorcar_with_lookups/part-00000-34d88646-3755-488d-af00-ef2e201240c8-c000.snappy.parquet
Operation CREATE failed with HTTP401 : null
Last encountered exception thrown after 2 tries. [HTTP401(null),HTTP401(null)]

I read in the documentation that the throttling occurs when CREATE limits are exceeded, which then causes the job to abort. The documentation also gives three reasons why this may happen:

  1. Your application creates a large number of small files.
  2. External applications create a large number of files.
  3. The current limit for the subscription is too low.

While I do not think that my subscription limit is too low, it may be the case that my application is creating too many parquet files. Does anyone know how to tell how many files will be created when saving as a table? How can I find out the maximum number of files that I am allowed to create?

The code that I use to create the table looks as follows:

df.write.format("delta").mode("overwrite").saveAsTable("database_name.df", path='adl://my path to storage')
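
So far the only thing I could think of checking is the partition count of the dataframe, assuming that roughly corresponds to the number of part files written (just my guess, using the same df as above):

df.rdd.getNumPartitions()  # number of partitions, presumably a lower bound on the part files written

But I am not sure whether that maps one-to-one to the parquet files that end up on ADLS.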
 

Also, I was able to write a smaller test dataframe without any problems, and the permissions of the folder in ADLS are set correctly.


1 Answer


The error you have doesn't look like an issue with the number of files: HTTP 401 is an authorization (unauthorized) error. Nonetheless:

Spark writes at least as many files as there are partitions. So what you want to do is repartition your dataframe. There are several repartitioning APIs; to reduce the number of partitions without a full shuffle of the data, coalesce() is recommended:

df.coalesce(10).write....
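
For example, a rough sketch plugging coalesce() into the write call from your question (the table name and path are your placeholders, and 10 is just an arbitrary target partition count):

df.coalesce(10) \
    .write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("database_name.df", path='adl://my path to storage')

With coalesce(10) the write produces at most 10 part files (one per partition) without a full shuffle; repartition(10) would give the same file count but shuffles all the data.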

You can also read up on how Spark partitioning works.

Juh_
  • Even after `coalesce` it still fails. I think the unauthorized issue occurs because the write operation takes too long. Apparently it is a known [issue](https://learn.microsoft.com/en-us/azure/databricks/kb/data-sources/job-fails-adls-hour). I guess I will be restructuring my query... – DataBach Oct 14 '21 at 15:39
  • I expected that :\ – Juh_ Oct 14 '21 at 16:13
  • Yeah.. Thanks anyways ! – DataBach Oct 14 '21 at 18:10