
I am trying to write a pyspark df to parquet like this:

df.write.format("parquet") \
    .mode("overwrite") \
    .save("gs://my_bucket/my_folder/filename")

This data frame has millions of rows, and I have been able to write a similarly sized data frame before in a few minutes. This one, however, has been running for 30+ minutes, and I can only see _temporary/0/ under the destination path, with nothing else.

I am able to easily write a small data frame and see that it works, but for some reason this one does not. There doesn't appear to be anything wrong with the data frame.

Could there be any reason, other than an issue with the data frame itself, why this is taking forever and nothing is being written? Other similarly sized data frames have had no issues.

formicaman

1 Answer

  • Your files won't appear until the Spark job has completed.
  • Once your job has completed successfully, you will see the files.
  • This is explained here: Spark _temporary creation reason
  • You may be able to see your final files being created inside the _temporary directory before they are moved to their final destination.
  • However, remember that Spark must complete all tasks in a stage before moving on to the next stage. If one of your tasks gets stuck in a stage before the write stage, it may appear that your job has frozen, and you will not see any files being written.
  • Your best bet for debugging this is the Spark UI. It provides clear visuals of the progress of all your tasks through the stages.
  • The most common reason for tasks getting stuck is partition skew, where one task is doing much more work than the others and therefore takes much longer to complete (one way to check for this is sketched below, after this list). But there are other reasons why your job may appear frozen. Again, the Spark UI is really the best/only way to get a good understanding of how your job is progressing.
  • In any event, the Spark UI is always helpful for understanding bottlenecks or stalled jobs.
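
As a rough illustration, one quick way to check for partition skew before the write is to count rows per partition (df and the partition_id column name here are just placeholders for your own data frame):

from pyspark.sql import functions as F

# Count rows per Spark partition. A handful of partitions holding most
# of the rows suggests one task will run far longer than the rest.
(df.withColumn("partition_id", F.spark_partition_id())
   .groupBy("partition_id")
   .count()
   .orderBy(F.desc("count"))
   .show(20))

If the counts are badly skewed, repartitioning before the write (for example df.repartition(200), or repartitioning on a higher-cardinality column) is one common mitigation.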
Arran Duff
  • The issue is that for other data frames I see the temporary folder, but I also see the parquet parts being added under it. For this one, there is just the temporary folder, nothing else. I've written similarly sized dfs, but this one seems like it's not doing anything at all. And no error message. – formicaman Dec 07 '21 at 18:21
  • Unless you have some error logs, it's hard to say what the issue is. Your best bet is to have a look at the Spark UI and see if you can spot where the bottleneck is. You might have a lot of skew in your data, where one task isn't completing. – Arran Duff Dec 07 '21 at 18:24
  • If your job is stuck on a particular task, say for a very large partition, then it's very likely that no data will be written until that task completes. The Spark UI is the best way to debug this, as it gives you real-time info on how your tasks are proceeding. You get handy visuals to see whether some partitions are taking unusually long to complete. – Arran Duff Dec 07 '21 at 18:28
  • It is important to know where the bottleneck is, and the error logs would help to understand what's causing the problem. As recommended by @Arran Duff, use the Spark UI to see if any parts of the data are causing the issue, and try to monitor how long the data takes to process; some of it might take longer than expected or might not be processable at all. – Andres Fiesco Casasola Dec 07 '21 at 23:32
  • I have updated the answer with these extra comments. – Arran Duff Dec 08 '21 at 11:57