
I'm looking for a way to access the unique part(s) of the parquet filename when saving a Spark DataFrame as Parquet with PySpark.

I just read in Change output filename prefix for DataFrame.write() that changing the output filename prefix for DataFrame.write() is not possible, but I'd like to know whether there is a way to access the values the RecordWriter uses to build up the filename.

I had a look at the source code and saw that it is configuration.get("spark.sql.sources.writeJobUUID"). Does this property get initialized earlier, and is it also accessible through PySpark?
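
For what it's worth, a minimal sketch of how one might probe for that property from the PySpark driver; whether it is ever visible there is exactly the open question, and both handles used below (`sc._jsc`, `sc._conf`) are internal, not public API:

```python
# Hedged sketch (PySpark 1.x era): probe the two driver-side configurations.
# Caveat: in the Scala source the UUID is placed into the Hadoop *job*
# configuration by the write path itself, so it may never appear here.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="write-job-uuid-probe")
sqlContext = SQLContext(sc)

# 1) The driver-side Hadoop configuration, reached via the JVM gateway.
hadoop_conf = sc._jsc.hadoopConfiguration()
print(hadoop_conf.get("spark.sql.sources.writeJobUUID"))   # expected: None

# 2) The SparkConf backing this context (internal attribute).
print(sc._conf.get("spark.sql.sources.writeJobUUID", "not set"))
```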

I'd like to use it for logging purposes, to match a specific Spark job to the parquet files written (so I can e.g. remove all output by a specific job in different output partitions).
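
As a hedged alternative for that logging goal, one could record the part files a write produced right after the fact, keyed by an identifier of one's own. This sketch assumes the target path is new and holds only this job's output (the partitioned/append case would need a before/after listing diff); `df`, `job_id` and the path are illustrative:

```python
# Sketch: generate our own job id, write, then list what appeared on disk.
import uuid

job_id = str(uuid.uuid4())                   # our identifier, not Spark's internal UUID
output_path = "hdfs:///tmp/example_output"   # hypothetical output location

df.write.parquet(output_path)

# List the written part files via the Hadoop FileSystem API (JVM gateway).
jvm = sc._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
status = fs.listStatus(jvm.org.apache.hadoop.fs.Path(output_path))
part_files = [s.getPath().toString() for s in status if "part-" in s.getPath().getName()]

# Log the mapping so the job can later be matched to (and cleaned up from) its files.
print(job_id, part_files)
```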

  • You're probably better off adding some kind of JobID to your _data_, and _partitioning_ by that column - that way each job would create its own partitions which you can later read / write / delete at will, without having to dig into Parquet internals. – Tzach Zohar Apr 02 '16 at 07:46
  • Thanks Tzach, that is indeed not a bad idea though we are already using a lot of partitioning and were keen on exploring this option. I guess it's not that straight-forward then.. – Base_v Apr 04 '16 at 08:06
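
For reference, a minimal PySpark sketch of what Tzach Zohar suggests in the comments above; `df` stands for the DataFrame being written, and the job id column name and output path are illustrative:

```python
# Stamp the rows with a job id column and partition the output by it, so one
# job's output can later be read or deleted wholesale.
import uuid
from pyspark.sql import functions as F

job_id = str(uuid.uuid4())

(df.withColumn("job_id", F.lit(job_id))
   .write
   .partitionBy("job_id")   # can be combined with the existing partition columns
   .parquet("hdfs:///tmp/example_output"))

# Everything this job wrote now lives under .../job_id=<uuid>/ and can be
# removed without touching other jobs' output.
```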

0 Answers