I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like this:
dataFrame.write.mode(SaveMode.Overwrite).partitionBy("eventdate", "hour", "processtime").parquet(path)
As mentioned in this question, partitionBy will delete the full existing hierarchy of partitions at path and replace them with the partitions in dataFrame. Since new incremental data for a particular day will come in periodically, what I want is to replace only those partitions in the hierarchy that dataFrame has data for, leaving the others untouched.
To do this it appears I need to save each partition individually using its full path, something like this:
singlePartition.write.mode(SaveMode.Overwrite).parquet(path + "/eventdate=2017-01-01/hour=0/processtime=1234567890")
However, I'm having trouble understanding the best way to organize the data into single-partition DataFrames so that I can write them out using their full paths. One idea was something like:
dataFrame.repartition("eventdate", "hour", "processtime").foreachPartition ...
But foreachPartition operates on an Iterator[Row], which is not ideal for writing out to Parquet format.
I also considered using a select...distinct eventdate, hour, processtime to obtain the list of partitions, then filtering the original DataFrame by each of those partition values and saving each result to its full partitioned path. But the distinct query plus a separate filter for each partition doesn't seem very efficient, since it would mean a lot of filter/write operations.
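Roughly, this is the kind of thing I had in mind (just a sketch, assuming the set of partition keys is small enough to collect to the driver, and that dropping the partition columns before writing is acceptable since they are encoded in the path):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Collect the distinct partition keys present in the incoming data.
val partitionKeys = dataFrame
  .select("eventdate", "hour", "processtime")
  .distinct()
  .collect()

// Filter the original DataFrame once per partition and write it to its exact directory.
partitionKeys.foreach { row =>
  val eventdate   = row.get(0)
  val hour        = row.get(1)
  val processtime = row.get(2)

  dataFrame
    .filter(col("eventdate") === eventdate &&
            col("hour") === hour &&
            col("processtime") === processtime)
    .drop("eventdate", "hour", "processtime") // partition columns live in the path
    .write
    .mode(SaveMode.Overwrite)
    .parquet(s"$path/eventdate=$eventdate/hour=$hour/processtime=$processtime")
}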
Is there a cleaner way to replace only the new partitions while preserving the existing ones for which dataFrame has no data?
Thanks for reading.
Spark version: 2.1