I use the Spark window function row_number() to generate an ID column for a complex DataFrame with nested structures. Afterwards, I extract parts of this DataFrame into multiple output tables, each of which includes this key.
However, Spark only materializes a DataFrame once an action is triggered, so the ID is actually computed at the very end, when each extracted table is saved to HDFS. With large DataFrames and many transformations, Spark may shuffle the data differently on each evaluation, which can change the values that row_number() assigns.
Since I generate multiple tables from a single DataFrame, the ID column must remain consistent across all of them. That means it needs to be generated once, before the tables are extracted, and not recomputed dynamically for each output.
This question originates from *Would a forced Spark DataFrame materialization work as a checkpoint?*, which explains the root issue in more detail.
But my question here is: how do I generate such an ID column exactly once, store it as a fixed value, and then use it while extracting the various tables from the DataFrame, without the lineage regenerating the IDs at the end of each extraction?