I use the Spark window function row_number() to generate an ID column for a complex DataFrame with nested structures. Afterwards, I extract parts of this DataFrame into multiple output tables, each of which includes this key.
However, Spark only materializes a DataFrame once an action is triggered, so the ID is actually computed at the very end, when each extracted table is saved to HDFS. With large DataFrames and many transformations, Spark may shuffle the data differently on each evaluation, which can change the values that row_number() assigns.
Since I generate multiple tables from a single DataFrame, the ID column must remain consistent across all of them. That means it needs to be generated once, before the tables are extracted, and not recomputed dynamically for each output.
This question originates from *Would a forced Spark DataFrame materialization work as a checkpoint?*, which explains the root issue in more detail.
But my question here is: how do I generate such an ID column exactly once, store it as a fixed value, and then use it while extracting the various tables from the DataFrame, without the lineage regenerating the IDs at the end of each extraction?