I want to assign a unique ID to each row of my Dataset. I know of two ways to implement this:
First option:
import org.apache.spark.sql.expressions.Window;
ds.withColumn("id", row_number().over(Window.orderBy("a column")))
Second option:
ds.withColumn("id", monotonically_increasing_id())
The second option does not produce sequential IDs, but that doesn't matter for my use case.
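To illustrate what I mean by non-sequential: as far as I understand from the Spark documentation, the current implementation of monotonically_increasing_id puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. Below is a minimal plain-Java sketch of that layout. It does not use Spark at all, and the class name, the helper `idFor`, and the partition sizes are made up for illustration only.

```java
public class MonotonicIdSketch {

    // Assumed layout (per the Spark docs): partition index in the upper 31 bits,
    // record number within the partition in the lower 33 bits.
    public static long idFor(int partitionIndex, long rowInPartition) {
        return ((long) partitionIndex << 33) + rowInPartition;
    }

    public static void main(String[] args) {
        // Hypothetical data: two partitions holding 3 and 2 rows.
        int[] rowsPerPartition = {3, 2};

        for (int p = 0; p < rowsPerPartition.length; p++) {
            for (long r = 0; r < rowsPerPartition[p]; r++) {
                System.out.println("partition " + p + ", row " + r + " -> id " + idFor(p, r));
            }
        }
    }
}
```

With these made-up partition sizes the IDs come out as 0, 1, 2, 8589934592, 8589934593: unique and increasing, but with a large gap between partitions.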
I'm trying to figure out whether either implementation has performance issues, that is, whether one option is much slower than the other. I'm looking for something more meaningful than: "monotonically_increasing_id is much faster than row_number because it's not sequential or ..."