
I want to assign a unique ID to each row of my dataset. I know that there are two implementation options:

  1. First option:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number
    
    ds.withColumn("id", row_number().over(Window.orderBy("a column")))
    
  2. Second option:

    import org.apache.spark.sql.functions.monotonically_increasing_id
    
    df.withColumn("id", monotonically_increasing_id())
    

The second option does not produce sequential IDs, but that doesn't really matter for my use case.

What I'm trying to figure out is whether these implementations have any performance issues, that is, whether one of the options is very slow compared to the other. I'm looking for something more meaningful than: "monotonically_increasing_id is very fast over row_number because it's not sequential or ..."

Henrique Goulart

2 Answers


monotonically_increasing_id is distributed, and it performs according to the partitioning of the data.

whereas

row_number() using a Window function without partitionBy (as in your case) is not distributed. When we don't define partitionBy, all the data is sent to one executor to generate the row numbers.

Thus, it is certain that monotonically_increasing_id() will perform better than row_number() without partitionBy defined.
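
To make the partition-dependent behavior concrete, here is a minimal local sketch (the session setup and sample data are made up for illustration). monotonically_increasing_id() packs the partition index into the upper 31 bits of the 64-bit ID and a per-partition record counter into the lower 33 bits, so the values are unique and increasing but not consecutive across partitions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{monotonically_increasing_id, spark_partition_id}
    
    val spark = SparkSession.builder().master("local[4]").appName("id-demo").getOrCreate()
    
    // Force several partitions so the per-partition offsets become visible.
    val df = spark.range(0, 8).repartition(4)
    
    df.withColumn("id", monotonically_increasing_id())
      .withColumn("part", spark_partition_id())
      .show()
    
    // Each partition numbers its own rows independently: within a partition
    // the IDs are consecutive, but partition n starts at n << 33 (e.g.
    // 8589934592 for partition 1), so the result is unique and increasing,
    // not sequential.

Because each task only numbers its own rows, no shuffle is needed and the work stays fully parallel.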

Ramesh Maharjan

TL;DR It is not even a competition.

Never use:

    row_number().over(Window.orderBy("a column"))

for anything other than summarizing results that already fit in a single machine's memory.

To apply a window function without PARTITION BY, Spark has to shuffle all the data into a single partition. On any large dataset this will simply crash the application, so whether the IDs are sequential or the computation is distributed won't even matter.
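
You can verify the single-partition shuffle from the physical plan; a minimal sketch, where df and "a column" stand in for the asker's DataFrame and ordering column:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number
    
    // A window with ORDER BY but no PARTITION BY forces Spark to move
    // everything into one partition before it can number the rows.
    val w = Window.orderBy("a column")
    df.withColumn("id", row_number().over(w)).explain()
    
    // The physical plan contains "Exchange SinglePartition": every row is
    // shuffled to one task before numbering. Spark also logs a warning that
    // no partition is defined for the window operation and that all data
    // will be moved to a single partition.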

Alper t. Turker