
I'm trying to find a PySpark equivalent of the following snippet (reference), which creates a unique id for every unique combination of two columns.

Pandas approach:

df['my_id'] = df.groupby(['foo', 'bar'], sort=False).ngroup() + 1

I tried the following, but it's creating more ids than required:

from pyspark.sql import functions as F, Window

df = df.withColumn("my_id", F.row_number().over(Window.orderBy('foo', 'bar')))

1 Answer


Instead of row_number, use dense_rank. row_number assigns a different number to every row, even when the (foo, bar) values repeat, whereas dense_rank gives all rows with the same (foo, bar) combination the same rank, with no gaps:

from pyspark.sql import functions as F, Window
df = spark.createDataFrame(
    [('r1', 'ph1'),
     ('r1', 'ph1'),
     ('r1', 'ph2'),
     ('s4', 'ph3'),
     ('s3', 'ph2'),
     ('s3', 'ph2')],
    ['foo', 'bar'])

df = df.withColumn("my_id", F.dense_rank().over(Window.orderBy('foo', 'bar')))
df.show()
# +---+---+-----+
# |foo|bar|my_id|
# +---+---+-----+
# | r1|ph1|    1|
# | r1|ph1|    1|
# | r1|ph2|    2|
# | s3|ph2|    3|
# | s3|ph2|    3|
# | s4|ph3|    4|
# +---+---+-----+
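
Note that Window.orderBy without a partitionBy moves all rows into a single partition (Spark logs a "No Partition Defined for Window operation" warning), so on large data it may be cheaper to rank only the distinct pairs and join the ids back. A rough sketch of that idea, assuming the same df as above:

from pyspark.sql import functions as F, Window

# Rank only the distinct (foo, bar) pairs, then join the ids back onto df,
# so only the (usually much smaller) distinct set goes through the
# single-partition window.
ids = (
    df.select('foo', 'bar')
      .distinct()
      .withColumn('my_id', F.dense_rank().over(Window.orderBy('foo', 'bar')))
)
df = df.join(ids, on=['foo', 'bar'], how='left')

Also note that pandas' ngroup(sort=False) numbers groups in order of first appearance, while both approaches here number them in sorted order of (foo, bar); there is still exactly one id per unique combination, just assigned in a different order.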