
I'm trying to find a PySpark equivalent of the following snippet (reference), which creates a unique id for every unique combination of two columns.

Pandas approach:

df['my_id'] = df.groupby(['foo', 'bar'], sort=False).ngroup() + 1

I tried the following, but it's creating more ids than required:

from pyspark.sql import functions as F, Window

df = df.withColumn("my_id", F.row_number().over(Window.orderBy('foo', 'bar')))

1 Answer


Instead of row_number, use dense_rank. row_number assigns a different number to every row, even when the (foo, bar) values repeat, whereas dense_rank gives all rows with the same (foo, bar) combination the same rank, with no gaps:

from pyspark.sql import functions as F, Window
df = spark.createDataFrame(
    [('r1', 'ph1'),
     ('r1', 'ph1'),
     ('r1', 'ph2'),
     ('s4', 'ph3'),
     ('s3', 'ph2'),
     ('s3', 'ph2')],
    ['foo', 'bar'])

df = df.withColumn("my_id", F.dense_rank().over(Window.orderBy('foo', 'bar')))
df.show()
# +---+---+-----+
# |foo|bar|my_id|
# +---+---+-----+
# | r1|ph1|    1|
# | r1|ph1|    1|
# | r1|ph2|    2|
# | s3|ph2|    3|
# | s3|ph2|    3|
# | s4|ph3|    4|
# +---+---+-----+
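
Note that Window.orderBy without a partitionBy moves all rows into a single partition (Spark logs a "No Partition Defined for Window operation" warning), so on large data it may be cheaper to rank only the distinct pairs and join the ids back. A rough sketch of that idea, assuming the same df as above:

from pyspark.sql import functions as F, Window

# Rank only the distinct (foo, bar) pairs, then join the ids back onto df,
# so only the (usually much smaller) distinct set goes through the
# single-partition window.
ids = (
    df.select('foo', 'bar')
      .distinct()
      .withColumn('my_id', F.dense_rank().over(Window.orderBy('foo', 'bar')))
)
df = df.join(ids, on=['foo', 'bar'], how='left')

Also note that pandas' ngroup(sort=False) numbers groups in order of first appearance, while both approaches here number them in sorted order of (foo, bar); there is still exactly one id per unique combination, just assigned in a different order.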