I'd like to enumerate grouped values just like with Pandas:

Enumerate each row for each group in a DataFrame
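
For example, in Pandas (a minimal sketch; the column names group and value are placeholders, not from the question):

import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [10, 20, 5]})
# cumcount() numbers the rows within each group, starting at 0
df["rn"] = df.groupby("group").cumcount()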

What is a way to do this in Spark/Python?


2 Answers

With the row_number window function:

from pyspark.sql.functions import row_number
from pyspark.sql import Window

# Number the rows within each partition ("group"), ordered inside the group
w = Window.partitionBy("some_column").orderBy("some_other_column")
df.withColumn("rn", row_number().over(w))
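
For example (a minimal sketch; the column names group and value and the sample rows are assumptions for illustration, not from the answer):

from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.getOrCreate()

# Hypothetical data, just to show that rn restarts at 1 in each group
df = spark.createDataFrame(
    [("a", 10), ("a", 20), ("b", 5)], ["group", "value"]
)

w = Window.partitionBy("group").orderBy("value")
df.withColumn("rn", row_number().over(w)).show()
# rn is 1, 2 for the two "a" rows and 1 for the "b" row;
# the order of groups in the output may vary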

You can achieve this at the RDD level by doing:

rdd = sc.parallelize(['a', 'b', 'c'])
# zipWithIndex() pairs each element with its position in the RDD
df = spark.createDataFrame(rdd.zipWithIndex())
df.show()

It will result in:

+---+---+
| _1| _2|
+---+---+
|  a|  0|
|  b|  1|
|  c|  2|
+---+---+

If you only need a unique ID, not a real consecutive index, you can also use zipWithUniqueId(), which is more efficient, since it is computed locally on each partition.
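
For instance (a minimal sketch on the same RDD; the resulting IDs are unique but not guaranteed to be consecutive):

rdd = sc.parallelize(['a', 'b', 'c'])
# zipWithUniqueId() gives items in the k-th of n partitions the IDs
# k, n+k, 2n+k, ..., so no extra Spark job is needed to compute indices
df = spark.createDataFrame(rdd.zipWithUniqueId())
df.show()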
