26

Using pyspark, I'd like to be able to group a spark dataframe, sort the group, and then provide a row number. So

Group    Date
  A      2000
  A      2002
  A      2007
  B      1999
  B      2015

Would become

Group    Date    row_num
  A      2000      0
  A      2002      1
  A      2007      2
  B      1999      0
  B      2015      1
ZygD
  • 22,092
  • 39
  • 79
  • 102
Luke
  • 6,699
  • 13
  • 50
  • 88
  • Why all the hate? – Luke Aug 04 '17 at 19:40
  • 4
    Unfortunately, there is the wrong impression that a question **must** include code tested by yourself (and didn't work), although according to the SO guidelines for asking, this is certainly **not** the case: https://stackoverflow.com/help/on-topic – desertnaut Aug 04 '17 at 22:08

2 Answers2

38

Use window function:

from pyspark.sql.window import *
from pyspark.sql.functions import row_number

df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))
desertnaut
  • 57,590
  • 26
  • 140
  • 166
user8419108
  • 396
  • 4
  • 2
  • Nice! I inserted a missing comma inside `withColumn`... :) – desertnaut Aug 04 '17 at 19:25
  • 5
    Welcome to SO and congrats for answering your first question! Keep on and don't get disappointed (it can be a harsh place occasionally...) - check also my edit to see how you can use code highlighting – desertnaut Aug 04 '17 at 20:04
  • @desertnaut can we do the ordering reserving the natural order of the dataframe rather than `orderby` ? – PolarBear10 Jul 30 '18 at 10:28
6

The accepted solution almost has it right. Here is the solution based on the output requested in the question:

df = spark.createDataFrame([("A", 2000), ("A", 2002), ("A", 2007), ("B", 1999), ("B", 2015)], ["Group", "Date"])

+-----+----+
|Group|Date|
+-----+----+
|    A|2000|
|    A|2002|
|    A|2007|
|    B|1999|
|    B|2015|
+-----+----+

# accepted solution above


from pyspark.sql.window import *
from pyspark.sql.functions import row_number

df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))


# accepted solution above output


+-----+----+-------+
|Group|Date|row_num|
+-----+----+-------+
|    B|1999|      1|
|    B|2015|      2|
|    A|2000|      1|
|    A|2002|      2|
|    A|2007|      3|
+-----+----+-------+

As you can see, the function row_number starts from 1 and not 0 and the requested question wanted to have the row_num starting from 0. Simple change like I have made below:

df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))-1).show()

Output :

+-----+----+-------+
|Group|Date|row_num|
+-----+----+-------+
|    B|1999|      0|
|    B|2015|      1|
|    A|2000|      0|
|    A|2002|      1|
|    A|2007|      2|
+-----+----+-------+

Then you can sort the "Group" column in whatever order you want. The above solution almost has it but it is important to remember that row_number begins with 1 and not 0.

Rahul P
  • 2,493
  • 2
  • 17
  • 31