
In PySpark, I have a DataFrame in this format:

CODE  |  TITLE  |  POSITION
A     |  per    |   1
A     |  eis    |   3
A     |  fon    |   4
A     |  dat    |   5
B     |  jem    |   2
B     |  neu    |   3
B     |  tri    |   5
B     |  nok    |   6

and I want to transform it into this:

CODE  |  TITLE  |  POSITION
A     |  per    |   1
A     |  eis    |   2
A     |  fon    |   3
A     |  dat    |   4
B     |  jem    |   1
B     |  neu    |   2
B     |  tri    |   3
B     |  nok    |   4

The idea is that the POSITION column should start at 1 within each CODE and increase without gaps. For example, for CODE A, position 2 is missing, so 3 becomes 2, 4 becomes 3, and 5 becomes 4.

How can I do that in PySpark?

Thank you for your help.

1 Answer


With a slightly simpler DataFrame:

df.show()
+----+-----+--------+
|CODE|TITLE|POSITION|
+----+-----+--------+
|   A|   AA|       1|
|   A|   BB|       3|
|   A|   CC|       4|
|   A|   DD|       5|
|   B|   EE|       2|
|   B|   FF|       3|
|   B|   GG|       5|
|   B|   HH|       6|
+----+-----+--------+

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

df.withColumn('POSITION', row_number().over(Window.partitionBy('CODE').orderBy('POSITION'))).show()
+----+-----+--------+
|CODE|TITLE|POSITION|
+----+-----+--------+
|   B|   EE|       1|
|   B|   FF|       2|
|   B|   GG|       3|
|   B|   HH|       4|
|   A|   AA|       1|
|   A|   BB|       2|
|   A|   CC|       3|
|   A|   DD|       4|
+----+-----+--------+
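The `row_number()` window function assigns consecutive numbers 1, 2, 3, … within each `CODE` partition, ordered by the original `POSITION`, which is exactly the gap-free renumbering you want. (Note that Spark does not guarantee the order of rows in the output; add `.orderBy('CODE', 'POSITION')` if you need the result sorted.) To make the logic concrete, here is a plain-Python sketch of what the window computation does, using hypothetical tuple data rather than a Spark DataFrame:

```python
from itertools import groupby
from operator import itemgetter

# (CODE, TITLE, POSITION) rows, mirroring the example DataFrame
rows = [
    ("A", "AA", 1), ("A", "BB", 3), ("A", "CC", 4), ("A", "DD", 5),
    ("B", "EE", 2), ("B", "FF", 3), ("B", "GG", 5), ("B", "HH", 6),
]

# Mimic Window.partitionBy('CODE').orderBy('POSITION') + row_number():
# sort by (CODE, POSITION), then number rows 1, 2, 3, ... within each CODE.
result = []
for code, group in groupby(sorted(rows, key=itemgetter(0, 2)), key=itemgetter(0)):
    for new_pos, (c, title, _old_pos) in enumerate(group, start=1):
        result.append((c, title, new_pos))

print(result)
```

This is only an illustration of the ranking logic; for real data you would stay inside Spark, since the window function runs distributed and avoids collecting rows to the driver.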
Michael Szczesny