0

I have a dataframe, as follows:

customer_id  order_ts order_nbr item_nbr
162038  04/04/23 18:42  1258    972395
162038  04/04/23 18:42  1258    551984
162038  04/04/23 18:42  1258    488298
162038  04/04/23 18:42  1258    649230
162038  26/02/23 16:28  2715    372225
162038  26/02/23 16:28  2715    911716
162038  26/02/23 16:28  2715    696677
162038  26/02/23 16:28  2715    229455
162038  26/02/23 16:28  2715    870016
162038  29/01/23 13:07  1171    113719
162038  29/01/23 13:07  1171    553461
162060  01/05/23 18:42  1259    300911
162060  01/05/23 18:42  1259    574962
162060  01/05/23 18:42  1259    843300
162060  01/05/23 18:42  1259    173719
162060  05/05/23 18:42  2719    254899
162060  05/05/23 18:42  2719    776553
162060  05/05/23 18:42  2719    244739
162060  05/05/23 18:42  2719    170742
162060  05/05/23 18:42  2719    525719
162060  10/05/23 18:42  1161    896919
162060  10/05/23 18:42  1161    759465

I am interested to create a column "row_num" as follows:

customer_id  order_ts order_nbr item_nbr    row_num
162038  04/04/23 18:42  1258    972395  1
162038  04/04/23 18:42  1258    551984  1
162038  04/04/23 18:42  1258    488298  1
162038  04/04/23 18:42  1258    649230  1
162038  26/02/23 16:28  2715    372225  2
162038  26/02/23 16:28  2715    911716  2
162038  26/02/23 16:28  2715    696677  2
162038  26/02/23 16:28  2715    229455  2
162038  26/02/23 16:28  2715    870016  2
162038  29/01/23 13:07  1171    113719  3
162038  29/01/23 13:07  1171    553461  3
162060  01/05/23 18:42  1259    300911  1
162060  01/05/23 18:42  1259    574962  1
162060  01/05/23 18:42  1259    843300  1
162060  01/05/23 18:42  1259    173719  1
162060  05/05/23 18:42  2719    254899  2
162060  05/05/23 18:42  2719    776553  2
162060  05/05/23 18:42  2719    244739  2
162060  05/05/23 18:42  2719    170742  2
162060  05/05/23 18:42  2719    525719  2
162060  10/05/23 18:42  1161    896919  3
162060  10/05/23 18:42  1161    759465  3

"row_num" column is basically unique for a particular customer_id, order_ts, order_nbr. For example, customer_id 162038, order_ts = 04/04/23 18:42, order_nbr = 1258, row_num = 1 for all the item numbers.

I did as follows:

windowDept = Window.partitionBy("customer_id","order_ts","order_nbr").orderBy(col("order_ts").desc())
df1 = df.withColumn("row_num",row_number().over(windowDept))

But I am getting this as output:

enter image description here

Any help will be highly appreciated. Thanks.

Abdennacer Lachiheb
  • 4,388
  • 7
  • 30
  • 61
user3642360
  • 762
  • 10
  • 23

1 Answers1

1

You are not partitioning correctly, you should partition by customer_id then order by the 3 columns to maintains the order that you specified:

window_spec = Window.partitionBy("customer_id").orderBy("order_ts", "order_nbr")
df = df.withColumn("row_num", dense_rank().over(window_spec))
df.show()

Result:

+-----------+--------------+---------+--------+-------+
|customer_id|      order_ts|order_nbr|item_nbr|row_num|
+-----------+--------------+---------+--------+-------+
|     162038|04/04/23 18:42|     1258|  972395|      1|
|     162038|04/04/23 18:42|     1258|  551984|      1|
|     162038|04/04/23 18:42|     1258|  488298|      1|
|     162038|04/04/23 18:42|     1258|  649230|      1|
|     162038|26/02/23 16:28|     2715|  372225|      2|
|     162038|26/02/23 16:28|     2715|  911716|      2|
|     162038|26/02/23 16:28|     2715|  696677|      2|
|     162038|26/02/23 16:28|     2715|  229455|      2|
|     162038|26/02/23 16:28|     2715|  870016|      2|
|     162038|29/01/23 13:07|     1171|  113719|      3|
|     162038|29/01/23 13:07|     1171|  553461|      3|
|     162060|01/05/23 18:42|     1259|  300911|      1|
|     162060|01/05/23 18:42|     1259|  574962|      1|
|     162060|01/05/23 18:42|     1259|  843300|      1|
|     162060|01/05/23 18:42|     1259|  173719|      1|
|     162060|05/05/23 18:42|     2719|  254899|      2|
|     162060|05/05/23 18:42|     2719|  776553|      2|
|     162060|05/05/23 18:42|     2719|  244739|      2|
|     162060|05/05/23 18:42|     2719|  170742|      2|
|     162060|05/05/23 18:42|     2719|  525719|      2|
|     162060|10/05/23 18:42|     1161|  896919|      3|
|     162060|10/05/23 18:42|     1161|  759465|      3|
+-----------+--------------+---------+--------+-------+

EDIT:

Also, you can delete one of the 2 orderBy columns "order_ts" or "order_nbr" if you are sure that one of them is sufficient to maintain the order, so you window will be like this:

window_spec = Window.partitionBy("customer_id").orderBy("order_ts")
Abdennacer Lachiheb
  • 4,388
  • 7
  • 30
  • 61