I am using collect_set over a window on a DataFrame to aggregate three columns.
My df is as below:
id acc_no acc_name cust_id
1 111 ABC 88
1 222 XYZ 99
Below is the code snippet:
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy('id').orderBy('acc_no')
df1 = df.withColumn(
    'cust_id_new',
    F.collect_set('cust_id').over(w)
).withColumn(
    'acc_no_new',
    F.collect_set('acc_no').over(w)
).withColumn(
    'acc_name_new',
    F.collect_set('acc_name').over(w)
).drop('cust_id', 'acc_no', 'acc_name')
In this case, my output is as follows:
id acc_no_new acc_name_new cust_id_new
1 [111,222] [XYZ,ABC] [88,99]
So here, acc_no and cust_id come out in the expected order, but the order of acc_name does not: acc_no 111 corresponds to acc_name ABC, yet XYZ appears first.
Can someone please let me know why this is happening and what the solution would be?
I suspect this issue occurs only for string columns, but I might be wrong. Please help.
This is similar to the thread below, but I am still getting an error when I try its approach.
How to maintain sort order in PySpark collect_list and collect multiple lists