
I have a PySpark dataframe like this:

cust_id prod
      1    A
      1    B
      1    C
      2    D
      2    E
      2    F

Desired Output:

cust_id   prod
      1  A/B/C
      2  D/E/F

Using pandas, I am able to do it like this:

import numpy as np

# Collect each customer's products into a single array
T = df.groupby(['cust_id'])['prod'].apply(lambda x: np.hstack(x)).reset_index()

# Join the elements of a list with '/' (equivalent to '/'.join(ls))
def func_x(ls):
    n = len(ls)
    s = ''
    for i in range(n):
        if n - i == 1:
            s = s + ls[i]
        else:
            s = s + ls[i] + '/'
    return s

T['prod1'] = T['prod'].apply(func_x)

What will be this code's equivalent in PySpark?

muni
  • you need to use `collect_list` and `concat_ws`. Something like `df.groupBy("cust_id").agg(concat_ws("/", collect_list("prod")))` - let me find a dupe – pault Jul 18 '19 at 13:40

1 Answer

import pyspark.sql.functions as F

separator = '/'
# collect_list gathers each customer's prod values into an array,
# then concat_ws joins the array elements with the separator
T = df.groupby('cust_id').agg(F.concat_ws(separator, F.collect_list(df.prod)))
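For reference, here is a self-contained sketch that recreates the question's dataframe and shows the expected output, assuming a SparkSession is available (the result variable name and the alias are illustrative):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the example dataframe from the question
df = spark.createDataFrame(
    [(1, 'A'), (1, 'B'), (1, 'C'), (2, 'D'), (2, 'E'), (2, 'F')],
    ['cust_id', 'prod'])

# collect_list gathers each group's prod values into an array;
# concat_ws joins the array elements with '/'
result = df.groupBy('cust_id').agg(
    F.concat_ws('/', F.collect_list('prod')).alias('prod'))

result.show()
# +-------+-----+
# |cust_id| prod|
# +-------+-----+
# |      1|A/B/C|
# |      2|D/E/F|
# +-------+-----+

Note that collect_list does not guarantee element order after the shuffle; if the A/B/C ordering matters, sort within each group first (for example with sort_array, or a window ordered by a sequencing column).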
Sequinex