
I have a PySpark dataframe like this:

cust_id prod
      1    A
      1    B
      1    C
      2    D
      2    E
      2    F

Desired Output:

cust_id   prod
      1  A/B/C
      2  D/E/F

Using pandas, I am able to do it like this:

import numpy as np

# Collect each customer's products into a single array
T = df.groupby(['cust_id'])['prod'].apply(lambda x: np.hstack(x)).reset_index()

# Join the elements of a list with '/' (equivalent to '/'.join(ls))
def func_x(ls):
    n = len(ls)
    s = ''
    for i in range(n):
        if n - i == 1:
            s = s + ls[i]
        else:
            s = s + ls[i] + '/'
    return s

T['prod1'] = T['prod'].apply(func_x)

What will be this code's equivalent in PySpark?

muni
  • you need to use `collect_list` and `concat_ws`. Something like `df.groupBy("cust_id").agg(concat_ws("/", collect_list("prod")))` - let me find a dupe – pault Jul 18 '19 at 13:40

1 Answer

import pyspark.sql.functions as F

separator = '/'
# collect_list gathers each customer's prod values into an array,
# then concat_ws joins the array elements with the separator
T = df.groupby('cust_id').agg(F.concat_ws(separator, F.collect_list(df.prod)))
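For reference, here is a self-contained sketch that recreates the question's dataframe and shows the expected output, assuming a SparkSession is available (the result variable name and the alias are illustrative):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the example dataframe from the question
df = spark.createDataFrame(
    [(1, 'A'), (1, 'B'), (1, 'C'), (2, 'D'), (2, 'E'), (2, 'F')],
    ['cust_id', 'prod'])

# collect_list gathers each group's prod values into an array;
# concat_ws joins the array elements with '/'
result = df.groupBy('cust_id').agg(
    F.concat_ws('/', F.collect_list('prod')).alias('prod'))

result.show()
# +-------+-----+
# |cust_id| prod|
# +-------+-----+
# |      1|A/B/C|
# |      2|D/E/F|
# +-------+-----+

Note that collect_list does not guarantee element order after the shuffle; if the A/B/C ordering matters, sort within each group first (for example with sort_array, or a window ordered by a sequencing column).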
Sequinex