GroupBy and flatten pyspark dataframe based on column value

Question

I have a dataframe that looks like this:

amount1 amount2 period  customerId
3.0     4.0     weekly   12345
23.0    26.0    monthly  12345
201.0   190.0   yearly    12345
3.1     4.2     weekly   56789
21.2    36.5    monthly  56789
231.0   191.0   yearly    56789

I'd like to flatten it like this :

weekly_amt1 weekly_amt2 monthly_amt1 monthly_amt2 yearly_amt1 yearly_amt2 customerId
3.0         4.0         23.0         26.0         201.0       190.0       12345
3.1         4.2         21.2         36.5         231.0       191.0       56789

What would be the most efficient way to do this ? I know that there is a pivot function, but it doesn't seem to handle pivoting on more than 1 column, and since it's expensive I don't want to pivot more than once.

Not really, how do I get the column names ? – femibyte Jul 05 '18 at 09:45 — femibyte, Jul 05 '18 at 09:45

score 0 · Answer 1 · answered Jul 05 '18 at 11:54

I've been able to find an answer:

def pivot_mult(df, cols):
    mydf = df.select('customerId').drop_duplicates()
    for c in cols:
        mydf = mydf.join(df.withColumn('combcol',\
               F.concat(F.lit('{}_'.format(c)),df['period']))\
                .groupby('customerId').pivot('combcol').\
                         agg(F.first(c)),'customerId')
    return mydf

GroupBy and flatten pyspark dataframe based on column value

1 Answers1