How to take the values recorded more than once for the same ID and put each value in its own column

Question

I have data like the sample below. For each ID, I want to take the values from one column and put each of them in its own new column.

actual         

ID  Brandid  
1   234      
1   122      
1   134      
2   122
3   234
3   122


Expected

ID BRANDID_1  BRANDID_2  BRANDID_3
1     234       122         134
2     122        -           -
3     234       122          -

Possible duplicate of [How to pivot Spark DataFrame?](https://stackoverflow.com/questions/30244910/how-to-pivot-spark-dataframe) — pault, Jun 12 '19 at 16:36

Ben.T · Answer 1 · 2019-06-14T13:17:21.873

You can use pivot after a groupBy, but first create a column holding the future column name, using row_number over a Window to get a monotonically increasing number per ID. Here is one way:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# create the window on ID; row_number needs an orderBy on the window,
# so order by a constant, F.lit(1) (this usually keeps the incoming order, but Spark does not guarantee it)
w = Window.partitionBy('ID').orderBy(F.lit(1)) 

# build the name of the future column (Brandid_1, Brandid_2, ...) to pivot on
pv_df = (df.withColumn('pv', F.concat(F.lit('Brandid_'), F.row_number().over(w).cast('string')))
           # group by ID and pivot on the new column
           .groupBy('ID').pivot('pv')
           # pivot needs an aggregation; each (ID, pv) pair holds a single value, so first is enough
           .agg(F.first('Brandid')))

and you get

pv_df.show()
+---+---------+---------+---------+
| ID|Brandid_1|Brandid_2|Brandid_3|
+---+---------+---------+---------+
|  1|      234|      122|      134|
|  3|      234|      122|     null|
|  2|      122|     null|     null|
+---+---------+---------+---------+
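To see what the row_number plus pivot step computes without a Spark session, here is a minimal plain-Python sketch of the same reshaping on the sample data. It only illustrates the logic and is not how Spark runs the query; the function name `pivot_by_position` and the list-of-tuples input are assumptions made for this example.

```python
from collections import defaultdict

def pivot_by_position(rows, prefix="Brandid_"):
    """Number each value within its ID (like row_number) and spread them into columns."""
    per_id = defaultdict(list)
    for id_, brand in rows:
        per_id[id_].append(brand)  # input order stands in for the window order

    # the ID with the most values decides how many columns are needed
    width = max((len(v) for v in per_id.values()), default=0)
    columns = [f"{prefix}{i}" for i in range(1, width + 1)]

    table = {}
    for id_, brands in per_id.items():
        padded = brands + [None] * (width - len(brands))  # missing cells become None
        table[id_] = dict(zip(columns, padded))
    return columns, table

rows = [(1, 234), (1, 122), (1, 134), (2, 122), (3, 234), (3, 122)]
columns, table = pivot_by_position(rows)
print(columns)
for id_ in sorted(table):
    print(id_, table[id_])
```

IDs with fewer values get None in the trailing columns, which matches the null cells in the Spark output above.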

EDIT: to get the columns in the order the OP requested, you can zero-pad the number with lpad. First define the width you want for the number:

nb_pad = 3

and, in the method above, replace F.concat(F.lit('Brandid_'), F.row_number().over(w).cast('string')) with

F.concat(F.lit('Brandid_'), F.lpad(F.row_number().over(w).cast('string'), nb_pad, "0"))

and if you don't know how many digits you need (here the numbers are padded to a total length of 3), you can compute the width from the largest number of rows per ID:

nb_pad = len(str(df.groupBy('ID').count().select(F.max('count')).collect()[0][0]))
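The ordering problem arises because the pivoted column names are sorted as strings, so Brandid_10 comes before Brandid_2. A small plain-Python sketch (no Spark needed) shows the effect and how zero-padding to the width of the largest number fixes it; the helper name `padded_names` is hypothetical and exists only for this illustration.

```python
def padded_names(n, prefix="Brandid_"):
    """Return the names for 1..n, zero-padded to the number of digits in n."""
    width = len(str(n))  # same idea as len(str(<max count per ID>))
    return [f"{prefix}{str(i).zfill(width)}" for i in range(1, n + 1)]

# unpadded names: a string sort puts 10, 11, 12 right after 1
plain = [f"Brandid_{i}" for i in range(1, 13)]
print(sorted(plain)[:4])

# padded names: a string sort now agrees with numeric order
padded = padded_names(12)
print(sorted(padded) == padded)
```

With 199 values per ID, as in the comment below, the width is 3 and the names run from Brandid_001 to Brandid_199.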

Thanks, I got it, but I am getting the column schema as below: Brandid_1, Brandid_10, Brandid_100..Brandid_199, Brandid_2, Brandid_20... — ElangoJK Jaganathan Kandammal, Jun 13 '19 at 13:19
@ElangoJKJaganathanKandammal you mean the columns are not in the correct order? — Ben.T, Jun 13 '19 at 13:32
@ElangoJKJaganathanKandammal I edited the answer to put the columns in order — Ben.T, Jun 14 '19 at 13:18
