
I'm trying to transpose some rows of my table into columns. I found this previous post: Transpose column to row with Spark

I actually want to go the opposite way. Initially, I have:

+-----+--------+-----------+   
|  A  | col_id | col_value |  
+-----+--------+-----------+  
|  1  |   col_1|        0.0|  
|  1  |   col_2|        0.6|       
|  ...|     ...|        ...|        
|  2  |   col_1|        0.6|  
|  2  |   col_2|        0.7|   
|  ...|     ...|        ...|      
|  3  |   col_1|        0.5|  
|  3  |   col_2|        0.9|  
|  ...|     ...|        ...|  

And what I want is:

+-----+-----+-----+-------+  
|  A  |col_1|col_2|col_...|  
+-----+-----+-----+-------+  
|  1  |  0.0|  0.6|  ...  |  
|  2  |  0.6|  0.7|  ...  |  
|  3  |  0.5|  0.9|  ...  |  
|  ...|  ...|  ...|  ...  |  

How can I do it? Thanks!


1 Answer


You can use 'when' to emulate an SQL CASE-like expression. With it you redistribute the data over new columns: while building 'col_1', if 'col_id' is anything other than 'col_1' you simply put 0. After that, a simple sum per value of 'A' reduces the number of rows.

from pyspark.sql import functions as F

# Spread col_value into one column per col_id value; rows belonging
# to another col_id contribute 0
df2 = df.select(df.A,
                F.when(df.col_id == 'col_1', df.col_value).otherwise(0).alias('col_1'),
                F.when(df.col_id == 'col_2', df.col_value).otherwise(0).alias('col_2'))

# Collapse the rows for each A by summing each new column
df2.groupBy('A').agg(F.sum('col_1').alias('col_1'),
                     F.sum('col_2').alias('col_2')).show()