6

How can I transpose the following data frame in PySpark?

The idea is to achieve the result that appears below.

import pandas as pd

d = {'id' : pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'place' : pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'value' : pd.Series([10, 30, 20, 10, 30, 20, 10, 30, 20], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'attribute' : pd.Series(['size', 'height', 'weigth', 'size', 'height', 'weigth', 'size', 'height', 'weigth'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])}

df = pd.DataFrame(d)
print(df)

   id place  value attribute
a   1     A     10      size
b   1     A     30    height
c   1     A     20    weigth
d   2     A     10      size
e   2     A     30    height
f   2     A     20    weigth
g   3     A     10      size
h   3     A     30    height
i   3     A     20    weigth

d = {'id' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'place' : pd.Series(['A', 'A', 'A'], index=['a', 'b', 'c']),
     'size' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
     'height' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
     'weigth' : pd.Series([10, 30, 20], index=['a', 'b', 'c'])}

df = pd.DataFrame(d)
print(df)

   id place  size  height  weigth
a   1     A    10      10      10
b   2     A    30      30      30
c   3     A    20      20      20

Any help is welcome. Thank you very much in advance.

  • Possible duplicate of [How to pivot DataFrame?](https://stackoverflow.com/questions/30244910/how-to-pivot-dataframe) – 10465355 Nov 24 '18 at 10:55

2 Answers

7

First of all, I don't think your sample output is correct. Your input data has size set to 10, height set to 30 and weigth set to 20 for every id, but the desired output sets everything to 10 for id 1. If this is really what you want, please explain it a bit more. If it was a mistake, then you want to use the pivot function. Example:

from pyspark.sql.functions import first

# recreate the sample data as a Spark DataFrame
l = [(1, 'A', 10, 'size'),
     (1, 'A', 30, 'height'),
     (1, 'A', 20, 'weigth'),
     (2, 'A', 10, 'size'),
     (2, 'A', 30, 'height'),
     (2, 'A', 20, 'weigth'),
     (3, 'A', 10, 'size'),
     (3, 'A', 30, 'height'),
     (3, 'A', 20, 'weigth')]

df = spark.createDataFrame(l, ['id', 'place', 'value', 'attribute'])

# turn each attribute into its own column, keeping the first value per id/place
df.groupBy(df.id, df.place).pivot('attribute').agg(first("value")).show()

+---+-----+------+----+------+ 
| id|place|height|size|weigth|
+---+-----+------+----+------+ 
|  2|    A|    30|  10|    20| 
|  3|    A|    30|  10|    20| 
|  1|    A|    30|  10|    20|
+---+-----+------+----+------+
cronoik
  • Thank you! that is what I was looking for! – lolo Nov 26 '18 at 00:36
  • is there a way to replace the null values with 0? My output has null values – nak5120 Jan 02 '20 at 21:06
  • Sure [fillna](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna). – cronoik Jan 05 '20 at 13:23
  • 1
    In my case where I was transposing a df with 100s of columns, I needed to increase the max allowed pivot vals with this command: spark.conf.set('spark.sql.pivotMaxValues', u'1000') – Jariani Jun 09 '22 at 17:07
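
Following up on the comments above: nulls show up when an id/place pair has no row for one of the pivoted attributes, and Spark first has to scan for the distinct attribute values unless you list them yourself. Below is a minimal sketch combining both points, assuming the same df and spark session as in the answer above; the explicit value list and the 0 fill are illustrative choices, not part of the original answer.

from pyspark.sql.functions import first

# passing the pivot values explicitly skips the distinct-value scan
# (and the spark.sql.pivotMaxValues limit mentioned in the comments);
# fillna(0) replaces nulls left by missing id/attribute combinations
pivoted = (df.groupBy("id", "place")
             .pivot("attribute", ["size", "height", "weigth"])
             .agg(first("value"))
             .fillna(0))
pivoted.show()
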
1

Refer to the documentation. Pivoting is always done in the context of an aggregation, and I have chosen sum here. So, if there are multiple values for the same id, place and attribute combination, their sum will be taken. You could use min, max or mean as well, depending on what you need.

df = df.groupBy(["id","place"]).pivot("attribute").sum("value")

This link also addresses the same question.
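
For example, if duplicate rows should be averaged instead of summed, only the aggregation changes; a minimal sketch, assuming df is still the long-format frame built in the first answer:

# same pivot, but taking the mean of duplicate values per id/place/attribute
df_mean = df.groupBy(["id", "place"]).pivot("attribute").mean("value")
df_mean.show()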

cph_sto