6

How can I transpose the following data frame in PySpark?

The idea is to achieve the result that appears below.

import pandas as pd

d = {'id' : pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'place' : pd.Series(['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'value' : pd.Series([10, 30, 20, 10, 30, 20, 10, 30, 20], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
     'attribute' : pd.Series(['size', 'height', 'weigth', 'size', 'height', 'weigth', 'size', 'height', 'weigth'], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])}

df = pd.DataFrame(d)
print(df)

   id place  value attribute
a   1     A     10      size
b   1     A     30    height
c   1     A     20    weigth
d   2     A     10      size
e   2     A     30    height
f   2     A     20    weigth
g   3     A     10      size
h   3     A     30    height
i   3     A     20    weigth

d = {'id' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'place' : pd.Series(['A', 'A', 'A'], index=['a', 'b', 'c']),
     'size' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
     'height' : pd.Series([10, 30, 20], index=['a', 'b', 'c']),
     'weigth' : pd.Series([10, 30, 20], index=['a', 'b', 'c'])}

df = pd.DataFrame(d)
print(df)

   id place  size  height  weigth
a   1     A    10      10      10
b   2     A    30      30      30
c   3     A    20      20      20

Any help is welcome. Thank you very much in advance.

  • Possible duplicate of [How to pivot DataFrame?](https://stackoverflow.com/questions/30244910/how-to-pivot-dataframe) – 10465355 Nov 24 '18 at 10:55

2 Answers

7

First of all, I don't think your sample output is correct. Your input data has size set to 10, height set to 30 and weigth set to 20 for every id, but the desired output sets everything to 10 for id 1. If this is really what you want, please explain it a bit more. If it was a mistake, then you want to use the pivot function. Example:

from pyspark.sql.functions import first

# recreate the sample data as a Spark DataFrame
l = [(1, 'A', 10, 'size'),
     (1, 'A', 30, 'height'),
     (1, 'A', 20, 'weigth'),
     (2, 'A', 10, 'size'),
     (2, 'A', 30, 'height'),
     (2, 'A', 20, 'weigth'),
     (3, 'A', 10, 'size'),
     (3, 'A', 30, 'height'),
     (3, 'A', 20, 'weigth')]

df = spark.createDataFrame(l, ['id', 'place', 'value', 'attribute'])

# turn each attribute into its own column, keeping the first value per id/place
df.groupBy(df.id, df.place).pivot('attribute').agg(first("value")).show()

+---+-----+------+----+------+ 
| id|place|height|size|weigth|
+---+-----+------+----+------+ 
|  2|    A|    30|  10|    20| 
|  3|    A|    30|  10|    20| 
|  1|    A|    30|  10|    20|
+---+-----+------+----+------+
cronoik
  • Thank you! that is what I was looking for! – lolo Nov 26 '18 at 00:36
  • is there a way to replace the null values with 0? My output has null values – nak5120 Jan 02 '20 at 21:06
  • Sure [fillna](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna). – cronoik Jan 05 '20 at 13:23
  • 1
    In my case where I was transposing a df with 100s of columns, I needed to increase the max allowed pivot vals with this command: spark.conf.set('spark.sql.pivotMaxValues', u'1000') – Jariani Jun 09 '22 at 17:07
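
Following up on the comments above: nulls show up when an id/place pair has no row for one of the pivoted attributes, and Spark first has to scan for the distinct attribute values unless you list them yourself. Below is a minimal sketch combining both points, assuming the same df and spark session as in the answer above; the explicit value list and the 0 fill are illustrative choices, not part of the original answer.

from pyspark.sql.functions import first

# passing the pivot values explicitly skips the distinct-value scan
# (and the spark.sql.pivotMaxValues limit mentioned in the comments);
# fillna(0) replaces nulls left by missing id/attribute combinations
pivoted = (df.groupBy("id", "place")
             .pivot("attribute", ["size", "height", "weigth"])
             .agg(first("value"))
             .fillna(0))
pivoted.show()
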
1

Refer to the documentation. Pivoting is always done in the context of an aggregation, and I have chosen sum here. So, if there are multiple values for the same id, place and attribute combination, their sum will be taken. You could use min, max or mean as well, depending on what you need.

df = df.groupBy(["id","place"]).pivot("attribute").sum("value")

This link also addresses the same question.
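
For example, if duplicate rows should be averaged instead of summed, only the aggregation changes; a minimal sketch, assuming df is still the long-format frame built in the first answer:

# same pivot, but taking the mean of duplicate values per id/place/attribute
df_mean = df.groupBy(["id", "place"]).pivot("attribute").mean("value")
df_mean.show()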

cph_sto