
I have a requirement to use pivot on the table below.

id,week,score
102,1,96
101,1,138
102,1,37
101,1,59
101,2,282
102,2,212
102,2,78
101,2,97
102,3,60
102,3,123
101,3,220
101,3,87

Expected output:

      1        2        3
101   138,59   282,97   220,87
102   96,37    212,78   123,60

Here I need the scores sorted within each cell.

I have tried the code below, but it only works when there is a single record for a particular id:

df.groupBy("id").pivot("week").agg(first("score"))
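For reference, a minimal sketch of the sample data above as a DataFrame (Scala, assuming a SparkSession is available as spark; the variable name df matches the snippet above):

import spark.implicits._

// sample data from the question as (id, week, score)
val df = Seq(
  (102, 1, 96), (101, 1, 138), (102, 1, 37), (101, 1, 59),
  (101, 2, 282), (102, 2, 212), (102, 2, 78), (101, 2, 97),
  (102, 3, 60), (102, 3, 123), (101, 3, 220), (101, 3, 87)
).toDF("id", "week", "score")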
    Possible duplicate of [How to pivot DataFrame?](https://stackoverflow.com/questions/30244910/how-to-pivot-dataframe) – koiralo Dec 28 '17 at 13:33
  • No Shankar, here I am expecting all records as an array. Please check the expected output. – LUZO Dec 28 '17 at 17:06

2 Answers


Instead of first, you should collect all the values using collect_list; this will give you the result as a list:

import org.apache.spark.sql.functions._

df.groupBy("id").pivot("week").agg(collect_list("score")).show()

Output:

+---+---------+---------+---------+
|id |1        |2        |3        |
+---+---------+---------+---------+
|101|[138, 59]|[282, 97]|[220, 87]|
|102|[96, 37] |[212, 78]|[60, 123]|
+---+---------+---------+---------+
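If the scores also need to be sorted inside each list (the question's expected output shows 123,60 for id 102 in week 3, while collect_list here returns [60, 123]), a minimal sketch is to wrap the collected list in sort_array; asc = false is an assumption based on the descending order shown in the expected output:

import org.apache.spark.sql.functions._

// sort each collected list; sort_array is ascending by default,
// so pass asc = false for descending order
df.groupBy("id")
  .pivot("week")
  .agg(sort_array(collect_list("score"), asc = false))
  .show(false)

Here show(false) only prevents the array columns from being truncated in the console output.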
Prasad Khode

The equivalent PySpark (Python) code for the Scala answer posted by Prasad Khode is as below:

from pyspark.sql import functions as F
df.groupBy("id").pivot("week").agg(F.collect_list("score")).show()

If you check the API documentation, you can see:

collect_list(Column e)
Aggregate function: returns a list of objects with duplicates.

You can use collect_set as well, which will give you the same output but with duplicate elements removed:

df.groupBy("id").pivot("week").agg(F.collect_set("score")).show()

The API documentation says the following:

collect_set(Column e)
Aggregate function: returns a set of objects with duplicate elements eliminated.

Ramesh Maharjan
  • Thanks for your input Ramesh. Is there any chance of getting the same output using a SQL query in PySpark? I have used this query, which shows an error: "select id,list((case when week = 1 then score else null end) as `1`,','),list((case when week = 2 then score else null end) as `2`,','),list((case when week = 3 then score else null end) as `3`,',') from d group by id" – LUZO Dec 29 '17 at 06:09
  • @LUZO, I am afraid I can't be much help on that, as SQL queries are nightmares for me. :) You can post another question with more detail explaining your requirements; you will certainly get an answer. – Ramesh Maharjan Dec 29 '17 at 06:15