
I have a requirement to use pivot on the table below.

id,week,score
102,1,96
101,1,138
102,1,37
101,1,59
101,2,282
102,2,212
102,2,78
101,2,97
102,3,60
102,3,123
101,3,220
101,3,87

Expected output:

      1        2        3
101   138,59   282,97   220,87
102   96,37    212,78   123,60

Here I need the scores sorted within each cell.

I have tried the code below, but it only works when there is a single record for a particular id:

df.groupBy("id").pivot("week").agg(first("score"))
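For reference, a minimal sketch of the sample data above as a DataFrame (Scala, assuming a SparkSession is available as spark; the variable name df matches the snippet above):

import spark.implicits._

// sample data from the question as (id, week, score)
val df = Seq(
  (102, 1, 96), (101, 1, 138), (102, 1, 37), (101, 1, 59),
  (101, 2, 282), (102, 2, 212), (102, 2, 78), (101, 2, 97),
  (102, 3, 60), (102, 3, 123), (101, 3, 220), (101, 3, 87)
).toDF("id", "week", "score")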
    Possible duplicate of [How to pivot DataFrame?](https://stackoverflow.com/questions/30244910/how-to-pivot-dataframe) – koiralo Dec 28 '17 at 13:33
  • No Shankar, here I am expecting all records as an array. Please check the expected output. – LUZO Dec 28 '17 at 17:06

2 Answers


Instead of first, you should collect all the values using collect_list; this will give you the result as a list:

import org.apache.spark.sql.functions._

df.groupBy("id").pivot("week").agg(collect_list("score")).show()

Output:

+---+---------+---------+---------+
|id |1        |2        |3        |
+---+---------+---------+---------+
|101|[138, 59]|[282, 97]|[220, 87]|
|102|[96, 37] |[212, 78]|[60, 123]|
+---+---------+---------+---------+
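If the scores also need to be sorted inside each list (the question's expected output shows 123,60 for id 102 in week 3, while collect_list here returns [60, 123]), a minimal sketch is to wrap the collected list in sort_array; asc = false is an assumption based on the descending order shown in the expected output:

import org.apache.spark.sql.functions._

// sort each collected list; sort_array is ascending by default,
// so pass asc = false for descending order
df.groupBy("id")
  .pivot("week")
  .agg(sort_array(collect_list("score"), asc = false))
  .show(false)

Here show(false) only prevents the array columns from being truncated in the console output.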
Prasad Khode

The equivalent PySpark (Python) code for the Scala answer posted by Prasad Khode is as below:

from pyspark.sql import functions as F
df.groupBy("id").pivot("week").agg(F.collect_list("score")).show()

If you check the API documentation, you can see:

collect_list(Column e)
Aggregate function: returns a list of objects with duplicates.

You can use collect_set as well, which will give you the same output but with duplicate elements removed:

df.groupBy("id").pivot("week").agg(F.collect_set("score")).show()

The API documentation says the following:

collect_set(Column e)
Aggregate function: returns a set of objects with duplicate elements eliminated.

Ramesh Maharjan
  • Thanks for your input Ramesh. Is there any chance of getting the same output using a SQL query in PySpark? I have used this query, which shows an error: "select id,list((case when week = 1 then score else null end) as `1`,','),list((case when week = 2 then score else null end) as `2`,','),list((case when week = 3 then score else null end) as `3`,',') from d group by id" – LUZO Dec 29 '17 at 06:09
  • @LUZO, I am afraid I can't be much help on that, as SQL queries are nightmares for me. :) You can post another question with more detail explaining your requirements; you will certainly get an answer. – Ramesh Maharjan Dec 29 '17 at 06:15