4

I have input dataframe as below with id, app, and customer

Input dataframe

+--------------------+-----+---------+
|                  id|app  |customer |
+--------------------+-----+---------+
|id1                 |   fw|     WM  |
|id1                 |   fw|     CS  |
|id2                 |   fw|     CS  |
|id1                 |   fe|     WM  |
|id3                 |   bc|     TR  |
|id3                 |   bc|     WM  |
+--------------------+-----+---------+

Expected output

Using pivot and aggregate - make app values as column name and put aggregated customer names as list in the dataframe

Expected dataframe

+--------------------+----------+-------+----------+
|                  id|   bc     |     fe|    fw    |
+--------------------+----------+-------+----------+
|id1                 |  0       |     WM|   [WM,CS]|
|id2                 |  0       |      0|   [CS]   |
|id3                 | [TR,WM]  |      0|      0   |
+--------------------+----------+-------+----------+

What have i tried ?

val newDF = df.groupBy("id").pivot("app").agg(expr("coalesce(first(customer),0)")).drop("app").show()

+--------------------+-----+-------+------+
|                  id|bc   |     fe|    fw|
+--------------------+-----+-------+------+
|id1                 |  0  |     WM|    WM|
|id2                 |  0  |      0|    CS|
|id3                 | TR  |      0|     0|
+--------------------+-----+-------+------+

Issue : In my query , i am not able to get the list of customer like [WM,CS] for "id1" under "fw" (as shown in expected output) , only "WM" is coming. Similarly, for "id3" only "TR" is appearing - instead a list should appear with value [TR,WM] under "bc" for "id3"

Need your suggestion to get the list of customer under each app respectively.

marios
  • 8,874
  • 3
  • 38
  • 62
Debaditya
  • 2,419
  • 1
  • 27
  • 46

2 Answers2

4

You can use collect_list if you can bear with an empty List at cells where it should be zero:

df.groupBy("id").pivot("app").agg(collect_list("customer")).show
+---+--------+----+--------+
| id|      bc|  fe|      fw|
+---+--------+----+--------+
|id3|[TR, WM]|  []|      []|
|id1|      []|[WM]|[CS, WM]|
|id2|      []|  []|    [CS]|
+---+--------+----+--------+
Psidom
  • 209,562
  • 33
  • 339
  • 356
4

Using CONCAT_WS we can explode array and can remove the square brackets.

df.groupBy("id").pivot("app").agg(concat_ws(",",collect_list("customer")))
whackamadoodle3000
  • 6,684
  • 4
  • 27
  • 44
hd16
  • 262
  • 2
  • 4