2

I'm not sure of a good way to phrase the question, but an example will help. Here is the dataframe that I have with the columns: name, type, and count:

+------+------+-------+
| Name | Type | Count |
+------+------+-------+
| a    |    0 |     5 |
| a    |    1 |     4 |
| a    |    5 |     5 |
| a    |    4 |     5 |
| a    |    2 |     1 |
| b    |    0 |     2 |
| b    |    1 |     4 |
| b    |    3 |     5 |
| b    |    4 |     5 |
| b    |    2 |     1 |
| c    |    0 |     5 |
| c    |  ... |   ... |
+------+------+-------+

I want to get a new dataframe structured like this where the Type column values have become new columns:

+------+---+-----+---+---+---+---+
| Name | 0 |  1  | 2 | 3 | 4 | 5 |  <- Number columns are types from input
+------+---+-----+---+---+---+---+
| a    | 5 |   4 | 1 | 0 | 5 | 5 |
| b    | 2 |   4 | 1 | 5 | 5 | 0 |
| c    | 5 | ... |   |   |   |   |
+------+---+-----+---+---+---+---+

The columns here are [Name,0,1,2,3,4,5].

zero323
  • 322,348
  • 103
  • 959
  • 935
braden.groom
  • 305
  • 2
  • 5
  • 16

1 Answers1

2

Do this by using the pivot function in Spark.

val df2 = df.groupBy("Name").pivot("Type").sum("Count")

Here, if the name and the type is the same for two rows, the count values are simply added together, but other aggregations are possible as well.

Resulting dataframe when using the example data in the question:

+----+---+----+----+----+----+----+
|Name|  0|   1|   2|   3|   4|   5|
+----+---+----+----+----+----+----+
|   c|  5|null|null|null|null|null|
|   b|  2|   4|   1|   5|   5|null|
|   a|  5|   4|   1|null|   5|   5|
+----+---+----+----+----+----+----+
Shaido
  • 27,497
  • 23
  • 70
  • 73