4

I'm trying to convert values from row into different columns with its value from another column. For example -

Input dataframe is like -

+-----------+
| X | Y | Z |
+-----------+
| 1 | A | a |
| 2 | A | b |
| 3 | A | c |
| 1 | B | d |
| 3 | B | e |
| 2 | C | f |
+-----------+

And the output dataframe should be like this -

+------------------------+
| Y | 1    | 2    | 3    |
+------------------------+
| A | a    | b    | c    |
| B | d    | null | e    |
| C | null | f    | null |
+------------------------+

I've tried to groupBy the values based on Y and collect_list on X and Z and then zipped X & Z together to get some sort of key-value pairs. But some Xs may be missing for some values of Y so in order to fill them with null values, I cross joined all possible values of X and all possible values of Y and then joined it my original dataframe. This is approach is highly inefficient.

Is there any efficient method to approach this problem ? Thanks in advance.

Ishan
  • 996
  • 3
  • 13
  • 34

1 Answers1

6

You can simply use groupBy with pivot and first as aggregate function as

import org.apache.spark.sql.functions._
df.groupBy("Y").pivot("X").agg(first("z")) 

Output:

+---+----+----+----+
|Y  |1   |2   |3   |
+---+----+----+----+
|B  |d   |null|e   |
|C  |null|f   |null|
|A  |a   |b   |c   |
+---+----+----+----+
koiralo
  • 22,594
  • 6
  • 51
  • 72