Convert row values into columns with its value from another column in spark scala

Question

I'm trying to convert values from row into different columns with its value from another column. For example -

Input dataframe is like -

+-----------+
| X | Y | Z |
+-----------+
| 1 | A | a |
| 2 | A | b |
| 3 | A | c |
| 1 | B | d |
| 3 | B | e |
| 2 | C | f |
+-----------+

And the output dataframe should be like this -

+------------------------+
| Y | 1    | 2    | 3    |
+------------------------+
| A | a    | b    | c    |
| B | d    | null | e    |
| C | null | f    | null |
+------------------------+

I've tried to groupBy the values based on Y and collect_list on X and Z and then zipped X & Z together to get some sort of key-value pairs. But some Xs may be missing for some values of Y so in order to fill them with null values, I cross joined all possible values of X and all possible values of Y and then joined it my original dataframe. This is approach is highly inefficient.

Is there any efficient method to approach this problem ? Thanks in advance.

Specifically with `first` aggregate function. – Alper t. Turker May 15 '18 at 10:46 — Alper t. Turker, May 15 '18 at 10:46

score 6 · Accepted Answer · answered May 15 '18 at 10:54

6

You can simply use groupBy with pivot and first as aggregate function as

import org.apache.spark.sql.functions._
df.groupBy("Y").pivot("X").agg(first("z"))

Output:

+---+----+----+----+
|Y  |1   |2   |3   |
+---+----+----+----+
|B  |d   |null|e   |
|C  |null|f   |null|
|A  |a   |b   |c   |
+---+----+----+----+

answered May 15 '18 at 10:54

koiralo

22,594
6
51
72

It worked. Thanks. – Ishan May 15 '18 at 10:56
Sure, accepted it. – Ishan May 15 '18 at 18:58
df.groupBy("Y").pivot("X").agg(first("Z")) – Quetzalcoatl Aug 15 '18 at 06:30
1

Works with `pyspark` API as well – trianta2 Feb 16 '23 at 21:22

Convert row values into columns with its value from another column in spark scala

1 Answers1

Linked