
I have a DataFrame (Java API) with the following simple schema. Here is an example instance:

+-------------------+
| id | key | Value  |
+-------------------+
| 01 | A   | John   |
| 01 | B   | Nick   |
| 02 | A   | Mary   |
| 02 | B   | Kathy  |
| 02 | C   | Sabrina|
| 03 | B   | George |
+-------------------+

I would like to transform it to the following:

+------------------------------+
| id | A    | B      | C       |
+------------------------------+
| 01 | John | Nick   | null    |
| 02 | Mary | Kathy  | Sabrina |
| 03 | null | George | null    |
+------------------------------+

I tried the pivot operator (since this is exactly a pivot), but it only partially worked: once the values A, B, and C become columns, it seemed the contents of the new columns could only be numeric.

Dataset<Row> pivotTest2 = pivotTest.groupBy(col("id")).pivot("key").count();

What I would actually like is to put the value of the Value column in place of the count, i.e., something of the form .select(col("Value")). Even .max("Value") would work fine, but I cannot use it, since Value is not a numeric column.
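For reference, the reshaping being asked for — group the rows by id and spread each key into its own column, with null for missing keys — can be sketched in plain Java with maps (no Spark; the class and method names here are illustrative, not part of any API):

```java
import java.util.*;

public class PivotSketch {
    // Group rows of (id, key, value) by id, mapping key -> value: the pivot itself.
    public static SortedMap<String, Map<String, String>> pivot(String[][] rows) {
        SortedMap<String, Map<String, String>> byId = new TreeMap<>();
        for (String[] r : rows) {
            byId.computeIfAbsent(r[0], k -> new HashMap<>()).put(r[1], r[2]);
        }
        return byId;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"01", "A", "John"}, {"01", "B", "Nick"},
            {"02", "A", "Mary"}, {"02", "B", "Kathy"}, {"02", "C", "Sabrina"},
            {"03", "B", "George"}
        };
        // The distinct keys become the pivoted columns (A, B, C).
        SortedSet<String> keys = new TreeSet<>();
        for (String[] r : rows) keys.add(r[1]);

        // Print one line per id; keys absent for an id come out as null.
        for (Map.Entry<String, Map<String, String>> e : pivot(rows).entrySet()) {
            StringBuilder line = new StringBuilder(e.getKey());
            for (String k : keys) line.append(" | ").append(e.getValue().get(k));
            System.out.println(line);
        }
    }
}
```

This is only the semantics; Spark's pivot additionally needs an aggregate to pick a value when an (id, key) pair occurs more than once.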

philantrovert
John Bell

1 Answer


Doing the following should work for you:

import static org.apache.spark.sql.functions.*;

Dataset<Row> pivotTest2 = pivotTest.groupBy(col("id")).pivot("key").agg(first("Value"));
pivotTest2.show(false);

which should give you (row order is not guaranteed):

+---+----+------+-------+
|id |A   |B     |C      |
+---+----+------+-------+
|01 |John|Nick  |null   |
|03 |null|George|null   |
|02 |Mary|Kathy |Sabrina|
+---+----+------+-------+
Ramesh Maharjan