
I have a PySpark dataframe with two id columns, id and id2. Each id is repeated exactly n times, and all ids share the same set of id2 values. I'm trying to "flatten" the matrix resulting from each unique id into one row, keyed by id2.

Here's an example to explain what I'm trying to achieve, my dataframe looks like this:

+----+-----+--------+--------+
| id | id2 | value1 | value2 |
+----+-----+--------+--------+
| 1  | 1   | 54     | 2      |
| 1  | 2   | 0      | 6      |
| 1  | 3   | 578    | 14     |
| 2  | 1   | 10     | 1      |
| 2  | 2   | 6      | 32     |
| 2  | 3   | 0      | 0      |
| 3  | 1   | 12     | 2      |
| 3  | 2   | 20     | 5      |
| 3  | 3   | 63     | 22     |
+----+-----+--------+--------+

The desired output is the following table:

+----+----------+----------+----------+----------+----------+----------+
| id | value1_1 | value1_2 | value1_3 | value2_1 | value2_2 | value2_3 |
+----+----------+----------+----------+----------+----------+----------+
| 1  | 54       | 0        | 578      | 2        | 6        | 14       |
| 2  | 10       | 6        | 0        | 1        | 32       | 0        |
| 3  | 12       | 20       | 63       | 2        | 5        | 22       |
+----+----------+----------+----------+----------+----------+----------+

So, basically, for each unique id and for each value column col, I want n new columns col_1, ..., col_n, one for each of the n id2 values.

Any help would be appreciated!

Achraf Oussidi

1 Answer


In Spark 2.4 you can do it this way:

val df3 = Seq(
  (1, 1, 54, 2),  (1, 2, 0, 6),   (1, 3, 578, 14),
  (2, 1, 10, 1),  (2, 2, 6, 32),  (2, 3, 0, 0),
  (3, 1, 12, 2),  (3, 2, 20, 5),  (3, 3, 63, 22)
).toDF("id", "id2", "value1", "value2")


scala> df3.show()
+---+---+------+------+
| id|id2|value1|value2|
+---+---+------+------+
|  1|  1|    54|     2|
|  1|  2|     0|     6|
|  1|  3|   578|    14|
|  2|  1|    10|     1|
|  2|  2|     6|    32|
|  2|  3|     0|     0|
|  3|  1|    12|     2|
|  3|  2|    20|     5|
|  3|  3|    63|    22|
+---+---+------+------+

Group by id and pivot on id2, using first to pick the single value for each (id, id2) pair (the coalesce wrapper is redundant here, since there is exactly one row per cell, but it does no harm):

scala> var df4 = df3.groupBy("id").pivot("id2").agg(coalesce(first("value1")),coalesce(first("value2"))).orderBy(col("id"))

scala> val newNames = Seq("id","value1_1","value2_1","value1_2","value2_2","value1_3","value2_3")

Renaming the columns:

scala>  df4.toDF(newNames: _*).show()
+---+--------+--------+--------+--------+--------+--------+
| id|value1_1|value2_1|value1_2|value2_2|value1_3|value2_3|
+---+--------+--------+--------+--------+--------+--------+
|  1|      54|       2|       0|       6|     578|      14|
|  2|      10|       1|       6|      32|       0|       0|
|  3|      12|       2|      20|       5|      63|      22|
+---+--------+--------+--------+--------+--------+--------+

Rearrange the columns if needed. Let me know if you have any questions. Happy Hadoop!

Mahesh Gupta