
I have a PySpark dataframe with two id columns, id and id2. Each id is repeated exactly n times, and all ids share the same set of id2 values. I'm trying to "flatten" the matrix resulting from each unique id into one row, keyed by id2.

Here's an example to explain what I'm trying to achieve, my dataframe looks like this:

+----+-----+--------+--------+
| id | id2 | value1 | value2 |
+----+-----+--------+--------+
| 1  | 1   | 54     | 2      |
| 1  | 2   | 0      | 6      |
| 1  | 3   | 578    | 14     |
| 2  | 1   | 10     | 1      |
| 2  | 2   | 6      | 32     |
| 2  | 3   | 0      | 0      |
| 3  | 1   | 12     | 2      |
| 3  | 2   | 20     | 5      |
| 3  | 3   | 63     | 22     |
+----+-----+--------+--------+

The desired output is the following table:

+----+----------+----------+----------+----------+----------+----------+
| id | value1_1 | value1_2 | value1_3 | value2_1 | value2_2 | value2_3 |
+----+----------+----------+----------+----------+----------+----------+
| 1  | 54       | 0        | 578      | 2        | 6        | 14       |
| 2  | 10       | 6        | 0        | 1        | 32       | 0        |
| 3  | 12       | 20       | 63       | 2        | 5        | 22       |
+----+----------+----------+----------+----------+----------+----------+

So, basically, for each unique id and for each value column col, I want n new columns col_1, ..., col_n, one for each of the n id2 values.

Any help would be appreciated!

Achraf Oussidi

1 Answer


In Spark 2.4 you can do it this way:

val df3 = Seq(
  (1, 1, 54, 2),  (1, 2, 0, 6),   (1, 3, 578, 14),
  (2, 1, 10, 1),  (2, 2, 6, 32),  (2, 3, 0, 0),
  (3, 1, 12, 2),  (3, 2, 20, 5),  (3, 3, 63, 22)
).toDF("id", "id2", "value1", "value2")


scala> df3.show()
+---+---+------+------+
| id|id2|value1|value2|
+---+---+------+------+
|  1|  1|    54|     2|
|  1|  2|     0|     6|
|  1|  3|   578|    14|
|  2|  1|    10|     1|
|  2|  2|     6|    32|
|  2|  3|     0|     0|
|  3|  1|    12|     2|
|  3|  2|    20|     5|
|  3|  3|    63|    22|
+---+---+------+------+

Group by id and pivot on id2, using first to pick the single value for each (id, id2) pair (the coalesce wrapper is redundant here, since there is exactly one row per cell, but it does no harm):

scala> var df4 = df3.groupBy("id").pivot("id2").agg(coalesce(first("value1")),coalesce(first("value2"))).orderBy(col("id"))

scala> val newNames = Seq("id","value1_1","value2_1","value1_2","value2_2","value1_3","value2_3")

Renaming the columns:

scala>  df4.toDF(newNames: _*).show()
+---+--------+--------+--------+--------+--------+--------+
| id|value1_1|value2_1|value1_2|value2_2|value1_3|value2_3|
+---+--------+--------+--------+--------+--------+--------+
|  1|      54|       2|       0|       6|     578|      14|
|  2|      10|       1|       6|      32|       0|       0|
|  3|      12|       2|      20|       5|      63|      22|
+---+--------+--------+--------+--------+--------+--------+

Rearrange the columns if needed. Let me know if you have any questions. Happy Hadoop!

Mahesh Gupta