I have a DataFrame in Spark which looks like this:

+------+-----+-----+-----+
| name | id1 | id2 | id3 |
+------+-----+-----+-----+
|  Bob |   1 |  10 | 100 |
| Jill |   2 |  20 | 200 |
|  Sue |   3 |  30 | 300 |
| Lane |   4 |  40 | 400 |
+------+-----+-----+-----+

I want to sort of explode the id columns like so:

+------+----------+---------+
| name | id_value | id_type |
+------+----------+---------+
|  Bob |        1 |     id1 |
|  Bob |       10 |     id2 |
|  Bob |      100 |     id3 |
| Jill |        2 |     id1 |
| Jill |       20 |     id2 |
| Jill |      200 |     id3 |
|  Sue |        3 |     id1 |
|  Sue |       30 |     id2 |
|  Sue |      300 |     id3 |
| Lane |        4 |     id1 |
| Lane |       40 |     id2 |
| Lane |      400 |     id3 |
+------+----------+---------+

I'm using the Scala API. Is there a good way to do this? I've looked at the explode and arrays_zip functions, but I'm not sure whether that's barking up the right tree.

user3685285
  • Possible duplicate of [How to melt Spark DataFrame?](https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe) – user10938362 May 23 '19 at 15:21

2 Answers


explode is for arrays and maps; for a fixed set of ordinary columns like these, one option is to build one projection per id column and union the results:

import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

df.columns
  .collect {
    // keep only the id* columns and build one projection per column
    case column if column.startsWith("id") =>
      df.select($"name",
                col(column) as "id_value",
                lit(column) as "id_type")
  }
  .reduce(_ union _)
  .show()

Output:

+----+--------+-------+
|name|id_value|id_type|
+----+--------+-------+
| Bob|       1|    id1|
|Jill|       2|    id1|
| Sue|       3|    id1|
|Lane|       4|    id1|
| Bob|      10|    id2|
|Jill|      20|    id2|
| Sue|      30|    id2|
|Lane|      40|    id2|
| Bob|     100|    id3|
|Jill|     200|    id3|
| Sue|     300|    id3|
|Lane|     400|    id3|
+----+--------+-------+
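The rows above come out grouped by id_type rather than by name as in the question's target table. If that ordering matters, a sort can be appended before show() — a small optional tweak on top of the answer's code, using the same imports as above:

df.columns
  .collect {
    case column if column.startsWith("id") =>
      df.select($"name", col(column) as "id_value", lit(column) as "id_type")
  }
  .reduce(_ union _)
  .orderBy("name", "id_type")  // group rows per name, then by id column
  .show()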
ollik1
  • Thanks, but I'm looking for a Spark SQL solution. Using functional programming methods on `DataFrames` incurs additional SerDe which slows down my jobs. – user3685285 May 23 '19 at 14:39
  • Scala is used just to generate the query; it is analogous to manually writing `(SELECT name, id_value, ... FROM ...) UNION (SELECT name, id_value, ... FROM ...)` (roughly as sketched below). It does not cause any execution-time SerDe. – ollik1 May 23 '19 at 14:46
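For reference, the hand-written query that comment describes would look roughly like this — a sketch only, assuming the DataFrame is registered as a temporary view named ds, and noting that the DataFrame union corresponds to SQL's UNION ALL (no deduplication):

select name, id1 as id_value, 'id1' as id_type from ds
union all
select name, id2 as id_value, 'id2' as id_type from ds
union all
select name, id3 as id_value, 'id3' as id_type from ds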

Another, perhaps less elegant, solution is to use explode combined with a lateral view:

select name, id[0] as id_value, id[1] as id_type
from ds
lateral view explode(array(array(id1, 'id1'), array(id2, 'id2'), array(id3, 'id3'))) ids as id

+----+--------+-------+
|name|id_value|id_type|
+----+--------+-------+
| Bob|       1|    id1|
| Bob|      10|    id2|
| Bob|     100|    id3|
|Jill|       2|    id1|
|Jill|      20|    id2|
|Jill|     200|    id3|
| Sue|       3|    id1|
| Sue|      30|    id2|
| Sue|     300|    id3|
|Lane|       4|    id1|
|Lane|      40|    id2|
|Lane|     400|    id3|
+----+--------+-------+
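To run this from the Scala API, the DataFrame first has to be registered under the name the query expects; a minimal sketch, assuming the DataFrame variable is called df:

df.createOrReplaceTempView("ds")  // expose the DataFrame to Spark SQL under the name used in the query

spark.sql("""
  select name, id[0] as id_value, id[1] as id_type
  from ds
  lateral view explode(array(array(id1, 'id1'), array(id2, 'id2'), array(id3, 'id3'))) ids as id
""").show()

Note that array(id1, 'id1') mixes an integer with a string literal, so Spark coerces both elements to string and id_value comes back as a string column; if the numeric type matters, cast it back with cast(id[0] as int).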
sachav