I was referring to How to explode an array into multiple columns in Spark for a similar need.

That code works for a dataframe with a single array field, but when the dataframe has multiple array fields, I'm not able to convert them all to multiple columns.

For example,

dataframe1

+--------------------+----------------------------------+----------------------------------+
|                 f1 |f2                                |f3                                |
+--------------------+----------------------------------+----------------------------------+
|12                  |                              null|                              null|
|13                  |                              null|                              null|
|14                  |                              null|                              null|
|15                  |                              null|                              null|
|16                  |                              null|                              null|
|17                  |                [[Hi, 256, Hello]]|        [[a, b], [a, b, c],[a, b]]|
|18                  |                              null|                              null|
|19                  |                              null|                              null|
+--------------------+----------------------------------+----------------------------------+

I want to convert it to below dataframe:

dataframe2

+--------------------+----------------------------------+----------------------------------+----------------------------------+
|                 f1 |f2_0                              |f3_0                              |f3_1                              |
+--------------------+----------------------------------+----------------------------------+----------------------------------+
|12                  |                              null|                              null|                              null|
|13                  |                              null|                              null|                              null|
|14                  |                              null|                              null|                              null|
|15                  |                              null|                              null|                              null|
|16                  |                              null|                              null|                              null|
|17                  |                  [Hi, 256, Hello]|                            [a, b]|                         [a, b, c]|
|18                  |                              null|                              null|                              null|
|19                  |                              null|                              null|                              null|
+--------------------+----------------------------------+----------------------------------+----------------------------------+

I tried with the following code:

val dataframe2 = dataframe1.select(
  col("f1") +: (0 until 2).map(i => col("f2")(i).alias(s"f2_$i")): _* +: (0 until 2).map(i => col("f3")(i).alias(s"f3_$i")): _*
)

But it throws a compile error saying it expects a ")" after the first "_*".


2 Answers

+: is used in Scala to add a single element to a list. It can't be used to concatenate two lists together. Instead, you can use ++ as follows:
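The difference is easy to see with plain Scala sequences, independent of Spark (a small illustrative sketch; the names here are made up for the example):

```scala
// +: prepends a single element; ++ concatenates two sequences.
object PrependVsConcat {
  def main(args: Array[String]): Unit = {
    val head = Seq("f1")
    val tail = Seq("f2_0", "f3_0", "f3_1")

    // Prepend one element with +:
    val prepended = "f1" +: tail
    println(prepended)     // List(f1, f2_0, f3_0, f3_1)

    // Concatenate two sequences with ++
    val concatenated = head ++ tail
    println(concatenated)  // List(f1, f2_0, f3_0, f3_1)
  }
}
```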

import org.apache.spark.sql.functions.col

val cols = Seq(col("f1")) ++
  (0 until 1).map(i => col("f2")(i).alias(s"f2_$i")) ++
  (0 until 2).map(i => col("f3")(i).alias(s"f3_$i"))

val dataframe2 = dataframe1.select(cols: _*)

Note that to use this approach, you need to know the number of elements of the lists in advance. Above, I changed 2 to 1 for the f2 column.

Shaido
  • thanks for the explanation. I have one syntax doubt. What does the last _* expression mean in my code? I didn't understand that part from the reference link. – user3243499 Sep 19 '18 at 09:07
  • @user3243499: Glad to help. I think the answers here explain it better than I could: https://stackoverflow.com/questions/6051302/what-does-colon-underscore-star-do-in-scala – Shaido Sep 19 '18 at 09:18
Shaido's answer is already correct; this answer only enhances it by computing the maximum length of each array column dynamically instead of hard-coding it.

If the columns f2 and f3 are already arrays, their maximum array sizes can be computed as below.

import org.apache.spark.sql.functions.{max, size}

val s1 = df.select(max(size(df("f2")))).first().getInt(0)
val s2 = df.select(max(size(df("f3")))).first().getInt(0)

If instead the columns are delimited strings that must first be split into arrays, compute the sizes as below.

val s1 = df.select(max(size(split(df("f2"), ",")))).first().getInt(0)
val s2 = df.select(max(size(split(df("f3"), ",")))).first().getInt(0)

Then s1 and s2 can be used in the map calls from Shaido's answer, e.g. (0 until s1).map(...).
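The naming pattern those map calls produce can be sketched in plain Scala, with s1 and s2 hard-coded as stand-ins for the values computed above (here 1 and 2, matching the example dataframe):

```scala
// Sketch: generate the per-index alias names for each array column,
// mirroring (0 until s1).map(i => col("f2")(i).alias(s"f2_$i")).
object AliasNames {
  def aliases(field: String, size: Int): Seq[String] =
    (0 until size).map(i => s"${field}_$i")

  def main(args: Array[String]): Unit = {
    val s1 = 1 // max size of f2, as computed above
    val s2 = 2 // max size of f3, as computed above
    val cols = Seq("f1") ++ aliases("f2", s1) ++ aliases("f3", s2)
    println(cols)  // List(f1, f2_0, f3_0, f3_1)
  }
}
```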

Praveen L