I have this DataFrame in Apache Spark, created with:

val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")

+------+---------+
| Col1 | Col2    |
+------+---------+
|  1   |[2, 3, 4]|
|  1   |[2, 3, 4]|
+------+---------+

How do I convert this into:

+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
|  1   |  2   |  3   |  4   |
|  1   |  2   |  3   |  4   |
+------+------+------+------+
Jorge Machado

5 Answers


A solution that doesn't convert to and from RDD:

df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col3")

Or arguably nicer:

val nElements = 3
df.select($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as s"Col${idx + 2}"): _*)

The size of a Spark array column is not fixed; you could, for instance, have:

+----+------------+
|Col1|        Col2|
+----+------------+
|   1|   [2, 3, 4]|
|   1|[2, 3, 4, 5]|
+----+------------+

So there is no way to know the number of columns up front and create them automatically. If you know the size is always the same, you can set nElements like this:

val nElements = df.select("Col2").first.getList(0).size
sgvd
  • Nice answer. Is there a way to use the more advanced Dataset API to achieve the above in a more type-safe way? E.g., how can I get rid of $"Col1"? – Mohammad Adnan Oct 10 '16 at 06:46
  • What do you mean exactly? I guess not getting rid of it by just not adding it to the `select` :) I guess you don't want to name it explicitly? But I'm not sure how you see type safety coming into play necessarily. It is probably worth it to create a new question with a minimal example of what you want to do. – sgvd Oct 10 '16 at 14:21
  • Sure, I'll add another question. But what I mean is that Spark's Dataset provides good compile-time safety, whereas $"col1" is only evaluated at run time. So if col1 is not present in the Dataset, I'd get a compile-time error in the select statement rather than a runtime one. – Mohammad Adnan Oct 11 '16 at 10:40

Just to give the PySpark version of sgvd's answer: if the array column is in Col2, then this select statement will move the first nElements of each array in Col2 to their own columns:

from pyspark.sql import functions as F

nElements = 3  # number of array elements to expand
df.select([F.col('Col2').getItem(i) for i in range(nElements)])
Shane Halloran
  • how would you add "Col1" to the select statement? – user422930 Sep 08 '18 at 07:55
  • @user422930, thank you for your question; sorry I only saw it now. Try something like: `from pyspark.sql import functions as F; df.select([F.col('Col1')] + [F.col('Col2').getItem(i) for i in range(nElements)])`. I haven't tested it, but let me know if it works or not. – Shane Halloran Nov 24 '18 at 22:11

Just to add on to sgvd's solution:

If the size is not always the same, you can set nElements like this:

import org.apache.spark.sql.functions.{max, size}

val nElements = df.select(size($"Col2").as("Col2_count"))
                  .select(max("Col2_count"))
                  .first.getInt(0)
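
Reusing the max and size imports above, a hedged sketch of how this plays out with the ragged example from the first answer: indexing past the end of a shorter array yields null, so every row still expands to the same number of columns.

val ragged = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4, 5))).toDF("Col1", "Col2")
val n = ragged.select(max(size($"Col2"))).first.getInt(0)  // 4 for this data
// Col5 is null for the first row, whose array has only three elements.
ragged.select($"Col1" +: Range(0, n).map(idx => $"Col2"(idx) as s"Col${idx + 2}"): _*).show()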
Yuan Zhao

You can use a map:

import scala.collection.mutable
import org.apache.spark.sql.Row

df.map {
    case Row(col1: Int, col2: mutable.WrappedArray[Int]) => (col1, col2(0), col2(1), col2(2))
}.toDF("Col1", "Col2", "Col3", "Col4").show()
Carlos Vilchez
  • What if I don't know how many columns I have in the array? I would like to have something like explodeToColumns – Jorge Machado May 23 '16 at 13:09
  • @JorgeMachado, you can find my solution for explodeToColumns in SparkR https://stackoverflow.com/questions/35919194/using-sparkr-how-to-split-a-string-column-into-n-multiple-columns/73677719#73677719 – Habib Karbasian Sep 11 '22 at 08:16

If you are working with SparkR, you can find my answer at https://stackoverflow.com/questions/35919194/using-sparkr-how-to-split-a-string-column-into-n-multiple-columns/73677719#73677719, where you don't need to use explode, but you do need SparkR::dapply and stringr::str_split_fixed.

Habib Karbasian