I have this DataFrame in Apache Spark, created with:

val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")

+------+---------+
| Col1 | Col2    |
+------+---------+
|  1   |[2, 3, 4]|
|  1   |[2, 3, 4]|
+------+---------+

How do I convert this into:

+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
|  1   |  2   |  3   |  4   |
|  1   |  2   |  3   |  4   |
+------+------+------+------+
Jorge Machado

5 Answers


A solution that doesn't convert to and from RDD:

df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col3")

Or arguably nicer:

val nElements = 3
df.select($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as s"Col${idx + 2}"): _*)

The size of a Spark array column is not fixed; you could, for instance, have:

+----+------------+
|Col1|        Col2|
+----+------------+
|   1|   [2, 3, 4]|
|   1|[2, 3, 4, 5]|
+----+------------+

So there is no way to know the number of columns up front and create them automatically. If you know the size is always the same, you can set nElements like this:

val nElements = df.select("Col2").first.getList(0).size
sgvd
  • Nice answer. Is there a way to use the more advanced Dataset API to achieve the above in a more type-safe way? E.g., how can I get rid of $"Col1"? – Mohammad Adnan Oct 10 '16 at 06:46
  • What do you mean exactly? I guess not getting rid of it by just not adding it to the `select` :) I guess you don't want to name it explicitly? But I'm not sure how you see type safety coming into play necessarily. It is probably worth it to create a new question with a minimal example of what you want to do. – sgvd Oct 10 '16 at 14:21
  • Sure, I'll add another question. But what I mean is that Spark's Dataset provides good compile-time safety, whereas $"col1" is only evaluated at run time. So if col1 is not present in the Dataset, I'd get a compile-time error in the select statement rather than a runtime one. – Mohammad Adnan Oct 11 '16 at 10:40

Just to give the PySpark version of sgvd's answer: if the array column is in Col2, then this select statement will move the first nElements of each array in Col2 to their own columns:

from pyspark.sql import functions as F

nElements = 3  # number of array elements to expand
df.select([F.col('Col2').getItem(i) for i in range(nElements)])
Shane Halloran
  • how would you add "Col1" to the select statement? – user422930 Sep 08 '18 at 07:55
  • @user422930, thank you for your question; sorry I only saw it now. Try something like: `from pyspark.sql import functions as F; df.select([F.col('Col1')] + [F.col('Col2').getItem(i) for i in range(nElements)])`. I haven't tested it, but let me know if it works or not. – Shane Halloran Nov 24 '18 at 22:11

Just to add on to sgvd's solution:

If the size is not always the same, you can set nElements like this:

import org.apache.spark.sql.functions.{max, size}

val nElements = df.select(size($"Col2").as("Col2_count"))
                  .select(max("Col2_count"))
                  .first.getInt(0)
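
Reusing the max and size imports above, a hedged sketch of how this plays out with the ragged example from the first answer: indexing past the end of a shorter array yields null, so every row still expands to the same number of columns.

val ragged = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4, 5))).toDF("Col1", "Col2")
val n = ragged.select(max(size($"Col2"))).first.getInt(0)  // 4 for this data
// Col5 is null for the first row, whose array has only three elements.
ragged.select($"Col1" +: Range(0, n).map(idx => $"Col2"(idx) as s"Col${idx + 2}"): _*).show()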
Yuan Zhao

You can use a map:

import scala.collection.mutable
import org.apache.spark.sql.Row

df.map {
    case Row(col1: Int, col2: mutable.WrappedArray[Int]) => (col1, col2(0), col2(1), col2(2))
}.toDF("Col1", "Col2", "Col3", "Col4").show()
Carlos Vilchez
  • What if I don't know how many columns I have in the array? I would like to have something like explodeToColumns – Jorge Machado May 23 '16 at 13:09
  • @JorgeMachado, you can find my solution for explodeToColumns in SparkR https://stackoverflow.com/questions/35919194/using-sparkr-how-to-split-a-string-column-into-n-multiple-columns/73677719#73677719 – Habib Karbasian Sep 11 '22 at 08:16

If you are working with SparkR, you can find my answer at https://stackoverflow.com/questions/35919194/using-sparkr-how-to-split-a-string-column-into-n-multiple-columns/73677719#73677719, where you don't need to use explode, but you do need SparkR::dapply and stringr::str_split_fixed.

Habib Karbasian