I have nested field like below. I want to call flatmap (I think) to produce a flattened row.
My dataset has
A,B,[[x,y,z]],C
I want to convert it to produce output like
A,B,X,Y,Z,C
This is for Spark 2.0+
Thanks!
I have nested field like below. I want to call flatmap (I think) to produce a flattened row.
My dataset has
A,B,[[x,y,z]],C
I want to convert it to produce output like
A,B,X,Y,Z,C
This is for Spark 2.0+
Thanks!
Apache DataFu has a generic explodeArray method that will do exactly what you need.
import datafu.spark.DataFrameOps._
val df = sc.parallelize(Seq(("A","B",Array("X","Y","Z"),"C"))).toDF
df.explodeArray(col("_3"), "token").show
This will produce:
+---+---+---------+---+------+------+------+
| _1| _2| _3| _4|token0|token1|token2|
+---+---+---------+---+------+------+------+
| A| B|[X, Y, Z]| C| X| Y| Z|
+---+---+---------+---+------+------+------+
One thing to consider is that this method evaluates the data frame in order to determine how many columns to create - if it's expensive to compute it should be cached.
Full disclosure - I am a member of Apache DataFu.
Try this for RDD:
val rdd = sc.parallelize(Seq(("A","B",Array("X","Y","Z"),"C")))
rdd.flatMap(x => (Option(x._3).map(y => (x._1,x._2,y(0),y(1),y(2),x._4 )))).collect.foreach(println)
Output:
(A,B,X,Y,Z,C)