
I have a nested field like the one below. I want to call flatMap (I think) to produce a flattened row.

My dataset has

A,B,[[x,y,z]],C

I want to convert it to produce output like

A,B,X,Y,Z,C

This is for Spark 2.0+

Thanks!

kevl510
2 Answers


Apache DataFu has a generic explodeArray method that does exactly what you need.

import datafu.spark.DataFrameOps._
import org.apache.spark.sql.functions.col
// outside the spark-shell you also need: import spark.implicits._ (for toDF)

val df = sc.parallelize(Seq(("A","B",Array("X","Y","Z"),"C"))).toDF
df.explodeArray(col("_3"), "token").show

This will produce:

+---+---+---------+---+------+------+------+
| _1| _2|       _3| _4|token0|token1|token2|
+---+---+---------+---+------+------+------+
|  A|  B|[X, Y, Z]|  C|     X|     Y|     Z|
+---+---+---------+---+------+------+------+

One thing to consider: this method evaluates the data frame in order to determine how many columns to create. If the data frame is expensive to compute, it should be cached first.
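To see why that evaluation is unavoidable, here is a plain-Scala sketch (no Spark, no DataFu — the variable names are made up for illustration) of the column-sizing logic: the number of token columns has to be the length of the longest array in the data, so every row must be scanned before any column can be created.

```scala
// Sketch of the column-sizing step: output width = max array length,
// which requires a full pass over the data before columns exist.
val arrays = Seq(Array("X", "Y"), Array("X", "Y", "Z"))
val numTokenCols = arrays.map(_.length).max          // widest row wins
val padded = arrays.map(_.padTo(numTokenCols, null)) // shorter rows padded with null
```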

Full disclosure - I am a member of Apache DataFu.

Eyal

Try this for an RDD:

val rdd = sc.parallelize(Seq(("A","B",Array("X","Y","Z"),"C")))

// Note: Option(x._3) skips rows with a null array, but this still assumes
// the array has exactly three elements.
rdd.flatMap(x => Option(x._3).map(y => (x._1, x._2, y(0), y(1), y(2), x._4)))
   .collect
   .foreach(println)

Output:

(A,B,X,Y,Z,C)
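The same flatMap-over-Option pattern can be checked on a plain Scala collection without a Spark cluster (the second row here is an invented example to show the null case): Option(arr) turns a null array into None, so that row is dropped, while map builds the widened tuple for the rest.

```scala
// Plain-Scala version of the same flatMap: a null array becomes None
// and the row is dropped; otherwise the flattened tuple is emitted.
val rows = Seq(
  ("A", "B", Array("X", "Y", "Z"), "C"),
  ("D", "E", null: Array[String], "F") // dropped by Option(...)
)
val flattened = rows.flatMap { case (a, b, arr, c) =>
  Option(arr).map(ys => (a, b, ys(0), ys(1), ys(2), c))
}
// flattened contains only ("A","B","X","Y","Z","C")
```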
1pluszara