How to flatten a nested field in a Spark Dataset?

Question

I have a nested field like the one below. I want to call flatMap (I think) to produce a flattened row.

My dataset has

A,B,[[x,y,z]],C

I want to convert it to produce output like

A,B,X,Y,Z,C

This is for Spark 2.0+

Thanks!

http://blog.thedigitalcatonline.com/blog/2015/04/07/99-scala-problems-07-flatten/ — Shankar Shastri, Jun 29 '18 at 01:17

score 1 · Answer 1 · answered Oct 12 '21 at 09:39

Apache DataFu has a generic explodeArray method that will do exactly what you need.

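import datafu.spark.DataFrameOps._
import org.apache.spark.sql.functions.col

// In spark-shell, sc and the implicits needed for toDF are already in scope.
val df = sc.parallelize(Seq(("A","B",Array("X","Y","Z"),"C"))).toDF
df.explodeArray(col("_3"), "token").show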

This will produce:

+---+---+---------+---+------+------+------+
| _1| _2|       _3| _4|token0|token1|token2|
+---+---+---------+---+------+------+------+
|  A|  B|[X, Y, Z]|  C|     X|     Y|     Z|
+---+---+---------+---+------+------+------+

One thing to consider is that this method evaluates the DataFrame in order to determine how many columns to create. If the DataFrame is expensive to compute, cache it before calling explodeArray.
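
For readers who cannot add DataFu, the same idea can be sketched with plain Spark SQL functions. This is not DataFu's implementation, only an illustration of why the column count depends on the data: the helper name explodeArrayColumn and the object name are made up for this example. It assumes default (non-ANSI) settings, where getItem returns null for an index past the end of a shorter array.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, size}

object ExplodeArraySketch {
  // Hypothetical helper (not part of Spark or DataFu): adds one column per array slot.
  def explodeArrayColumn(df: DataFrame, arrayCol: String, alias: String): DataFrame = {
    // This aggregation runs a Spark job over df, which is why caching
    // an expensive DataFrame beforehand can pay off.
    val row = df.agg(max(size(col(arrayCol)))).head()
    val maxLen = if (row.isNullAt(0)) 0 else row.getInt(0)

    // One new column per index: alias0, alias1, ...
    val tokens = (0 until maxLen).map(i => col(arrayCol).getItem(i).as(s"$alias$i"))
    df.select((col("*") +: tokens): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-array-sketch")
      .getOrCreate()
    import spark.implicits._

    // cache() so the size scan and the final select do not recompute the input
    val df = Seq(("A", "B", Array("X", "Y", "Z"), "C")).toDF().cache()
    explodeArrayColumn(df, "_3", "token").show()

    spark.stop()
  }
}
```

If the array length is known up front, the aggregation can be skipped and the indices hard-coded, which avoids the extra pass over the data.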

Full disclosure - I am a member of Apache DataFu.

score 0 · Answer 2 · answered Jun 29 '18 at 01:47


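Try this for an RDD:

val rdd = sc.parallelize(Seq(("A","B",Array("X","Y","Z"),"C")))

// Option(x._3) skips rows whose array is null; the indices 0 to 2 assume
// every array has at least three elements.
rdd.flatMap(x => (Option(x._3).map(y => (x._1,x._2,y(0),y(1),y(2),x._4 )))).collect.foreach(println)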

Output:

(A,B,X,Y,Z,C)
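
Since the question asks about a Dataset rather than an RDD, the same approach can be sketched with the typed Dataset API. The object name, the type aliases, and the flatten helper are hypothetical and exist only for this example. It assumes the nested field is an array of exactly three strings; rows with a null array or a different length are dropped rather than raising an error.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object FlattenDatasetSketch {
  // Illustrative aliases for the nested input row and the flattened output row.
  type Nested = (String, String, Array[String], String)
  type Flat = (String, String, String, String, String, String)

  // Hypothetical helper: flattens the three-element array into three columns.
  def flatten(ds: Dataset[Nested]): Dataset[Flat] = {
    val spark = ds.sparkSession
    import spark.implicits._ // encoders for the tuple types

    ds.flatMap { case (a, b, arr, c) =>
      // Option(arr) guards against null; the pattern only matches length-3 arrays.
      Option(arr).collect { case Array(x, y, z) => (a, b, x, y, z, c) }.toList
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-dataset-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(("A", "B", Array("X", "Y", "Z"), "C")).toDS()
    flatten(ds).show()

    spark.stop()
  }
}
```

Like the RDD version, this is tied to one fixed shape, so it does not generalize to arrays of varying length.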

Answered by 1pluszara

It's very specific. – thebluephantom Jun 29 '18 at 08:15
It may not be possible to make this generic. – thebluephantom Jun 29 '18 at 09:13
Try this: https://stackoverflow.com/questions/37471346/automatically-and-elegantly-flatten-dataframe-in-spark-sql – 1pluszara Jun 29 '18 at 09:14
Provide more details on your data, such as the schema! – 1pluszara Jun 29 '18 at 09:15
I looked on SO. As soon as intermixing occurs, it seems problematic. – thebluephantom Jun 29 '18 at 09:51
