I have nested strings like the one shown below, and I want to flat-map them to produce unique rows in Spark.

My DataFrame has:

A,B,"x,y,z",D

I want to convert it to produce output like

A,B,x,D
A,B,y,D
A,B,z,D

How can I do that?

Basically, how can I do a flat map and apply any function inside the DataFrame?

Thanks

1 Answer

Spark 2.0+

Dataset.flatMap:

// spark.implicits._ must be in scope for the tuple encoders (it is imported automatically in spark-shell)
val ds = df.as[(String, String, String, String)]
ds.flatMap {
  case (x1, x2, x3, x4) => x3.split(",").map((x1, x2, _, x4))
}.toDF
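
For reference, a minimal end-to-end sketch (assuming a SparkSession named spark with spark.implicits._ imported; the column names x1..x4 are just placeholders matching the example):

val df = Seq(("A", "B", "x,y,z", "D")).toDF("x1", "x2", "x3", "x4")

df.as[(String, String, String, String)]
  .flatMap { case (x1, x2, x3, x4) => x3.split(",").map((x1, x2, _, x4)) }
  .toDF("x1", "x2", "x3", "x4")   // restore the column names lost by the tuple round-trip
  .show()
// three rows: (A,B,x,D), (A,B,y,D), (A,B,z,D)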

Spark 1.3+

Use split and explode functions:

val df = Seq(("A", "B", "x,y,z", "D")).toDF("x1", "x2", "x3", "x4")
df.withColumn("x3", explode(split($"x3", ",")))

Spark 1.x

DataFrame.explode (deprecated in Spark 2.x)

df.explode($"x3")(_.getAs[String](0).split(",").map(Tuple1(_)))
  • I gotta remember the `Dataset` option -- thanks for adding that. – David Griffin Apr 22 '16 at 12:21
  • @DavidGriffin Thanks. I should have close it as a duplicate but I marked wrong question by mistake so I decided to answer and add something new :) – zero323 Apr 22 '16 at 15:17
  • @zero323 I checked the scala api docs for `explode` in `functions` and it doesn't show as deprecated. https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.functions$ – elghoto Sep 22 '17 at 23:43
  • 1
    @elghoto That link points to the doc for the utility function `explode`, whereas I believe that zero323 is referring to the DataFrame transformation `explode`, which has apparently been deprecated since 2.0.0: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset – sumitsu Oct 10 '17 at 15:23