I have nested strings like the one shown below, and I want to flat-map them to produce unique rows in Spark.

My DataFrame has:

A,B,"x,y,z",D

I want to convert it to produce output like

A,B,x,D
A,B,y,D
A,B,z,D

How can I do that?

Basically, how can I do a flat map and apply any function inside the DataFrame?

Thanks

1 Answer

Spark 2.0+

Dataset.flatMap:

// spark.implicits._ must be in scope for the tuple encoders (it is imported automatically in spark-shell)
val ds = df.as[(String, String, String, String)]
ds.flatMap {
  case (x1, x2, x3, x4) => x3.split(",").map((x1, x2, _, x4))
}.toDF
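
For reference, a minimal end-to-end sketch (assuming a SparkSession named spark with spark.implicits._ imported; the column names x1..x4 are just placeholders matching the example):

val df = Seq(("A", "B", "x,y,z", "D")).toDF("x1", "x2", "x3", "x4")

df.as[(String, String, String, String)]
  .flatMap { case (x1, x2, x3, x4) => x3.split(",").map((x1, x2, _, x4)) }
  .toDF("x1", "x2", "x3", "x4")   // restore the column names lost by the tuple round-trip
  .show()
// three rows: (A,B,x,D), (A,B,y,D), (A,B,z,D)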

Spark 1.3+

Use split and explode functions:

val df = Seq(("A", "B", "x,y,z", "D")).toDF("x1", "x2", "x3", "x4")
df.withColumn("x3", explode(split($"x3", ",")))

Spark 1.x

DataFrame.explode (deprecated in Spark 2.x)

df.explode($"x3")(_.getAs[String](0).split(",").map(Tuple1(_)))
  • I gotta remember the `Dataset` option -- thanks for adding that. – David Griffin Apr 22 '16 at 12:21
  • @DavidGriffin Thanks. I should have close it as a duplicate but I marked wrong question by mistake so I decided to answer and add something new :) – zero323 Apr 22 '16 at 15:17
  • @zero323 I checked the scala api docs for `explode` in `functions` and it doesn't show as deprecated. https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.functions$ – elghoto Sep 22 '17 at 23:43
  • 1
    @elghoto That link points to the doc for the utility function `explode`, whereas I believe that zero323 is referring to the DataFrame transformation `explode`, which has apparently been deprecated since 2.0.0: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset – sumitsu Oct 10 '17 at 15:23