
What's the difference between explode function and explode operator?

Jacek Laskowski

2 Answers


org.apache.spark.sql.functions.explode

The explode function creates a new row for each element in the given array or map column (in a DataFrame).

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.explode
import spark.implicits._ // for the $"..." column syntax

val signals: DataFrame = spark.read.json(signalsJson)
signals.withColumn("element", explode($"data.datapayload"))

explode creates a Column.

See functions object and the example in How to unwind array in DataFrame (from JSON)?
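As a self-contained illustration, here is a minimal sketch (the id and data column names are made up, and a spark-shell-like session is assumed so that spark and spark.implicits._ are available):

import org.apache.spark.sql.functions.explode
import spark.implicits._

// One row whose "data" column holds an array of three elements.
val df = Seq(("a", Seq(1, 2, 3))).toDF("id", "data")

// explode yields one output row per array element; the other columns are duplicated.
df.withColumn("element", explode($"data")).show()
// rows: (a, [1,2,3], 1), (a, [1,2,3], 2), (a, [1,2,3], 3)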

Dataset[Row] explode / flatMap operator (method)

The explode operator is almost the same as the explode function.

From the scaladoc:

explode returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.

ds.flatMap(_.words.split(" "))

Please note that (again quoting the scaladoc):

Deprecated (Since version 2.0.0) use flatMap() or select() with functions.explode() instead

See Dataset API and the example in How to split multi-value column into separate rows using typed Dataset?
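To make the scaladoc one-liner above concrete, here is a minimal sketch (the Sentence case class and its words field are invented for illustration; a spark-shell-like session is assumed):

import spark.implicits._

// A hypothetical record type with a space-separated text field.
final case class Sentence(words: String)

val ds = Seq(Sentence("explode or flatMap"), Sentence("lateral view")).toDS()

// flatMap maps each record to zero or more output records and flattens the result.
val tokens = ds.flatMap(_.words.split(" ")) // Dataset[String]
// tokens rows: explode, or, flatMap, lateral, view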


Despite explode being deprecated (so the main question effectively becomes the difference between the explode function and the flatMap operator), the difference is that the former is a function while the latter is an operator. They have different signatures, but both can give the same results. That often leads to discussions about which is better, and it usually boils down to personal preference or coding style.

One could also say that flatMap (i.e. the explode operator) is more Scala-ish, given how ubiquitous flatMap is in Scala programming (mainly hidden behind for-comprehensions).
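To illustrate the equivalence, a hedged sketch with invented data: both of the following produce the same rows; select with functions.explode stays in the untyped DataFrame world, while flatMap works on the typed Dataset.

import org.apache.spark.sql.functions.explode
import spark.implicits._

val ds = Seq(("a", Seq(1, 2, 3)), ("b", Seq(4))).toDS() // Dataset[(String, Seq[Int])]

// explode function: a Column-based expression, result is an untyped DataFrame.
val viaExplode = ds.select($"_1" as "id", explode($"_2") as "value")

// flatMap operator: a typed transformation, result is a Dataset[(String, Int)].
val viaFlatMap = ds.flatMap { case (id, xs) => xs.map(x => (id, x)) }

// Both contain the same four rows: (a,1), (a,2), (a,3), (b,4).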

Jacek Laskowski
  • What exactly do you mean by operator? There are no operators in Scala; they are all functions. – Yuval Itzchakov Apr 24 '17 at 08:17
  • I learnt the term while studying Spark, where I noticed Hadoop devs use the term 'operator' to denote...well...operators used to develop Hadoop applications. It took me some time to get used to the term, but I realized I'm biased towards Scala and think in methods or functions, which might not be true in other languages supported by a framework like Spark (think of Python, R or even SQL). – Jacek Laskowski Apr 24 '17 at 08:19
  • I see. But this question is tagged with the "Scala" tag, so I think it may be confusing. Perhaps you just want to tag it as apache-spark? – Yuval Itzchakov Apr 24 '17 at 08:22
  • And by "operator", you mean "UDF" in Hadoop land? – OneCricketeer Apr 24 '17 at 08:28
  • Not really. Just a general concept/computation to compose applications from (it could be a method, function, UDF, procedure, SQL clause, whatever). Something high-level like a SQL clause. – Jacek Laskowski Apr 24 '17 at 08:31

flatMap performs much better than explode, as flatMap requires much less data shuffling. If you are processing big data (>5 GB), the performance difference is clearly visible.

Asid