
I have a dataframe as follows:

+-----------+
|        f1 |
+-----------+
|[a,b,c]    |
|[e,f,g]    |
|[h,i]      |
+-----------+

I want to explode it into rows, along with a unique number field that is repeated for all elements coming from the same array, as follows:

+-----------+--------+
|        f1 |     uid|
+-----------+--------+
|a          |       1|
|b          |       1|
|c          |       1|
|e          |       2|
|f          |       2|
|g          |       2|
|h          |       3|
|i          |       3|
+-----------+--------+

I can perform the explode directly, as explained here - Spark: Explode a dataframe array of structs and append id

but I am not sure how to add the uid field to the new dataframe so that all elements exploded from the same array get the same uid and elements from different arrays get different uid values.

user3243499

1 Answer

The right way to do it is to use monotonically_increasing_id:

import org.apache.spark.sql.functions.{explode, monotonically_increasing_id}
import spark.implicits._  // assuming the SparkSession is in scope as spark

val df = Seq(Seq("a", "b", "c"), Seq("e", "f", "g"), Seq("h", "i")).toDF("f1")

df
  .withColumn("uid", monotonically_increasing_id)  // assign one id per source row first
  .withColumn("f1", explode($"f1"))                // then explode; each element keeps its row's id
  .show
// +---+---+                                                                       
// | f1|uid|
// +---+---+
// |  a|  0|
// |  b|  0|
// |  c|  0|
// |  e|  1|
// |  f|  1|
// |  g|  1|
// |  h|  2|
// |  i|  2|
// +---+---+

The numbers won't necessarily be consecutive as in the example, but they will uniquely identify the source row.
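
If you do need strictly consecutive ids, a common alternative is RDD.zipWithIndex, which assigns consecutive 0-based indices at the cost of an extra Spark job. A minimal sketch, assuming the SparkSession is in scope as spark and the imports from the snippet above:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField}

// zipWithIndex assigns consecutive indices across all partitions
val indexed = df.rdd.zipWithIndex.map {
  case (row, idx) => Row.fromSeq(row.toSeq :+ idx)
}

// extend the original schema with the new uid column
val schema = df.schema.add(StructField("uid", LongType, nullable = false))

spark.createDataFrame(indexed, schema)
  .withColumn("f1", explode($"f1"))
  .show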

Don't use rank().over(Window.orderBy("f1")). It is inherently sequential and not scalable, and should be avoided except for local Datasets (i.e. ones that return true from isLocal).
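
For contrast, a sketch of the window-based pattern being discouraged: the window has an orderBy but no partitionBy, so Spark moves every row to a single partition and logs a warning about it.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// Spark warns here: "No Partition Defined for Window operation!
// Moving all data to a single partition, this can cause serious
// performance degradation."
df
  .withColumn("uid", rank().over(Window.orderBy("f1")))
  .withColumn("f1", explode($"f1"))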

  • So I find your "don't" interesting, in that there is nothing but blurb on the under-the-hood optimization with use of DF and DS. You state the opposite, which I concur with. – thebluephantom Sep 23 '18 at 15:49