How to find first non-null values in groups? (secondary sorting using dataset api)

Question

I am working on a dataset that represents a stream of events (for example, tracking events fired from a website). Every event has a timestamp. A common use case for us is finding the first non-null value of a given field. For example, something like the following gets us most of the way there:

val eventsDf = spark.read.json(jsonEventsPath) 

case class ProjectedFields(visitId: String, userId: Int, timestamp: Long ... )

val projectedEventsDs = eventsDf.select(
    eventsDf("message.visit.id").alias("visitId"),
    eventsDf("message.property.user_id").alias("userId"),
    eventsDf("message.property.timestamp"),

    ...

).as[ProjectedFields]

projectedEventsDs.groupBy($"visitId").agg(first($"userId", true))

The problem with the above code is that the order of the rows fed into the first aggregation function is not guaranteed. I would like them sorted by timestamp, so that the result is the first non-null userId by timestamp rather than an arbitrary non-null userId.

Is there a way to define the sorting within a grouping?

Using Spark 2.10

BTW, the approach suggested for Spark 2.10 in "SPARK DataFrame: select the first row of each group" is to order before grouping, but that doesn't work. For example, the following code:

case class OrderedKeyValue(key: String, value: String, ordering: Int)
val ds = Seq(
  OrderedKeyValue("a", null, 1), 
  OrderedKeyValue("a", null, 2), 
  OrderedKeyValue("a", "x", 3), 
  OrderedKeyValue("a", "y", 4), 
  OrderedKeyValue("a", null, 5)
).toDS()

ds.orderBy("ordering").groupBy("key").agg(first("value", true)).collect()

Will sometimes return Array([a,y]) and sometimes Array([a,x])

Possible duplicate of [SPARK DataFrame: select the first row of each group](http://stackoverflow.com/questions/33878370/spark-dataframe-select-the-first-row-of-each-group) – zero323, Mar 23 '17 at 21:49
Sadly, I've found a lot of solutions for this using RDDs (groupByKey followed by mapValues), but RDDs are pretty suboptimal, or just plain fail, at a lot of the other things my use case requires. – hiroprotagonist, Mar 26 '17 at 23:35

score 8 · Accepted Answer · answered Mar 24 '17 at 22:12

Use my beloved window functions (...and experience how much simpler your life becomes!)

import org.apache.spark.sql.expressions.Window
val byKeyOrderByOrdering = Window
  .partitionBy("key")
  .orderBy("ordering")
  .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

import org.apache.spark.sql.functions.first
val firsts = ds.withColumn("first",
  first("value", ignoreNulls = true) over byKeyOrderByOrdering)

scala> firsts.show
+---+-----+--------+-----+
|key|value|ordering|first|
+---+-----+--------+-----+
|  a| null|       1|    x|
|  a| null|       2|    x|
|  a|    x|       3|    x|
|  a|    y|       4|    x|
|  a| null|       5|    x|
+---+-----+--------+-----+
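
If you want a single row per key rather than the same value repeated on every row (which is what the question asks for), one option is to project the key and the windowed column and then deduplicate. A minimal sketch, assuming the question's sample data rebuilt as tuples; the names `events`, `wholePartition` and `firstPerKey` are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first}

// Local session for the sketch; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("first-per-key").getOrCreate()
import spark.implicits._

// Same sample rows as in the question, as (key, value, ordering) tuples.
val events = Seq[(String, String, Int)](
  ("a", null, 1), ("a", null, 2), ("a", "x", 3), ("a", "y", 4), ("a", null, 5)
).toDF("key", "value", "ordering")

// Frame covers the whole partition, so every row sees the same first non-null value.
val wholePartition = Window
  .partitionBy("key")
  .orderBy("ordering")
  .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// Keep only key + windowed value, then collapse the identical rows.
val firstPerKey = events
  .select(col("key"), first(col("value"), ignoreNulls = true).over(wholePartition).as("first"))
  .distinct()

firstPerKey.show()
```

For the sample data this should leave one row, (a, x). The distinct() adds another aggregation step on top of the window, so it is worth checking the plan with explain() on real data.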

NOTE: Somehow, Spark 2.2.0-SNAPSHOT (built today) could not give me the correct answer without rangeBetween, even though I thought an unbounded range would be the default.
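
A likely explanation for the note above: when a window specification has an orderBy, Spark's default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW rather than the whole partition, so first(..., ignoreNulls = true) only sees the rows up to the current one. The sketch below puts the two frames side by side; the names `events`, `defaultFrame`, `wholePartition` and `compared` are illustrative, and the data is the question's sample rebuilt as tuples:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first}

// Local session for the sketch; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("window-frames").getOrCreate()
import spark.implicits._

val events = Seq[(String, String, Int)](
  ("a", null, 1), ("a", null, 2), ("a", "x", 3), ("a", "y", 4), ("a", null, 5)
).toDF("key", "value", "ordering")

// orderBy without an explicit frame: the frame grows from the partition start to the current row.
val defaultFrame = Window.partitionBy("key").orderBy("ordering")

// Explicit frame spanning the whole partition, as in the answer.
val wholePartition = defaultFrame
  .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val compared = events
  .withColumn("first_default", first(col("value"), ignoreNulls = true).over(defaultFrame))
  .withColumn("first_unbounded", first(col("value"), ignoreNulls = true).over(wholePartition))
  .orderBy("ordering")

compared.show()
```

With the default frame, the rows with ordering 1 and 2 have no non-null value in view yet, so first_default stays null there, while first_unbounded is x on every row.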

Jacek Laskowski


hmm -- so considering that the final result i want is something like `["a", "x"]`, does that essentially mean that Spark will plan two reduce steps? – hiroprotagonist Mar 26 '17 at 22:41
"Reduce steps"? What do you think they were? – Jacek Laskowski Mar 27 '17 at 06:57
yea i guess i am confused about what would happen under the hood w/ the introduction of the window function. because it's associative, the first function could reduce down the data in the nodes before sending to the driver for the final aggregation (i hope!). a secondary sort over the whole key space though -- could it still do that? – hiroprotagonist Mar 27 '17 at 19:09
@Soumyajit Please ask a separate question with more details to explain your requirements. Thanks. – Jacek Laskowski Aug 27 '20 at 21:15
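
On the partial-aggregation concern raised in the comments: a window function needs all rows of a key in one partition, whereas an ordinary aggregate such as min can be partially combined before the shuffle. An alternative sketch (a different technique from the window approach in the answer) drops null values first and then takes min over struct(ordering, value). Structs compare field by field, so the row with the smallest ordering wins. The names `events`, `earliest` and `firstPerKey` are illustrative. Caveat: keys whose values are all null disappear from the result.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, min, struct}

// Local session for the sketch; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("min-struct").getOrCreate()
import spark.implicits._

val events = Seq[(String, String, Int)](
  ("a", null, 1), ("a", null, 2), ("a", "x", 3), ("a", "y", 4), ("a", null, 5)
).toDF("key", "value", "ordering")

val firstPerKey = events
  .filter(col("value").isNotNull)                                    // keep candidate rows only
  .groupBy(col("key"))
  .agg(min(struct(col("ordering"), col("value"))).as("earliest"))    // smallest ordering wins
  .select(col("key"), col("earliest.value").as("first"))             // unpack the value field

firstPerKey.show()
```

For the sample data this should give a single row, (a, x), directly in the shape asked for, without a separate deduplication step.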
