
The PySpark documentation says that the first() and last() functions are non-deterministic (without mentioning their use inside windows). While researching this, I found this answer, which states:

You could still use last and first functions over a Window which guarantees determinism

So, are first and last deterministic when used over a Window and non-deterministic when used on a Group? Is there some documentation confirming this?

2 Answers


I can't find any documentation supporting this, so this is just from my experience:

first and last are deterministic only if the Window has a well-defined ordering. There has to be an orderBy clause in the WindowSpec definition, and also the ordering has to be unique, i.e. no two rows should have the same value in the column being ordered by.
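A minimal pure-Python sketch of why the ordering must be unique (this is not Spark; the helper `first_over_window` and the sample rows are invented for illustration). Spark may deliver a partition's rows in any physical order, and a stable sort on a non-unique key preserves that arbitrary arrival order among ties, so "first after sorting" can change from run to run:

```python
# Two physical orderings of the same logical data, as Spark might
# produce on different runs. Each row is (group, sort_key, id).
rows_a = [("x", 1, "row1"), ("x", 1, "row2")]
rows_b = [("x", 1, "row2"), ("x", 1, "row1")]

def first_over_window(rows, key):
    """Simulate first() over a window ordered by `key`."""
    return sorted(rows, key=key)[0][2]

# Ordering by the non-unique column alone: the winner depends on
# whichever order the rows happened to arrive in.
ambiguous = lambda r: r[1]
print(first_over_window(rows_a, ambiguous))  # row1
print(first_over_window(rows_b, ambiguous))  # row2 -- different result!

# Adding a unique tiebreaker (here, the id column) pins the result
# regardless of physical order.
unique = lambda r: (r[1], r[2])
print(first_over_window(rows_a, unique))  # row1
print(first_over_window(rows_b, unique))  # row1
```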

mck

mck's answer is right: the underlying tables being queried have no inherent row ordering in Spark.

To make these functions deterministic, you need to use an ORDER BY clause whose keys are unique. I usually include the table's PK as a final tiebreaker in my ORDER BY clauses to guarantee deterministic results.
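The PK-tiebreaker pattern can be sketched in plain SQL (using SQLite via Python's stdlib here rather than Spark; the table and column names are invented). The two rows with val = 10 tie on the sort column, so appending pk to the ORDER BY ensures ROW_NUMBER() always picks the same row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, grp TEXT, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(1, "a", 10), (2, "a", 10), (3, "a", 5)])

# val alone is not unique (rows 1 and 2 tie), so pk breaks the tie.
row = conn.execute("""
    SELECT pk FROM (
        SELECT pk,
               ROW_NUMBER() OVER (PARTITION BY grp ORDER BY val, pk) AS rn
        FROM t
    ) WHERE rn = 1
""").fetchone()
print(row[0])  # 3 -- val=5 sorts first; pk would break any remaining ties
```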

This thread has more in-depth answers for a similar window function: Is ORDER BY and ROW_NUMBER() deterministic?

Stu