0

I have been looking at the groupBy functions of RDD and DataFrame and am curious how they differ. Is there any performance difference, or any other difference? Please suggest.

Prashant

1 Answer

-1

One difference between DataFrame.groupBy and RDD.groupBy that comes to mind: in my experience, RDD's groupBy does not preserve order, whereas DataFrame's groupBy does.

df.orderBy($"date").groupBy($"id").agg(first($"date") as "start_date")

The above works as expected, i.e. the aggregated results will be ordered by date. Because the method has the same name on both RDD and DataFrame, one might expect it to behave the same way on an RDD, but it does not. The reason is that the implementations of RDD's groupBy and DataFrame's groupBy are very different: RDD's groupBy may shuffle data according to the keys.
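If the goal is simply the earliest date per id, the result does not have to depend on row order at all. Here is a minimal sketch, assuming a local SparkSession and made-up sample rows. The object name `EarliestDatePerId`, the helper `startDates` and the data are hypothetical and only for illustration. It aggregates with `min`, which gives the same answer however the rows happen to be arranged before the grouping.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, min}

object EarliestDatePerId {
  // Hypothetical helper: earliest date per id, independent of input row order.
  def startDates(df: DataFrame): DataFrame =
    df.groupBy(col("id")).agg(min(col("date")) as "start_date")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("earliest-date-per-id")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data: ISO-formatted date strings, so lexical order
    // matches chronological order.
    val df = Seq(
      (1, "2018-04-03"),
      (1, "2018-04-01"),
      (2, "2018-04-02")
    ).toDF("id", "date")

    // No orderBy is needed because min does not depend on row order.
    startDates(df).show()

    spark.stop()
  }
}
```

With real date or timestamp columns the same aggregation applies; string dates are used here only to keep the sketch short.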

Sivaprasanna Sethuraman
  • I think it is misleading. `ORDER BY` will affect the execution plan, but there is no [explicit guarantee](https://issues.apache.org/jira/browse/SPARK-16207) and I am not familiar with any test suite that would cover that behavior. Quoting Sean Owen: "The problem is I think just about every method doesn't necessarily preserve order, or is not intended to guarantee it, even if it might in many cases." My own experience shows that this behavior is not stable (or was buggy in different versions). Do you have any authoritative (JIRA, source, design docs, test) reference that supports that claim? – Alper t. Turker Apr 23 '18 at 06:43
  • And "DataFrame's groupBy is very different. RDD's groupBy may shuffle data according to the keys." - this is incorrect. Both `Dataset` and `RDD` will shuffle the data. Just read the execution plan. – Alper t. Turker Apr 23 '18 at 06:44
  • Interesting. Thanks for that JIRA. I don't have a supporting doc or any other resource. It was more something I had experienced in multiple cases. Sean Owen's response does shed some light on it. @Prashant, please remove the "answer" tick from this one. – Sivaprasanna Sethuraman Apr 23 '18 at 07:04
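The suggestion in the comments to read the execution plan can be tried out as follows. This is a rough sketch, assuming a local SparkSession and made-up sample rows. The object `CompareGroupByPlans` and its helpers are hypothetical names. `explain()` prints the DataFrame's physical plan, where shuffles typically show up as Exchange operators. `toDebugString` returns the RDD lineage, where a shuffle shows up as a ShuffledRDD.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, first}

object CompareGroupByPlans {
  // Assumed sample data, for illustration only.
  val data: Seq[(Int, String)] =
    Seq((1, "2018-04-03"), (1, "2018-04-01"), (2, "2018-04-02"))

  // RDD side: lineage of a groupBy on the first tuple element.
  def rddLineage(spark: SparkSession): String =
    spark.sparkContext.parallelize(data).groupBy(_._1).toDebugString

  // DataFrame side: print the physical plan of the query from the answer.
  def explainDataFrame(spark: SparkSession): Unit = {
    import spark.implicits._
    data.toDF("id", "date")
      .orderBy(col("date"))
      .groupBy(col("id"))
      .agg(first(col("date")) as "start_date")
      .explain()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("compare-groupby-plans")
      .getOrCreate()

    explainDataFrame(spark)     // look for Exchange operators in the output
    println(rddLineage(spark))  // look for a ShuffledRDD in the lineage

    spark.stop()
  }
}
```

The exact plan text varies between Spark versions, so treat the printed output as something to inspect rather than a fixed format.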