
I am new to Spark and I have some questions about the aggregation functions MAX and MIN in SparkSQL.

In SparkSQL, when I use the MAX / MIN function, only MAX(value) / MIN(value) is returned. But what if I also want the other corresponding columns?

For example, given a DataFrame with columns time, value and label, how can I get the time with the MIN(value), grouped by label?

Thanks.

Jamin
  • Do you know how to do this in regular SQL? Normally you would do something like `ORDER BY value desc LIMIT 1` – maxymoo Mar 17 '16 at 03:05
  • @maxymoo Thanks. It's related to grouping, so I prefer using an aggregation function. – Jamin Mar 17 '16 at 03:07
  • @libenn after your edit, actually maxymoo's way might be the easiest. What you want to do won't work: since you are using an aggregation function, all the results have to be aggregation-function results or columns that you group by (label in your case). – Mateusz Dymczyk Mar 17 '16 at 03:27
  • @MateuszDymczyk Thanks. But how can I group them by label in maxymoo's case? – Jamin Mar 17 '16 at 05:10

2 Answers


You need to first do a groupBy, and then join that back to the original DataFrame. In Scala, it looks like this:

// Needs: import sqlContext.implicits._ (for the $"..." syntax)
//        import org.apache.spark.sql.functions.{min, max}
// Compute the per-label minimum, then join it back to pick up the matching rows.
df.join(
  df.groupBy($"label").agg(min($"value") as "min_value").withColumnRenamed("label", "min_label"),
  $"min_label" === $"label" && $"min_value" === $"value"
).drop("min_label").drop("min_value").show

I don't use Python, but it would look close to the above.

You can even do max() and min() in one pass:

df.join(
  df.groupBy($"label")
    .agg(min($"value") as "min_value", max($"value") as "max_value")
    .withColumnRenamed("label", "r_label"), 
  $"r_label" === $"label" && ($"min_value" === $"value" || $"max_value" === $"value")
).drop("r_label")
David Griffin

You can use sortByKey(true) to sort in ascending order and then apply the action take(1) to get the minimum.

And use sortByKey(false) to sort in descending order and then apply take(1) to get the maximum.
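
A rough sketch of that RDD route (my wording, not the answerer's), assuming value is a Double; note that it returns a single global row, not one row per label:

    // Key each Row by its value (assumed Double), sort, and take the first element.
    val byValue = df.rdd.map(row => (row.getAs[Double]("value"), row))
    byValue.sortByKey(ascending = true).take(1)   // row with the smallest value
    byValue.sortByKey(ascending = false).take(1)  // row with the largest value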

If you want to use the Spark SQL way, you can follow the approach explained by @maxymoo in the comments, as in the sketch below.
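
A minimal sketch of that SQL route (not from the original answer), assuming the DataFrame is registered as a temp table named events (a hypothetical name); it mirrors the join-back approach from the first answer in plain SQL:

    df.registerTempTable("events")  // hypothetical table name

    // Join each row against the per-label minimum value.
    sqlContext.sql("""
      SELECT t.time, t.value, t.label
      FROM events t
      JOIN (SELECT label, MIN(value) AS min_value
            FROM events
            GROUP BY label) m
        ON t.label = m.label AND t.value = m.min_value
    """).show()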

Sivakumar