
I am new to Spark and I have some questions about the aggregation functions MAX and MIN in SparkSQL.

In SparkSQL, when I use the MAX / MIN function, only MAX(value) / MIN(value) is returned. But what if I also want the other corresponding columns?

For example, given a DataFrame with columns time, value and label, how can I get the time with the MIN(value), grouped by label?

Thanks.

Jamin
  • Do you know how to do this in regular SQL? Normally you would do something like `ORDER BY value desc LIMIT 1` – maxymoo Mar 17 '16 at 03:05
  • @maxymoo Thanks. It's related to grouping, so I prefer using an aggregation function. – Jamin Mar 17 '16 at 03:07
  • @libenn after your edit, actually maxymoo's way might be the easiest. What you want to do won't work: since you are using an aggregation function, all the results have to be aggregation-function results or columns that you group by (label in your case). – Mateusz Dymczyk Mar 17 '16 at 03:27
  • @MateuszDymczyk Thanks. But how can I group them by label in maxymoo's case? – Jamin Mar 17 '16 at 05:10

2 Answers


You need to first do a groupBy, and then join that back to the original DataFrame. In Scala, it looks like this:

// Needs: import sqlContext.implicits._ (for the $"..." syntax)
//        import org.apache.spark.sql.functions.{min, max}
// Compute the per-label minimum, then join it back to pick up the matching rows.
df.join(
  df.groupBy($"label").agg(min($"value") as "min_value").withColumnRenamed("label", "min_label"),
  $"min_label" === $"label" && $"min_value" === $"value"
).drop("min_label").drop("min_value").show

I don't use Python, but it would look close to the above.

You can even do max() and min() in one pass:

df.join(
  df.groupBy($"label")
    .agg(min($"value") as "min_value", max($"value") as "max_value")
    .withColumnRenamed("label", "r_label"), 
  $"r_label" === $"label" && ($"min_value" === $"value" || $"max_value" === $"value")
).drop("r_label")
David Griffin

You can use sortByKey(true) to sort in ascending order and then apply the action take(1) to get the minimum.

And use sortByKey(false) to sort in descending order and then apply take(1) to get the maximum.
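
A rough sketch of that RDD route (my wording, not the answerer's), assuming value is a Double; note that it returns a single global row, not one row per label:

    // Key each Row by its value (assumed Double), sort, and take the first element.
    val byValue = df.rdd.map(row => (row.getAs[Double]("value"), row))
    byValue.sortByKey(ascending = true).take(1)   // row with the smallest value
    byValue.sortByKey(ascending = false).take(1)  // row with the largest value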

If you want to use the Spark SQL way, you can follow the approach explained by @maxymoo in the comments, as in the sketch below.
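
A minimal sketch of that SQL route (not from the original answer), assuming the DataFrame is registered as a temp table named events (a hypothetical name); it mirrors the join-back approach from the first answer in plain SQL:

    df.registerTempTable("events")  // hypothetical table name

    // Join each row against the per-label minimum value.
    sqlContext.sql("""
      SELECT t.time, t.value, t.label
      FROM events t
      JOIN (SELECT label, MIN(value) AS min_value
            FROM events
            GROUP BY label) m
        ON t.label = m.label AND t.value = m.min_value
    """).show()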

Sivakumar