
I have a dataframe with columns c1, c2, c3, c4. I want to group it by one column, apply an aggregate function (e.g. min/max) to another column, and get the values of the remaining columns from the row that produced the aggregate result.

Example:

c1  c2  c3  c4
1   23  1   1
1   45  2   2
1   91  3   3
1   90  4   4
1   71  5   5
1   42  6   6
1   72  7   7
1   44  8   8
1   55  9   9
1   21  0   0

Should result in:

c1  c2  c3  c4
1   91  3   3

Let the dataframe be df:

df.groupBy($"c1").agg(max($"c2"), ??, ??)

Can someone please help with what should go in place of the ???

I know a solution to this problem using the RDD API. I wanted to explore whether this can be solved in an easier way using the Dataframe/Dataset API.

  • I found this answer helpful; it proposes to do exactly what you want: https://stackoverflow.com/questions/41236804/spark-dataframes-reducing-by-key – user238607 Aug 22 '17 at 17:05

1 Answer


You can do this in two steps:

  • calculate the aggregated data frame;

  • join it back with the original data frame and filter to the rows where c2 equals the group maximum;

so:

import org.apache.spark.sql.functions.max

// aggregate: per-group maximum of c2
val maxDF = df.groupBy("c1").agg(max($"c2").as("maxc2"))
// maxDF: org.apache.spark.sql.DataFrame = [c1: int, maxc2: int]

// join back, keep rows where c2 equals the group maximum, drop the helper column
df.join(maxDF, Seq("c1")).where($"c2" === $"maxc2").drop($"maxc2").show
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  1| 91|  3|  3|
+---+---+---+---+
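
For completeness, the same result can also be reached without the join, e.g. by ranking the rows within each c1 group with a window function. This is only a minimal sketch of that alternative, not part of the original answer:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// rank rows within each c1 group by c2 descending and keep only the top row
val w = Window.partitionBy("c1").orderBy(col("c2").desc)

df.withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")
  .show

Note that, unlike the join, row_number keeps exactly one row per group even if several rows share the maximum c2.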
Psidom