
Input DF:

  A         B
  1         1
  2         1
  2         2
  3         3
  3         1
  3         2
  3         3
  3         4

I am trying to filter the rows based on the combination of

(A, Max(B))

Output DF:

   A        B
   1        1
   2        2
   3        4

I am able to do this with

 df.groupBy()

but there are other columns in the DF that I also want to select without including them in the groupBy, so the row-filtering condition should apply only to A and B and not to the other columns (a minimal sketch of my current attempt is below). Any suggestions, please?
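Something like this is what I have now; it produces the correct (A, max(B)) pairs but drops every other column:

import org.apache.spark.sql.functions._

df.groupBy("A").agg(max("B").as("B"))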


1 Answer


As suggested in "How to get other columns when using Spark DataFrame groupby?", you can use window functions:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

df.withColumn("maxB", max(col("B")).over(Window.partitionBy("A"))).where(...)

where ... is replaced by a predicate comparing B with maxB.
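A minimal runnable sketch of this, assuming the predicate you want keeps the rows where B equals the per-A maximum (the sample data is taken from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample data from the question
val df = Seq((1, 1), (2, 1), (2, 2), (3, 3), (3, 1), (3, 2), (3, 3), (3, 4)).toDF("A", "B")

df.withColumn("maxB", max(col("B")).over(Window.partitionBy("A"))) // per-A maximum of B
  .where(col("B") === col("maxB"))                                 // keep only rows at that maximum
  .drop("maxB")                                                    // remove the helper column
  .show()

Any other columns in df are carried through unchanged, which is what the question asks for. If several rows tie for the maximum B within a group, all of them are kept; use a row_number() window with an orderBy instead if you need exactly one row per A.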