How to get a single row with the maximum value while keeping the whole row?

Question

I want to get single row for each id where only the maximum value of the charge column is present.

Example input data:

id  name charge 
11  hg   10    
11  mm   20
22  aa   40
22  bb   40

Code I have tried:

df.agg(max("charge"))

I am getting only the maximum value, like this:

charge
40

However, I want to keep the whole row:

id  name charge
11  mm   20
22  aa   40
22  bb   40

How to keep the first two columns as well? The name column can have different values for the same id so it's not possible to use groupBy on both these columns and aggregate the result.

If two rows have the same id and charge both rows should be kept.

score 7 · Accepted Answer · answered Jan 31 '18 at 06:23

It is necessary to group the rows after the id column and then find the maximum value for the charge column in each group. If groupBy is used to achieve this the name column will disappear, as you noticed. Another way is to use a window and partition by id.

To make sure that both rows are kept when the id and charge are the same value but the name is different, the best way is to add a new column maxCharge and then filter the dataframe.

Using the example dataframe from the question:

val w = Window.partitionBy($"id")
val df2 = df.withColumn("maxCharge", max("charge").over(w))
  .filter($"maxCharge" === $"charge")
  .drop("charge")
  .withColumnRenamed("maxCharge", "charge")

Here, first a new column is added with the max value for each id. Then the rows with a charge value less than this are removed. Finally, the new column is renamed charge to match the required output.

Final result:

+---+----+------+
| id|name|charge|
+---+----+------+
| 22|  aa|    40|
| 22|  bb|    40|
| 11|  mm|    20|
+---+----+------+

How can we do this, but only keep one row per max? I.e. if charge is 40 in this case, I don't want both rows of "charge == 40", but only one of them at random. — ajl123, Nov 09 '22 at 18:19
@ajl123: You should be able to do this by skipping the last 3 lines of code and use `df.withColumn("maxCharge", max("charge").over(w))`. — Shaido, Nov 10 '22 at 01:01

score 0 · Answer 2 · edited Jan 31 '18 at 10:12

First step should be to group by on id and fetch max(charge) as new column. second step should be to join input and grouped data set on input.id=grouped_data.id and input.charge=grouped_data.charge.

input:
id  name charge 
11  hg   10    
11  mm   20
22  aa   40
22  bb   40


grouped_data_on_id:
id  max(charge)
11  20
22  40


joined_data_on_id_and_charge:
id  name charge 
11  hg   10    
22  aa   40
22  bb   40

How to get a single row with the maximum value while keeping the whole row?

2 Answers2