How to filter a dataframe keeping the highest value of a certain column

Question

Given the dataframe below, I want to filter records that shares the same q2, id_q, check_id and keep only the ones with the highest value.

input dataframe:

q1	q2	id_q	check_id	value
sdfsdf	dfsdfsdf	10	10	90
hdfhhd	dfsdfsdf	10	10	80

There are 2 q2 with same id_q, check_id but with different values: 90,80.

I want to return for the same q2, id_q, check_id the line with the highest value. For example above the output is:

So I want to drop duplicates regarding to: check_id and id_q and keep the one with the highest value of valuecolumn

Desired Output:

q1	q2	id_q	check_id	value
sdfsdf	dfsdfsdf	10	10	90

Looks like you want to `groupby` by your wanted columns `q2, id_q, check_id` and aggregate column `value` by the `max` function. — Jonas Palačionis, Nov 09 '22 at 08:37
Does this answer your question? [Select rows of a DataFrame containing minimum of grouping variable in Julia](https://stackoverflow.com/questions/65024962/select-rows-of-a-dataframe-containing-minimum-of-grouping-variable-in-julia) — Nils Gudat, Nov 09 '22 at 09:34
This is basically a duplicate of https://stackoverflow.com/questions/65024962/select-rows-of-a-dataframe-containing-minimum-of-grouping-variable-in-julia (with max instead of min) — Nils Gudat, Nov 09 '22 at 09:34

Przemyslaw Szufel · Answer 1 · 2022-11-10T11:35:45.013

0

For this case this code seems to be shorter that the ones referenced in other answers:

Suppose you have

julia> df = DataFrame(a=["a","a","b","b","b","b"], b=[1,1,2,2,3,3],c=11:16,notimportant=rand(6))
6×4 DataFrame
 Row │ a       b      c      notimportant
     │ String  Int64  Int64  Float64
─────┼────────────────────────────────────
   1 │ a           1     11      0.93785
   2 │ a           1     12      0.877777
   3 │ b           2     13      0.845306
   4 │ b           2     14      0.477606
   5 │ b           3     15      0.722569
   6 │ b           3     16      0.122807

Than you can just do:

julia> combine(groupby(df, [:a, :b]), :c => maximum => :c)
3×3 DataFrame
 Row │ a       b      c
     │ String  Int64  Int64
─────┼──────────────────────
   1 │ a           1     12
   2 │ b           2     14
   3 │ b           3     16

edited Nov 10 '22 at 11:35

answered Nov 09 '22 at 23:21

Przemyslaw Szufel

40,002
3
32
62

Note that this doesn't handle the columns that are neither part of the `groupby` key nor the maximised column (eg. `q1` in the dataframe in the question). For those to be part of the output, you'll have to do some kind of `argmax`-indexing, and at that point the answer in the linked question is probably the shorter, simpler one. – Sundar R Nov 10 '22 at 05:25
@SundarR It seems that this just ignores those columns which is the expected behavior as I assume. I edited the answer to show that. Or am I misunderstood something? - please let me know! – Przemyslaw Szufel Nov 10 '22 at 11:36
OP's "Desired Output" does contain `q1`, that's why I assume ignoring it isn't the expected behaviour for them. But it's useful to have this as one of the solutions, for anyone else who does want this behaviour. – Sundar R Nov 12 '22 at 09:37
Ah OK - you are right in that case `argmax` is in duplicate questions – Przemyslaw Szufel Nov 12 '22 at 10:23

Bruno Castro · Answer 2 · 2023-06-11T21:14:14.460

Expanding on the answer by Przemyslaw Szufel.

In case anyone is interested in also keeping the rest of the columns based on the argmax you could use something like this:

using DataFrames: DataFrame
df = DataFrame(a=["a","a","b","b","b","b"], b=[1,1,2,2,3,3],c=11:16,notimportant=rand(6))

combine(groupby(df, [:a,:b])) do sdf; sdf[argmax(sdf.c), :]; end

Gives you:

3×4 DataFrame
 Row │ a       b      c      notimportant
     │ String  Int64  Int64  Float64
─────┼────────────────────────────────────
   1 │ a           1     12      0.641334
   2 │ b           2     14      0.737422
   3 │ b           3     16      0.27423

You can even put it in form of a function like:

function get_latest(df,id_symbols,sort_symbol)
    return combine(groupby(df, id_symbols)) do sdf; sdf[argmax(sdf[!,sort_symbol]), :]; end    
end

For anyone looking to keep more than 1 top result, i.e. the 5 greatest, I recommend looking at this answer.

How to filter a dataframe keeping the highest value of a certain column

2 Answers2