3

Given the dataframe below, I want to filter records that shares the same q2, id_q, check_id and keep only the ones with the highest value.

input dataframe:

q1 q2 id_q check_id value
sdfsdf dfsdfsdf 10 10 90
hdfhhd dfsdfsdf 10 10 80

There are 2 q2 with same id_q, check_id but with different values: 90,80.

I want to return for the same q2, id_q, check_id the line with the highest value. For example above the output is:

So I want to drop duplicates regarding to: check_id and id_q and keep the one with the highest value of valuecolumn

Desired Output:

q1 q2 id_q check_id value
sdfsdf dfsdfsdf 10 10 90
Sundar R
  • 13,776
  • 6
  • 49
  • 76
noob4ever
  • 117
  • 5
  • Looks like you want to `groupby` by your wanted columns `q2, id_q, check_id` and aggregate column `value` by the `max` function. – Jonas Palačionis Nov 09 '22 at 08:37
  • 4
    Does this answer your question? [Select rows of a DataFrame containing minimum of grouping variable in Julia](https://stackoverflow.com/questions/65024962/select-rows-of-a-dataframe-containing-minimum-of-grouping-variable-in-julia) – Nils Gudat Nov 09 '22 at 09:34
  • 3
    This is basically a duplicate of https://stackoverflow.com/questions/65024962/select-rows-of-a-dataframe-containing-minimum-of-grouping-variable-in-julia (with max instead of min) – Nils Gudat Nov 09 '22 at 09:34

2 Answers2

0

For this case this code seems to be shorter that the ones referenced in other answers:

Suppose you have

julia> df = DataFrame(a=["a","a","b","b","b","b"], b=[1,1,2,2,3,3],c=11:16,notimportant=rand(6))
6×4 DataFrame
 Row │ a       b      c      notimportant
     │ String  Int64  Int64  Float64
─────┼────────────────────────────────────
   1 │ a           1     11      0.93785
   2 │ a           1     12      0.877777
   3 │ b           2     13      0.845306
   4 │ b           2     14      0.477606
   5 │ b           3     15      0.722569
   6 │ b           3     16      0.122807

Than you can just do:

julia> combine(groupby(df, [:a, :b]), :c => maximum => :c)
3×3 DataFrame
 Row │ a       b      c
     │ String  Int64  Int64
─────┼──────────────────────
   1 │ a           1     12
   2 │ b           2     14
   3 │ b           3     16
Przemyslaw Szufel
  • 40,002
  • 3
  • 32
  • 62
  • Note that this doesn't handle the columns that are neither part of the `groupby` key nor the maximised column (eg. `q1` in the dataframe in the question). For those to be part of the output, you'll have to do some kind of `argmax`-indexing, and at that point the answer in the linked question is probably the shorter, simpler one. – Sundar R Nov 10 '22 at 05:25
  • @SundarR It seems that this just ignores those columns which is the expected behavior as I assume. I edited the answer to show that. Or am I misunderstood something? - please let me know! – Przemyslaw Szufel Nov 10 '22 at 11:36
  • OP's "Desired Output" does contain `q1`, that's why I assume ignoring it isn't the expected behaviour for them. But it's useful to have this as one of the solutions, for anyone else who does want this behaviour. – Sundar R Nov 12 '22 at 09:37
  • Ah OK - you are right in that case `argmax` is in duplicate questions – Przemyslaw Szufel Nov 12 '22 at 10:23
0

Expanding on the answer by Przemyslaw Szufel.

In case anyone is interested in also keeping the rest of the columns based on the argmax you could use something like this:

using DataFrames: DataFrame
df = DataFrame(a=["a","a","b","b","b","b"], b=[1,1,2,2,3,3],c=11:16,notimportant=rand(6))

combine(groupby(df, [:a,:b])) do sdf; sdf[argmax(sdf.c), :]; end

Gives you:

3×4 DataFrame
 Row │ a       b      c      notimportant
     │ String  Int64  Int64  Float64
─────┼────────────────────────────────────
   1 │ a           1     12      0.641334
   2 │ b           2     14      0.737422
   3 │ b           3     16      0.27423

You can even put it in form of a function like:

function get_latest(df,id_symbols,sort_symbol)
    return combine(groupby(df, id_symbols)) do sdf; sdf[argmax(sdf[!,sort_symbol]), :]; end    
end

For anyone looking to keep more than 1 top result, i.e. the 5 greatest, I recommend looking at this answer.