Select rows of a DataFrame containing minimum of grouping variable in Julia

Question

I'm wondering if there is an efficient way to do the following in Julia:

I have a DataFrame of the following form:

julia> using DataFrames

julia> df1 = DataFrame(var1=["a","a","a","b","b","b","c","c","c"],
                var2=["p","q","r","p","p","r","q","p","p"],
                var3=[1,2,3,2,5,4,6,7,8])
9×3 DataFrame
│ Row │ var1   │ var2   │ var3  │
│     │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1   │ a      │ p      │ 1     │
│ 2   │ a      │ q      │ 2     │
│ 3   │ a      │ r      │ 3     │
│ 4   │ b      │ p      │ 2     │
│ 5   │ b      │ p      │ 5     │
│ 6   │ b      │ r      │ 4     │
│ 7   │ c      │ q      │ 6     │
│ 8   │ c      │ p      │ 7     │
│ 9   │ c      │ p      │ 8     │

And I want to return a DataFrame that contains the same columns but only the rows where var3 has its minimum value within groups according to var1.

I have tried using the split-apply-combine approach but couldn't seem to find a way to filter the rows while returning all columns.

Appreciate any help on this.

Also, please note that you are using a DataFrames.jl version that is no longer maintained. I recommend updating to the latest version, 0.22. — Bogumił Kamiński, Nov 26 '20 at 16:29

score 5 · Answer 1 · answered Nov 26 '20 at 16:02

One possible way to do it:

julia> DataFrame([g[findmin(g.var3)[2],:] for g in groupby(df1, :var1)])
3×3 DataFrame
│ Row │ var1   │ var2   │ var3  │
│     │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1   │ a      │ p      │ 1     │
│ 2   │ b      │ p      │ 2     │
│ 3   │ c      │ q      │ 6     │
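
To see why the comprehension above indexes with findmin(g.var3)[2], here is a minimal sketch in Base Julia. The vector v is made up for illustration and is not part of the question's data.

```julia
# Base Julia only; v is a hypothetical vector with a tied minimum.
v = [4, 2, 5, 2]

# findmin returns a (minimum, index) tuple, so [2] selects the index.
val, idx = findmin(v)
@show val idx

# With ties, the index of the first minimum is reported,
# which is the same index that argmin returns.
@show findmin(v)[2] == argmin(v)
```

Because only the first index is reported, this approach returns exactly one row per group even when the minimum is tied.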

score 5 · Accepted Answer · answered Nov 26 '20 at 16:14

An alternative way to do it, if the minimum of :var3 is never tied within a group, is:

julia> combine(sdf -> sdf[argmin(sdf.var3), :], groupby(df1, :var1))
3×3 DataFrame
 Row │ var1    var2    var3
     │ String  String  Int64
─────┼───────────────────────
   1 │ a       p           1
   2 │ b       p           2
   3 │ c       q           6

If the minimum may be tied and you want to keep every tied row, then use:

julia> combine(sdf -> filter(:var3 => ==(minimum(sdf.var3)), sdf), groupby(df1, :var1))
3×3 DataFrame
 Row │ var1    var2    var3
     │ String  String  Int64
─────┼───────────────────────
   1 │ a       p           1
   2 │ b       p           2
   3 │ c       q           6

instead.
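
The example data has no ties, so both calls print the same table. The sketch below uses a hypothetical DataFrame, df_ties (not from the question), in which group "b" has two rows tied at the minimum, to show where the two calls differ.

```julia
using DataFrames

# Hypothetical data: group "b" has two rows tied at the minimum var3 = 2.
df_ties = DataFrame(var1=["a", "a", "b", "b", "b"],
                    var2=["p", "q", "p", "q", "r"],
                    var3=[1, 2, 2, 2, 4])

gdf = groupby(df_ties, :var1)

# argmin reports only the first minimum, so each group contributes one row.
one_per_group = combine(sdf -> sdf[argmin(sdf.var3), :], gdf)

# Comparing against the group minimum keeps every tied row.
all_ties = combine(sdf -> filter(:var3 => ==(minimum(sdf.var3)), sdf), gdf)

@show nrow(one_per_group)
@show nrow(all_ties)
```

With this data the argmin version yields two rows and the filter version yields three, since both tied rows of group "b" are kept.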

Another example handling duplicates correctly is:

julia> combine(sdf -> first(groupby(sdf, :var3, sort=true)), groupby(df1, :var1))
3×3 DataFrame
 Row │ var1    var2    var3
     │ String  String  Int64
─────┼───────────────────────
   1 │ a       p           1
   2 │ b       p           2
   3 │ c       q           6

It is not very efficient in this case, but it shows you how you can work with groupby in DataFrames.jl.
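
As a rough illustration of what the inner call does for a single group, the sketch below applies it to a small hypothetical table, sdf_demo (the name and data are assumptions, not from the question). With sort=true the groups are ordered by key, so the first group holds all rows sharing the smallest :var3.

```julia
using DataFrames

# Hypothetical stand-in for one group of the outer groupby.
sdf_demo = DataFrame(var2=["p", "q", "r"], var3=[5, 2, 2])

# sort=true orders the groups by their :var3 key.
inner = groupby(sdf_demo, :var3, sort=true)

# The first group therefore contains every row with the minimum :var3.
smallest = first(inner)

@show length(inner)
@show smallest.var2
@show smallest.var3
```

Building a second grouping inside every outer group is the extra work that makes this slower than comparing against minimum(sdf.var3) directly.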

If I have duplicates (ties) and would like to keep both rows when that happens, will the first option do that? — Kayvon Coffey, Nov 26 '20 at 17:03
No - this is the point. The first option picks only one row like the solution by Przemysław. — Bogumił Kamiński, Nov 26 '20 at 17:55
