Select rows with min value by group

Question

I got a problems that bugs me for some time… hopefully anybody here can help me.

I got the following data frame

f <- c('a','a','b','b','b','c','d','d','d','d')
v1 <- c(1.3,10,2,10,10,1.1,10,3.1,10,10)
v2 <- c(1:10)
df <- data.frame(f,v1,v2)

f is a factor; v1 and v2 are values. For each level of f, I want only want to keep one row: the one that has the lowest value of v1 in this factor level.

I tried various things with aggregate, ddply, by, tapply… but nothing seems to work. For any suggestions, I would be very thankful.

score 32 · Accepted Answer · answered Nov 16 '10 at 00:47

Using DWin's solution, tapply can be avoided using ave.

df[ df$v1 == ave(df$v1, df$f, FUN=min), ]

This gives another speed-up, as shown below. Mind you, this is also dependent on the number of levels. I give this as I notice that ave is far too often forgotten about, although it is one of the more powerful functions in R.

f <- rep(letters[1:20],10000)
v1 <- rnorm(20*10000)
v2 <- 1:(20*10000)
df <- data.frame(f,v1,v2)

> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
   user  system elapsed 
   0.05    0.00    0.05 

> system.time(df[ df$v1 %in% tapply(df$v1, df$f, min), ])
   user  system elapsed 
   0.25    0.03    0.29 

> system.time(lapply(split(df, df$f), FUN = function(x) {
+             vec <- which(x[3] == min(x[3]))
+             return(x[vec, ])
+         })
+  .... [TRUNCATED] 
   user  system elapsed 
   0.56    0.00    0.58 

> system.time(df[tapply(1:nrow(df),df$f,function(i) i[which.min(df$v1[i])]),]
+ )
   user  system elapsed 
   0.17    0.00    0.19 

> system.time( ddply(df, .var = "f", .fun = function(x) {
+     return(subset(x, v1 %in% min(v1)))
+     }
+ )
+ )
   user  system elapsed 
   0.28    0.00    0.28

This is clearly the best answer! Great that you added the timings too. — Eduardo Leoni, Nov 16 '10 at 02:54
Thanks for that answer and the timings. How would I adapt that to more than one factor? Let's say I have f1 and f2 as factors, and I want the min-value for each combination... I tried both c() and list(), but it didn't work. — donodarazao, Nov 16 '10 at 04:08
@donodarazao : see `?ave` : ave(x,factor1,factor2,factor3,factor4,...,FUN=min) — Joris Meys, Nov 16 '10 at 04:25
@hadley: thx, I didn't even realize I corrected a tricky bug. — Joris Meys, Nov 16 '10 at 04:26

mnel · Answer 2 · 2013-10-28T22:10:39.667

A data.table solution.

library(data.table)
DT <- as.data.table(df)
DT[,.SD[which.min(v1)], by = f]

##   f  v1 v2
## 1: a 1.3  1
## 2: b 2.0  3
## 3: c 1.1  6
## 4: d 3.1  8

Or, more efficiently

DT[DT[,.I[which.min(v1)],by=f][['V1']]]

some benchmarking

f <- rep(letters[1:20],100000)
v1 <- rnorm(20*100000)
v2 <- 1:(20*100000)
df <- data.frame(f,v1,v2)
DT <- as.data.table(df)
f1<-function(){df2<-df[order(df$f,df$v1),]
               df2[!duplicated(df2$f),]}

f2<-function(){df2<-df[order(df$v1),]
               df2[!duplicated(df2$f),]}

f3<-function(){df[ df$v1 == ave(df$v1, df$f, FUN=min), ]}


f4 <- function(){DT[,.SD[which.min(v1)], by = f]}

f5 <- function(){DT[DT[,.I[which.min(v1)],by=f][['V1']]]}

library(microbenchmark)
microbenchmark(f1(),f2(),f3(),f4(), f5(),times = 5)
# Unit: milliseconds
# expr       min        lq    median        uq       max neval
# f1() 3254.6620 3265.4760 3286.5440 3411.4054 3475.4198     5
# f2() 1630.8572 1639.3472 1651.5422 1721.4670 1738.6684     5
# f3()  172.2639  174.0448  177.4985  179.9604  184.7365     5
# f4()  206.1837  209.8161  209.8584  210.4896  210.7893     5
# f5()  105.5960  106.5006  107.9486  109.7216  111.1286     5

The .I approach is the winner (FR #2330 will hopefully render the elegance of the .SD approach similarly fast when implemented).

score 8 · Answer 3 · answered Nov 15 '10 at 23:33

8

With plyr, I'd use:

ddply(df, .var = "f", .fun = function(x) {
    return(subset(x, v1 %in% min(v1)))
    }
)

Give that a try and see if it returns what you want.

answered Nov 15 '10 at 23:33

Matt Parker

26,709
7
54
72

9

Or more simply: `ddply(df, "f", subset, v1 == min(v1))` – hadley Nov 16 '10 at 02:49
1

Well, solution is pretty, but with many levels and few values per level only is extremely slow... – JelenaČuklina Dec 22 '15 at 13:39

score 6 · Answer 4 · edited Aug 02 '23 at 14:37

This is the dplyr-way to filter for the minimum v1 values by groups of f:

library(dplyr)
df |>
  group_by(f) |>
  slice_min(v1)

See the ?slice_min help page for options such as whether or not to include ties (default includes them), or options to keep more than 1 lowest value (e.g., bottom 5 or bottom 10%).

You can also do it more explicitly:

df %>%
  group_by(f) %>%
  filter(v1 == min(v1))

#Source: local data frame [4 x 3]
#Groups: f
#
#  f  v1 v2
#1 a 1.3  1
#2 b 2.0  3
#3 c 1.1  6
#4 d 3.1  8

In cases of ties in v1, this would result in multiple rows per group of f. If you want to avoid that, you can use:

df %>% 
  group_by(f) %>% 
  filter(rank(v1, ties.method= "first") == 1)

This way, you'll only get the first row in case of ties. You could alternatively use ties.method = "random" or others as described in the help file.

mbq · Answer 5 · 2010-11-16T14:04:01.173

6

Another tapply solution, with no unnecessary scanning of vector with %in%:

df[tapply(1:nrow(df),df$f,function(i) i[which.min(df$v1[i])]),]

EDIT: This will left only first row in case of a tie.

EDIT2: Impressed by ave, I've made additional improvements:

df[sapply(split(1:nrow(df),df$f),function(x) x[which.min(df$v1[x])]),]

On my machine (using Joris' benchmark data):

> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
   user  system elapsed
  0.022   0.000   0.021
> system.time(df[sapply(split(1:nrow(df),df$f),function(x) x[which.min(df$v1[x])]),])
   user  system elapsed
  0.006   0.000   0.007

edited Nov 16 '10 at 14:04

answered Nov 16 '10 at 00:13

mbq

18,510
6
49
72

Well, mbq, I voted yours up. And like Matt with mine, I had to work yours through to "see" its inner workings. – IRTFM Nov 16 '10 at 00:28
@DWin Sorry for this %in% remark, I maybe tend to overestimate efficiency :| And I agree that all solutions are pretty complicated; `by` solution is IMO quite readable but the result is awful (-; – mbq Nov 16 '10 at 00:41

IRTFM · Answer 6 · 2010-11-16T00:11:05.387

3

Here's a tapply solution;

> df[ df$v1 %in% tapply(df$v1, df$f, min), ]

  f  v1 v2
1 a 1.3  1
3 b 2.0  3
6 c 1.1  6
8 d 3.1  8

In your example it only picks out one per group, but if there were ties this method would show them all. (As would Parker's and Luštrik's I suspect.)

edited Nov 16 '10 at 00:11

answered Nov 16 '10 at 00:06

IRTFM

258,963
21
364
487

Excellent point regarding ties. Nice function, too - took me a few reads and a glance at ?tapply to figure out what was going on. – Matt Parker Nov 16 '10 at 00:13
Wow, amazing! The ddply-solution from Matt works, but took around 2 min in my real data frame (~10,000 rows). This solution gives the same result, but takes less than a 1 second. Pretty elegant, thanks! – donodarazao Nov 16 '10 at 00:29
4

Something is going wrong if ddply takes 2 minutes on 10,000 observations. Plus this method is not always going to return the correct results - think about the case where 2 is the lowest value for one group and the second lowest for another group. It's just luck that it works for this example. – hadley Nov 16 '10 at 02:51
Maybe it's because there are 7,700 factor levels in the 10,000 observations? I tried it again with your ddply-solution and it really takes that long... – donodarazao Nov 16 '10 at 04:08
As per hadley's point, this solution is incorrect. Test on input `df = data.frame(f = c("a", "a", "b", "b"), v1 = c(2, 3, 1, 2))` - correct result is a 2-row data frame, this code produces 3 rows. The answer really should be corrected or deleted. – Gregor Thomas Aug 02 '23 at 14:49

score 2 · Answer 7 · answered Mar 06 '13 at 06:51

2

Here is a solution with by

do.call(rbind, unname(by(df, df$f, function(x) x[x$v1 == min(x$v1),])))
##   f  v1 v2
## 1 a 1.3  1
## 3 b 2.0  3
## 6 c 1.1  6
## 8 d 3.1  8

answered Mar 06 '13 at 06:51

Matthew Lundberg

42,009
6
90
112

score 2 · Answer 8 · answered Nov 15 '10 at 23:44

2

I'm sorry, my thinking power is depleted, and this ugly solution is all I can come up with at almost 1 am.

lapply(split(df, df$f), FUN = function(x) {
            vec <- which(x[3] == min(x[3]))
            return(x[vec, ])
        })

answered Nov 15 '10 at 23:44

Roman Luštrik

69,533
24
154
197

James · Answer 9 · 2010-11-16T16:56:25.047

2

Another way is to use order and !duplicated, but you would only get the first on ties.

df2 <- df[order(df$f,df$v1),]
df2[!duplicated(df2$f),]

  f  v1 v2
1 a 1.3  1
3 b 2.0  3
6 c 1.1  6
8 d 3.1  8

Timings

f1<-function(){df2<-df[order(df$f,df$v1),]
df2[!duplicated(df2$f),]}

f2<-function(){df2<-df[order(df$v1),]
df2[!duplicated(df2$f),]}

f3<-function(){df[ df$v1 == ave(df$v1, df$f, FUN=min), ]}

library(rbenchmark)
> benchmark(f1(),f2(),f3())
  test replications elapsed relative user.self sys.self user.child sys.child
1 f1()          100   38.16 7.040590     36.66     1.48         NA        NA
2 f2()          100   20.54 3.789668     19.30     1.23         NA        NA
3 f3()          100    5.42 1.000000      4.96     0.46         NA        NA

edited Nov 16 '10 at 16:56

answered Nov 16 '10 at 15:54

James

65,548
14
155
193

Order over `v1` is sufficient. Nice solution, how about timing? – Marek Nov 16 '10 at 16:06
@Marek Thanks, didn't think about not really needing to order on `f`. This seems to speed it up by about 2x, but its still a fair bit slower than Joris Meys' `ave` solution. – James Nov 16 '10 at 16:51

RCchelsie · Answer 10 · 2021-09-15T18:51:04.983

0

Using tidyverse

df %>%
  arrange(v1) %>% # You can also do arrange(f, v1)
  distinct(f, .keep_all = TRUE)

I also like the previous answer from @talat

df %>%
  group_by(f) %>%
  filter(v1 == min(v1))

but the first one avoid grouping and ungrouping.

edited Sep 15 '21 at 18:51

answered Sep 08 '21 at 20:23

RCchelsie

111
6

1

It is my personal opinion that it looks very similar to the answer of @talat. – Guannan Shen Sep 08 '21 at 22:12

Select rows with min value by group

10 Answers10

some benchmarking

Linked

Related