4

Possible Duplicate:
Only keep min value for each factor level

Here is my problem: I want to select, for each level of one column, the row with the minimum value in another specified column. For example:

df <- data.frame(A=c("a","a","b","b"),value=1:4)

The result I want is

 A value
 a     1
 b     3

I can do this with by and ddply, but both are quite slow when df is huge and has many different values in A.

do.call(rbind,by(df,df$A, function(x) x[which.min(abs(x$value)),],simplify=FALSE))

ddply(df, ~A, function(x){x[which.min(abs(x$value)),]})

Any suggestions?

Thanks a lot!

ccshao
  • I selected that possible duplicate because it has some benchmarks in it. Hope it helps. – Matt Dowle Nov 21 '12 at 18:19
  • Btw, in case some search tricks help, I found that question by searching for "[r] +which.min +benchmark" which returned that single question. The trick I suppose is going from the word "efficient" to searching for "benchmark". – Matt Dowle Nov 21 '12 at 18:30

2 Answers

2

data.table is quite fast for large data.frames if you set the key.

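library(data.table)
dt <- data.table(df, key="A")  # keying on A sorts the table by A
# one row per level of A, holding that group's minimum value
dt[, list(value=min(value)), by=A]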


Erik Shilts
  • 1
    Interesting. How do I return other columns besides "value" and "A"? – ccshao Nov 21 '12 at 18:43
  • You can have multiple "by" columns by passing the column names as a vector (e.g. `by=c("A", "Bcolumn", "Ccolumn")`). You can compute multiple statistics by including them in the list call (e.g. `list(min_value=min(value), max_value=max(value))`). – Erik Shilts Nov 21 '12 at 18:49
  • 1
    Sorry, I didn't make myself clear. Suppose there is a third column, "B", in df. With that command I only get "value" and "A"; how do I make it output column "B" as well? – ccshao Nov 21 '12 at 18:54
  • That depends on what you want to calculate. If you want the minimum by A and B then you'll want to use my `by` syntax above. If you want to calculate something on B then you'll use the `list` syntax but replace value with B. If you want something else then example data would help as the syntax will differ depending on what you want. – Erik Shilts Nov 21 '12 at 19:00
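For the case raised in the comments, where the whole row (including an extra column such as B) should come back rather than just the minimum, one possible data.table sketch is shown here. The column B and its contents are made up for illustration, and the remark about relative speed is a commonly reported tendency, not something measured in this thread.

```r
library(data.table)

# Hypothetical data: B stands in for any extra column to carry along.
df <- data.frame(A = c("a", "a", "b", "b"),
                 B = c("w", "x", "y", "z"),
                 value = c(2, 1, 3, 4))
dt <- data.table(df, key = "A")

# .SD is the subset of rows for the current group;
# which.min() picks a single row from it (the first one on ties).
res_sd <- dt[, .SD[which.min(value)], by = A]

# Alternative: collect the winning row numbers per group with .I,
# then subset the table once. Often reported to be faster on large tables.
idx <- dt[, .I[which.min(value)], by = A]$V1
res_i <- dt[idx]
```

Both results keep every column of the selected rows, so B is returned alongside A and value.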
0

tapply does this:

> tapply(df$value, df$A, min)
a b 
1 3 
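Note that tapply returns a named vector rather than a data frame. If a data frame shaped like the one in the question is wanted, base R's aggregate is one option. A minimal sketch on the question's df:

```r
df <- data.frame(A = c("a", "a", "b", "b"), value = 1:4)

# The formula interface groups value by A and applies min to each group.
# The result is a data frame with one row per level of A.
res <- aggregate(value ~ A, data = df, FUN = min)
```

Like the tapply call, this only returns the grouping column and the minimum, not the other columns of the matching row.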

Edited: Using by instead of tapply, we can retain the row names:

df <- data.frame(A=c("a","a","b","b"),value=11:14)
df
##   A value
## 1 a    11
## 2 a    12
## 3 b    13
## 4 b    14

do.call(rbind, unname(by(df, df$A, function(x) x[x$value == min(x$value),])))
##   A value
## 1 a    11
## 3 b    13
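Because the filter is `x$value == min(x$value)`, a group with tied minima returns every tied row. If exactly one row per group is wanted, and calling a function once per group is the bottleneck, a base R sketch using order and duplicated may be worth trying. The data below is made up to include an unsorted group and a tie; no timings are claimed here.

```r
df <- data.frame(A = c("a", "a", "b", "b"), value = c(12, 11, 13, 13))

# Sort by group, then by value, so each group's minimum comes first.
# order() leaves unresolved ties in their original order.
sorted <- df[order(df$A, df$value), ]

# duplicated() flags every row after the first within each A; drop those.
res <- sorted[!duplicated(sorted$A), ]
```

The original row names are retained, so `rownames(res)` still identifies which rows of df were selected.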
Matthew Lundberg