Extracting indices for data frame rows that have MAX value for named field

Question

I have a rather large data frame and I need a good way (explained below) to extract the indices of the rows that have the maximum value of a given field within each label. To explain this a bit better, here is an example 10-row data frame:

      value label
1  5.531637     D
2  5.826498     A
3  8.866210     A
4  1.387978     C
5  8.128505     C
6  7.391311     B
7  1.829392     A
8  4.373273     D
9  7.380244     A
10 6.157304     D

To generate:

d <- structure(list(value = c(5.531637, 5.826498, 8.86621, 1.387978, 8.128505, 
7.391311, 1.829392, 4.373273, 7.380244, 6.157304), 
label = c("D", "A", "A", "C", "C", "B", "A", "D", "A", "D")), 
.Names = c("value", "label"), class = "data.frame", row.names = c(NA, -10L))

If I want to know what the index is for rows that have the maximum value per label, I currently use the following code:

idx <- sapply(split(1:nrow(d), d$label), function(x) {
  x[which.max(d[x,"value"])]
})

Generating this answer:

A  B  C  D 
3  6  5 10

I have also played around with ddply but have yet to find a better way to do this. By "better" I mean faster (ddply is pretty slow, and what I currently use is not far behind) as well as more elegant, since the above solution seems way too wordy to me.

You forgot about one important thing: how many unique labels are in your `data.frame` (compared to data size = average number of rows per label)? If not many (say 100 labels / 100k rows) then `sapply(split())` is one of fastest methods. — Marek, May 17 '11 at 07:51
if you give code for a dataframe (thank you for that), be sure to provide a seed with `set.seed()` so our results will match yours. — Joris Meys, May 17 '11 at 09:16
@Joris My bad, just updated my post with the dput() output. @Marek I will usually be dealing with multiple sets of ~70k unique labels, 10 sets at a time. I found that versus ddply() my method is ~2x faster. — diliop, May 17 '11 at 18:47

score 6 · Accepted Answer · edited May 23 '17 at 09:59

First of all, you can get a speed-up by using:

idx <- sapply(split(seq_len(nrow(d)), d$label), function(x) {
      x[which.max(d$value[x])]})

For a 100k-row data.frame, on my machine it is 5x faster than the d[x,"value"] version.

For a large data.frame with many labels you could use a method similar to one I posted in an earlier question:

dd <- d[i<-order(d$label, d$value),] # dd is sorted by label and value
n <- nrow(dd)
ind <- c(dd$label[-1] != dd$label[-n], TRUE) # TRUE at the last (max) row of each label
idx <- setNames(seq_len(nrow(d))[i][ind], dd$label[ind])
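To make the sorted approach easier to follow, here is a minimal walk-through on the 10-row data frame from the question. It is only a sketch: it rebuilds `d` with `data.frame()` rather than `structure()`, and it uses `i[ind]`, which gives the same result as `seq_len(nrow(d))[i][ind]`.

```r
# Example data from the question
d <- data.frame(
  value = c(5.531637, 5.826498, 8.86621, 1.387978, 8.128505,
            7.391311, 1.829392, 4.373273, 7.380244, 6.157304),
  label = c("D", "A", "A", "C", "C", "B", "A", "D", "A", "D")
)

i  <- order(d$label, d$value)  # original row numbers, sorted by label then value
dd <- d[i, ]                   # within each label, the max value is now the last row
n  <- nrow(dd)

# Compare each label with the next one; a change marks the last row of a group.
# The final row always closes a group, hence the trailing TRUE.
ind <- c(dd$label[-1] != dd$label[-n], TRUE)

# Map the flagged positions back to original row numbers and name them by label
idx <- setNames(i[ind], dd$label[ind])
idx
```

On this data the result should match the `sapply(split())` output shown in the question.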

edit: A more efficient solution, using a trick from Martin Morgan's answer:

v <- d$label[i<-order(d$value)] # we need only the labels, and with Martin's
                                # trick sorting by label is not needed
ind <- !duplicated(v, fromLast=TRUE) # it finds last (max) occurrence of label
idx <- setNames(seq_len(nrow(d))[i][ind], v[ind])

NOTE: order of final vector is different.
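As an illustration of the `duplicated(..., fromLast=TRUE)` trick and of the note about ordering, here is a small sketch on the question's 10-row data. It rebuilds `d` with `data.frame()`, and the final reordering by name is an addition for comparison only, not part of the original solution.

```r
# Example data from the question
d <- data.frame(
  value = c(5.531637, 5.826498, 8.86621, 1.387978, 8.128505,
            7.391311, 1.829392, 4.373273, 7.380244, 6.157304),
  label = c("D", "A", "A", "C", "C", "B", "A", "D", "A", "D")
)

i <- order(d$value)   # row numbers from smallest to largest value
v <- d$label[i]       # labels in that order

# Scanning from the end, the first time a label is seen is its largest value
ind <- !duplicated(v, fromLast = TRUE)

idx <- setNames(i[ind], v[ind])
idx                         # ordered by increasing group maximum, not by label

# Reorder by label name to compare with the sapply(split()) result
idx_by_label <- idx[order(names(idx))]
idx_by_label
```

One difference worth keeping in mind: if a label's maximum value occurs more than once, this approach picks the last tied row (because `order()` keeps ties in their original order), whereas `which.max()` picks the first.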

It depends on your actual data structure but you should gain a nice speed-up:

Timings:

# NOTE: different machine, so timings differ from the previous ones
set.seed(6025051)
n <- 100000; k <- 20000
d <- data.frame(value=rnorm(n), 
    label=sample(paste("A",seq_len(k),sep="_"), n, replace=TRUE))

system.time(
    idx_1 <- sapply(split(1:nrow(d), d$label), function(x) {
        x[which.max(d[x,"value"])]})
)
# user  system elapsed 
# 1.30    0.02    1.31 
system.time(
    idx_1b <- sapply(split(seq_len(nrow(d)), d$label), function(x) {
        x[which.max(d$value[x])]})
)
# user  system elapsed 
# 0.23    0.00    0.23
all.equal(idx_1, idx_1b)
# [1] TRUE
system.time({
    dd <- d[i<-order(d$label, d$value),]
    ind <- c(dd$label[-1] != dd$label[-n], TRUE)
    idx_2 <- setNames(seq_len(nrow(d))[i][ind],dd$label[ind])
})
# user  system elapsed 
# 0.19    0.00    0.19 
all.equal(idx_1, idx_2)
# [1] TRUE

new solution

system.time({
    v <- d$label[i<-order(d$value)]
    ind <- !duplicated(v, fromLast=TRUE)
    idx_3 <- setNames(seq_len(nrow(d))[i][ind], v[ind])
})
# user  system elapsed 
# 0.05    0.00    0.04 
all.equal(sort(idx_1), sort(idx_3))
# [1] TRUE

That's awesome! Was debating whether I should leverage sort to do my bidding but you got it right. Quick profiling of your last solution versus my initial method gives ~30x speed increase. +1 for that. — diliop, May 18 '11 at 00:42
Nice use of `fromLast=TRUE` in your May 31 revision. That's new to me. — Aaron left Stack Overflow, May 31 '11 at 15:35
@Aaron Credits to Martin Morgan and [his answer](http://stackoverflow.com/questions/6167791/efficient-functional-programming-using-mapply-in-r-for-a-naturally-procedural/6167954#6167954) — Marek, May 31 '11 at 21:31

score 3 · Answer 2 · answered May 17 '11 at 07:44

Perhaps this may help:

tapply(seq(dim(d)[1]), d$label, function(rns){rns[which.max(d$value[rns])]} )

(note: I got this trick from the code of 'by')
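For reference, here is a small sketch of the same idea applied to the question's 10-row data. It uses `seq_len(nrow(d))` in place of `seq(dim(d)[1])`, which is a substitution on my part: the two are equivalent for a non-empty data frame, but `seq_len()` also behaves sensibly when there are zero rows.

```r
# Example data from the question
d <- data.frame(
  value = c(5.531637, 5.826498, 8.86621, 1.387978, 8.128505,
            7.391311, 1.829392, 4.373273, 7.380244, 6.157304),
  label = c("D", "A", "A", "C", "C", "B", "A", "D", "A", "D")
)

# tapply() splits the row numbers by label and applies the function per group;
# within each group, which.max() locates the largest value
idx <- tapply(seq_len(nrow(d)), d$label, function(rns) {
  rns[which.max(d$value[rns])]
})
idx
```

The result is a named one-dimensional array rather than a plain named vector, so it may need `c()` or `unlist()` before a strict comparison with the other solutions.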

Nick Sabbe

+1 for the speed increase. With some rough profiling on a small sample set of 1mil rows and ~520k unique labels, I am getting a 3x speed increase. – diliop May 17 '11 at 19:43

score 3 · Answer 3 · answered May 18 '11 at 16:33

You could speed it up a little more by writing it in C++; this question gave me the excuse to try Rcpp and inline. I'm sure the code could be written better, as this is my first go.

Here's the code:

library(Rcpp)
library(inline)

src <- '
  Rcpp::NumericVector xx(x);
  Rcpp::IntegerVector gg(g);
  Rcpp::NumericVector mx(m);
  Rcpp::IntegerVector wh(w);
  int nx = xx.size();
  for(int i = 0; i < nx; i++) {
    if( xx[i] > mx[gg[i]-1] ) {
      mx[gg[i]-1] = xx[i];
      wh[gg[i]-1] = i+1;
    }
  }
  return wh;
'

fun <- cxxfunction(signature(x="numeric", g="integer", 
                             m="numeric", w="integer"), 
                   src, plugin="Rcpp")

maxg <- function(x, g) {
  g <- factor(g)
  n <- nlevels(g)
  out <- fun(x=x, g=as.integer(g), m=rep(-Inf, n), w=integer(n))
  names(out) <- levels(g)
  out
}
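For readers without a compiler set up, the logic of the compiled loop can be mirrored in plain R. This is only a reference sketch for checking results, not a fast alternative; `maxg_r` is a hypothetical name introduced here, and it assumes there are no `NA` values in `x` or `g`.

```r
# Plain-R mirror of the compiled loop (hypothetical helper, for checking only)
maxg_r <- function(x, g) {
  g  <- factor(g)
  gi <- as.integer(g)             # group code of each element, 1..nlevels(g)
  mx <- rep(-Inf, nlevels(g))     # running maximum per group
  wh <- integer(nlevels(g))       # row number of the running maximum per group
  for (i in seq_along(x)) {
    k <- gi[i]
    if (x[i] > mx[k]) {           # strict '>' keeps the first row among ties
      mx[k] <- x[i]
      wh[k] <- i
    }
  }
  names(wh) <- levels(g)
  wh
}

# Example data from the question
d <- data.frame(
  value = c(5.531637, 5.826498, 8.86621, 1.387978, 8.128505,
            7.391311, 1.829392, 4.373273, 7.380244, 6.157304),
  label = c("D", "A", "A", "C", "C", "B", "A", "D", "A", "D")
)
res <- maxg_r(d$value, d$label)
res
```

A single pass with one running maximum per group is what lets the compiled version avoid both the split and the sort used by the other answers.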

Using Marek's data,

set.seed(6025051)
n <- 100000; k <- 20000
d <- data.frame(
  value=rnorm(n),
  label=sample(paste("A", seq_len(k), sep="_"), n, replace=TRUE)
)

it's about 4x faster than Marek's $ solution on my system.

system.time({
    idx_1b <- sapply(split(1:nrow(d), d$label), function(x) {
        x[which.max(d$value[x])]})
})
#   user  system elapsed 
#  0.209   0.000   0.208 

system.time({
  idx_c <- maxg(d$value, d$label)
})
#   user  system elapsed 
#  0.049   0.000   0.048 

all.equal(idx_1b, idx_c)
# [1] TRUE

Interestingly, Marek's additional solution (which I don't yet understand, btw) is only marginally faster than the $ solution on my system.

system.time({
  dd <- d[i <- order(d$label, d$value),]
  ind <- c(dd$label[-1] != dd$label[-n], TRUE)
  idx_2 <- setNames(seq_len(nrow(d))[i][ind],dd$label[ind])
})
#   user  system elapsed 
#  0.198   0.001   0.199
