17

I'm grouping a data.table and want to select from each group the first row where x == 1 or, if such a row does not exist, then the first row with any value in x

d <- data.table(
           a = c(1,1,1,  2,2,  3,3), 
           x = c(0,1,0,  0,0,  1,1), 
           y = c(1,2,3,  1,2,  1,2)
)

this attempt

d[, ifelse(any(.SD[,x] == 1),.SD[x == 1][1], .SD[1]), by = a]

returns

   a V1
1: 1  1
2: 2  0
3: 3  1

but i expected

   a  x  y
1: 1  1  2
2: 2  0  1
3: 3  1  1

Any ideas how to get it right?

Steffen J.
  • 694
  • 5
  • 16
  • `ifelse` will return the same length as the length of the first argument.. (see `?ifelse`) which in this case is single logical value. I use `ifelse` as a last resort. Even then, in most cases I don't. And I recommend to stay away from it. Besides all these things to remember, it's slow as well. – Arun Aug 25 '16 at 20:22

3 Answers3

15

I think it's a good use case for both match and it's nomatch argument

d[, .SD[match(1L, x, nomatch = 1L)], by = a]
#    a x y
# 1: 1 1 2
# 2: 2 0 1
# 3: 3 1 1

This is basically, in case of no-match, returns 1, and as a result gives you the first row in the group. If there is a multiple match, then it will return the first one- as per your desire

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
15

Another option (which.max is basically designed to do exactly what you want):

d[, .SD[which.max(x == 1)], by = a]
#   a x y
#1: 1 1 2
#2: 2 0 1
#3: 3 1 1
eddi
  • 49,088
  • 6
  • 104
  • 155
6

We can also do this with .I to return the row index and use that for subsetting the rows.

d[d[, .I[which.max(x==1)], by = a]$V1]
#   a x y
#1: 1 1 2
#2: 2 0 1
#3: 3 1 1

In the current versions of data.table, .I approach is more efficient compared to the .SD for subsetting rows (However, it could change in the future). This is also a similar post


Here is another option with order (setkey can also be used - for efficiency) the dataset by 'a' and 'x' after grouping by 'a', and then get the first row with head

d[order(a ,-x), head(.SD, 1) ,by = a]
#   a x y
#1: 1 1 2
#2: 2 0 1
#3: 3 1 1

Benchmarks

Initially, we were thinking about benchmarking on > 1e6, but the .SD methods are taking time, so comparing on 3e5 rows using data.table_1.9.7

set.seed(24)
d1 <- data.table(a = rep(1:1e5, 3), x = sample(0:1, 1e5*3, 
           replace=TRUE), y = rnorm(1e5*3))

system.time(d1[, .SD[which.max(x == 1)], by = a])
#   user  system elapsed 
#  56.21   30.64   86.42 

system.time(d1[, .SD[match(1L, x, nomatch = 1L)], by = a])
# user  system elapsed 
#  55.27   30.07   83.75 

system.time(d1[d1[, .I[which.max(x==1)], by = a]$V1])
#  user  system elapsed 
#   0.19    0.00    0.19 


system.time(d1[order(a ,-x), head(.SD, 1) ,by = a])
# user  system elapsed 
#   0.03    0.00    0.04 
Community
  • 1
  • 1
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 4
    A familiar variation from http://stackoverflow.com/a/16574176/ Might be worth mentioning the probable performance improvement vs the .SD[stuff] way. – Frank Aug 25 '16 at 22:27
  • thanks! didn't know about .I before. Any explanation why my attempt with the ifelse statement doesn't work? – Steffen J. Aug 26 '16 at 06:11
  • @Frank I'm not sure that in the newer versions (>=1.9.7) there is still a performance gain – David Arenburg Aug 26 '16 at 07:00
  • @SteffenJ. as Arun commented in the post, it is better not to use `ifelse`. Here, you want to subset only a single row for each 'a'. In that case, you can use `if`. i.e. `d[, if(any(.SD[,x] == 1)) .SD[x == 1][1] else .SD[1], by = a]` However, this could be slow because we are using `.SD` three times and not recommended. Another case is `ifelse` works on `vectors` – akrun Aug 26 '16 at 07:00
  • On version 1.9.7? – David Arenburg Aug 26 '16 at 07:04
  • 4
    Eitherway, your answer is just a verbatim of both eddis answers. I don't think you should have posted this as your own. – David Arenburg Aug 26 '16 at 07:06
  • Just for reference, optimizing `.SD[which.max(...)]` is on TODO in [data.table#523](https://github.com/Rdatatable/data.table/issues/523). – jangorecki Aug 26 '16 at 10:11
  • `d[order(a ,-x), head(.SD, 1) ,by = a]` won't keep the original order of `x` so I don't think it will give desired result in all cases as OP wants first row rather the highest value. If order doesn't matter you could as well just do `unique(d[order(-x)], by = "a")` – David Arenburg Aug 26 '16 at 11:01
  • The `order` answer is clearly very limited/wrong. It works iff 1 is the largest value. – eddi Aug 26 '16 at 15:14
  • @eddi I mentioned to DavidArenburg earlier (comment deleted) that it works for the binary variable. – akrun Aug 26 '16 at 15:15
  • 2
    Yeah, I don't think anyone would believe you copied it verbatim. We've all internalized the idiom and used it many times in our answers and work, I think (though I always try to cite the original, since I'm sure I'd never have thought of it on my own). – Frank Sep 02 '16 at 20:00