
I have a data.table and I am trying to do something akin to data[ !is.na(variable) ]. However, for groups in which the variable is entirely missing, I'd like to keep just the first row of the group. So I am trying to subset by group using by. I have done some research online and have a solution, but I think it is inefficient.

I've provided an example below showing what I am hoping to achieve, and I wonder if this can be done without creating the two extra columns.

library(data.table)

d_sample = data.table( ID = c(1, 1, 2, 2, 3, 3),
                       Time = c(10, 15, 100, 110, 200, 220),
                       Event = c(NA, NA, NA, 1, 1, NA) )

# flag rows where Event is observed
d_sample[ !is.na(Event), isValidOutcomeRow := T, by = ID]
# TRUE for IDs with at least one observed Event, NA for IDs that are entirely missing
d_sample[ , isValidOutcomePatient := any(isValidOutcomeRow), by = ID]
# for entirely-missing IDs, flag only the first row
d_sample[ is.na(isValidOutcomePatient), isValidOutcomeRow := c(T, rep(NA, .N - 1)), by = ID]
d_sample[ isValidOutcomeRow == T ]
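
On the example data, the last line returns roughly the following (helper columns included). The goal is these same three rows, ideally without creating the two extra columns:

   ID Time Event isValidOutcomeRow isValidOutcomePatient
1:  1   10    NA              TRUE                    NA
2:  2  110     1              TRUE                  TRUE
3:  3  200     1              TRUE                  TRUE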

EDIT: Here are some speed comparisons of thelatemail's and Frank's solutions on a larger dataset with 60K rows.

d_sample = data.table( ID = sort(rep(seq(1, 30000), 2)),
                       Time = rep(c(10, 15, 100, 110, 200, 220), 10000),
                       Event = rep(c(NA, NA, NA, 1, 1, NA), 10000) )

thelatemail's solution has a runtime of about 20.65 seconds on my computer:

system.time(d_sample[, if(all(is.na(Event))) .SD[1] else .SD[!is.na(Event)][1], by=ID])

Frank's first solution has a runtime of 0 (too fast for system.time() to register):

system.time( unique( d_sample[order(is.na(Event))], by="ID" ) )

Frank's second solution has a runtime of 0.05 seconds:

system.time( d_sample[order(is.na(Event)), .SD[1L], by=ID] )
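
For timings this small, a single system.time() call is noisy. A sketch of a more stable comparison, assuming the microbenchmark package is installed (the labels are mine):

library(microbenchmark)
microbenchmark(
  thelatemail = d_sample[, if (all(is.na(Event))) .SD[1] else .SD[!is.na(Event)][1], by = ID],
  frank1      = unique( d_sample[order(is.na(Event))], by = "ID" ),
  frank2      = d_sample[order(is.na(Event)), .SD[1L], by = ID],
  times = 5   # a handful of repetitions is enough here
)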
vryb
2 Answers


This seems to work:

unique( d_sample[order(is.na(Event))], by="ID" )

   ID Time Event
1:  2  110     1
2:  3  200     1
3:  1   10    NA

Alternatively, d_sample[order(is.na(Event)), .SD[1L], by=ID].
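
Why the ordering trick works: is.na(Event) is FALSE for observed rows and TRUE for missing ones, and FALSE sorts before TRUE, so after order(is.na(Event)) the first occurrence of each ID is a non-NA row whenever that ID has one; unique(..., by = "ID") (or .SD[1L]) then keeps exactly that first occurrence. A tiny illustration:

is.na(c(NA, 1, NA))          # TRUE FALSE TRUE
order(is.na(c(NA, 1, NA)))   # 2 1 3 -- the non-NA element moves to the front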


Extending the OP's example, I also find similar timings for my two approaches:

n = 12e4 # must be a multiple of 6
set.seed(1)
d_sample = data.table( ID = sort(rep(seq(1, n/2), 2)),
                       Time = rep(c(10, 15, 100, 110, 200, 220), n/6),
                       Event = rep(c(NA, NA, NA, 1, 1, NA), n/6) )

system.time(rf <- unique( d_sample[order(is.na(Event))], by="ID" ))
# 1.17
system.time(rf2 <- d_sample[order(is.na(Event)), .SD[1L], by=ID] )   
# 1.24
system.time(rt <- d_sample[, if(all(is.na(Event))) .SD[1] else .SD[!is.na(Event)], by=ID])    
# 10.42
system.time(rt2 <- 
    d_sample[ d_sample[, { w = which(is.na(Event)); .I[ if (length(w) == .N) 1L else -w ] }, by=ID]$V1 ] 
)
# .13

# verify
identical(rf,rf2) # TRUE
identical(rf,rt) # FALSE
fsetequal(rf,rt) # TRUE
identical(rt,rt2) # TRUE

The variation on @thelatemail's solution, `rt2`, is the fastest by a wide margin.
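
As mentioned in the comments below, the setup can also be wrapped as a function of n so the benchmark is easy to re-run at other sizes. A minimal sketch (the helper name make_sample is mine):

make_sample = function(n) {   # n must be a multiple of 6
  data.table( ID    = sort(rep(seq(1, n/2), 2)),
              Time  = rep(c(10, 15, 100, 110, 200, 220), n/6),
              Event = rep(c(NA, NA, NA, 1, 1, NA), n/6) )
}
d_sample = make_sample(12e4)  # same data as above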

Frank
  • Clever - any indication if this is quicker? (I suspect it is) – thelatemail Oct 18 '16 at 00:18
  • @thelatemail Nope, I have no evidence of it. I know that `.SD[1]` when on its own has been optimized, so I suspect it's quicker (just because data.table won't recognize that your code in many cases simplifies to `.SD[1]`). – Frank Oct 18 '16 at 00:20
  • @thelatemail Oh, my code is wrong -- I should be keeping all rows for groups that don't have NAs, right? – Frank Oct 18 '16 at 00:21
  • I didn't think so, going by OP's expected result - maybe they can clarify? – thelatemail Oct 18 '16 at 00:24
  • maybe also `unique(d_sample[sort.list(Event)], by = "ID")` in order to avoid `is.na`. Also you could specify `method = "radix"` if `Event` is an integer (which seems to use `is.na` either way, so never mind I guess) – David Arenburg Oct 18 '16 at 00:43
  • @Frank If the same ID only has missing `Event` values then I want to only keep the first row (which is still missing). Is this clear? I think all of your solutions work. – vryb Oct 18 '16 at 00:44
  • @DavidArenburg When I turn on verbose=TRUE, I see that order is replaced by forder. Does this achieve what you are expecting with method="radix"? – Frank Oct 18 '16 at 00:45
  • @vryb Ok. My code keeps exactly one row per group, even if there are multiple non-NA rows available. – Frank Oct 18 '16 at 00:46
  • @Frank Exactly, one row per ID no matter what. Your fastest solution also works really well when ID is keyed, as the unique() returns d_sample in order. – vryb Oct 18 '16 at 00:54
  • @vryb Ok, thanks for clarifying. Fyi, always good practice to verify that the results of the alternatives being benchmarked give the same result (as shown here). I also converted it to a function of `n` here (what I was referring to in my earlier comment). – Frank Oct 18 '16 at 01:05
  • @Frank - Makes sense, I will do that in the future. Thanks for the reply! – vryb Oct 18 '16 at 14:29

Here's an attempt that can probably be improved on; it relies on a quick if() check per group to decide what sort of result to return:

d_sample[, if(all(is.na(Event))) .SD[1] else .SD[!is.na(Event)], by=ID]
#   ID Time Event
#1:  1   10    NA
#2:  2  110     1
#3:  3  200     1
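
If exactly one row per ID is wanted even when a group has several observed Event values (as the OP clarifies in the comments), the else branch can take just the first match, as in the OP's own timing of this answer:

d_sample[, if (all(is.na(Event))) .SD[1] else .SD[!is.na(Event)][1], by = ID]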

Following @eddi's workaround for subsetting by groups, this becomes...

d_sample[ d_sample[, { 
  w = which(is.na(Event))
  .I[ if (length(w) == .N) 1L else -w ] 
}, by=ID]$V1 ] 
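
For what it's worth, on the small example data the inner call computes the global row numbers to keep for each ID, and $V1 pulls them out for an ordinary row subset (a sketch of the intermediate result):

d_sample[, {
  w = which(is.na(Event))
  .I[ if (length(w) == .N) 1L else -w ]
}, by=ID]
#    ID V1
# 1:  1  1
# 2:  2  4
# 3:  3  5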
thelatemail
  • Thank you for the prompt reply! Is there a way to get around using .SD? While this code looks nice, it unfortunately runs really slowly, and I suspect that .SD is causing some of the issues. – vryb Oct 18 '16 at 00:11
  • @vryb If you make an example that can be scaled up (like as a function of some `n`), it might be easier to pinpoint what's causing the slowdown. – Frank Oct 18 '16 at 00:15
  • Same idea `d_sample[ d_sample[, { w = which(is.na(Event)); .I[ if (length(w) == .N) 1L else -w ] }, by=ID]$V1 ] ` using eddi's trick: http://stackoverflow.com/questions/16573995/subset-by-group-with-data-table/16574176#16574176 – Frank Oct 18 '16 at 00:33