1

I'm running into an issue when doin grouping and which.max with R data.table, and I'm not sure whether it's a bug, or I'm not understanding the group-by structures in data.table correctly. I have a work-around, I'm just trying to understand why my initial try failed.

I'm looking at a table containing time series, and I want to get either (a) the time an event of interest occurred, or (b) the final time stamp in the time series. The column marking events is "NA" if an event did not occur, and "1" if it did.

Here's a minimal example to reproduce the issue:

dt <- data.table(t = seq(9), event = c(NA, NA, NA, NA, 1, NA, 1, NA, NA), t_id = c(rep('A', 3), rep('B', 3), rep('C', 3)))
dt[, ifelse(is.null(which.max(event)), max(t), t[which.max(event)]), by=t_id]

This returns

t_id    V1
   A    NA
   B    5
   C    7

Where the value for group "A" is NA (I would naively expect it to be 3). If I run this without the ifelse function

dt[, t[which.max(event)], by=t_id]

the row for "A" is simply missing (which.max returns NULL). But if I run

dt[, is.null(which.max(event)), by=t_id]

I get

t_id    V1
   A    FALSE
   B    FALSE
   C    FALSE

What am I missing?

snooze_bear
  • 589
  • 6
  • 14
  • 1
    So what exactly are you expecting to get from `is.null(which.max(event))`? Did you ever see `which.max` returns a `NULL`? If I were you, I would first try to debug from inside out. For instance, it worth checking what you get from `is.null(which.max(NA))`. One way to tackle this, could be `dt[, if(all(is.na(event))) max(t) else t[which.max(event)], by = t_id]` – David Arenburg Apr 30 '18 at 17:44
  • It's not NULL, but it's not anything else I can identify--the class is "integer" and I don't see a way to recover the value, moreover is.nan(which.max(NA)) and is.na(which.max(NA)) return logicals with no value. – snooze_bear Apr 30 '18 at 17:50
  • 1
    You can use `all(is.na(event))`. And, at all cost, try avoiding `ifelse` when it isn't necessary (like in this case) – David Arenburg Apr 30 '18 at 17:52
  • Generally, when you want to select one row per group, you can sort and then use `unique`, I guess in this case... `unique(dt[order(t_id, !is.na(event), t)], by="t_id", fromLast=TRUE)`? Btw, you should just record event occurrence using TRUE/FALSE, not the nonstandard 1/NA, since most of R syntax plays nicely with the former. – Frank Apr 30 '18 at 19:12
  • In my application the NAs come from an outer join with another table, so possible values are 0, 1, NA, that's where that came from. – snooze_bear May 02 '18 at 19:46
  • 1
    @Frank, I that actually gives a cleaner solution than my workaround, which was still using ifelse. My application has both NA and 0 to indicate no event, that is why I was using max. – snooze_bear May 02 '18 at 20:03

1 Answers1

0

Would this work:

library(data.table)
dt <- data.table(t = seq(9), 
                 event = c(NA, NA, NA, NA, 1, NA, 1, NA, NA), 
                 t_id = c(rep('A', 3), rep('B', 3), rep('C', 3)))
dt[, ifelse(length((na.omit(event)))==0, which.max(t), t[which.max(event)] ), by=t_id]


> dt[, ifelse(length((na.omit(event)))==0, which.max(t), t[which.max(event)] ), by=t_id]
   t_id V1
1:    A  3
2:    B  5
3:    C  7

The problem with your approach is that which.max() does not return a NULL object:

> is.null(which.max(c(NA,NA,NA)))
[1] FALSE

But if you try length() you get the expected result:

> length(which.max(c(NA,NA,NA))) == 0
[1] TRUE

Also I understand that the event column might contain values other than 1 and NA. Like:

dt <- data.table(t = seq(9), event = c(NA, NA, NA, NA, 1,3, 5, 2, NA, 2, 1, NA, NA), t_id = c(rep('A', 3), rep('B', 6), rep('C', 4)))
dt[, ifelse(length((na.omit(event)))==0, which.max(t), t[which.max(event)] ), by=t_id]

In this case the data table looks like:

> dt
    t event t_id
 1: 1    NA    A
 2: 2    NA    A
 3: 3    NA    A
 4: 4    NA    B
 5: 5     1    B
 6: 6     3    B
 7: 7     5    B
 8: 8     2    B
 9: 9    NA    B
10: 1     2    C
11: 2     1    C
12: 3    NA    C
13: 4    NA    C

And the result:

> dt[, ifelse(length((na.omit(event)))==0, which.max(t), t[which.max(event)] ), by=t_id]
   t_id V1
1:    A  3
2:    B  7
3:    C  1

If the event column can contain only NAs and 1s, than the solution would be much simpler.

Another note (please see the discussion below): For this case there ifelse(na.omit()) might not be the most efficient - a few suggestions can be found in comments.

Katia
  • 3,784
  • 1
  • 14
  • 27
  • Both `ifelse` and `na.omit` are slow, have unexpected behavior and completely unnecessary in this case – David Arenburg May 01 '18 at 21:03
  • The main point of this exercise was to highlight the problem in the original code and explain why it did not work. I disagree that ifelse() is slow. When properly used it is fine. Can this code be further optimized - of course! But the question was why the original code did not work, and I was trying to explain why and how to make it work. – Katia May 01 '18 at 21:17
  • Well, you can disagree as much as you want [but it is a fact](https://stackoverflow.com/questions/16275149/does-ifelse-really-calculate-both-of-its-vectors-every-time-is-it-slow). Also, it doesn't make any sense to run `ifelse` on a scalar, it wasn't designed for it. you can just us `if`. Also, not only it being slow, it also has unpredictable behavior (if one haven't gone through all the details in the docs). And what is this "exercise" you are talking about? If you want to explain the OP what not to do, it is fine, but teaching them other things there shouldn't do instead is not as much. – David Arenburg May 02 '18 at 04:48
  • @DavidArenburg I agree that if() would be more clear function to apply on scalars. As for ifelse() being slow, I would be very interested to see a better approach to use it in the following example without loosing clarity: x <-rnorm(1000), pos <- ifelse(x>0, T, F) . The link you post here explores a specific case of ifelse(is.na()) combination. – Katia May 02 '18 at 10:06
  • Just do `x>0`? And the link I've provided has absolutely nothing to with a *a specific case of `ifelse(is.na())`* – David Arenburg May 02 '18 at 10:07
  • @DavidArenburg. David, it's 6am in the morning and yes, my example was stupid :) I meant more something like if else(x>0, "F","M") – Katia May 02 '18 at 10:12
  • There are many good alternatives to `ifelse`. Especially to the cases when `ifelse` staments become nested. See [here](https://stackoverflow.com/questions/30494520/alternatives-to-nested-ifelse-statements-in-r?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa), for example. For your example, I would still not use `ifelse`. Either use two assign call by subset or something like `c("M", "F")[(x > 0) + 1]`. But this whole discussion gets out of scope. – David Arenburg May 02 '18 at 10:30