Splitting a data.table with the by-operator: functions that return numeric values and/or NAs fail

Question

I have a data.table with two columns: an ID column and a value column. I want to split the table by the ID column and run a function foo on the value column. This works fine as long as foo does not return NAs. When it does, I get an error telling me that the types of the groups are not consistent. My assumption is that, since is.logical(NA) is TRUE and is.numeric(NA) is FALSE, data.table internally assumes that I want to combine logical values with numeric ones and returns an error. However, I find this behavior peculiar. Any comments on that? Am I missing something obvious here, or is this indeed intended behavior? If so, a short explanation would be great. (Note that I do know a workaround: just let foo2 return a completely improbable number and filter for that later. However, this seems like bad coding.)

Here is the example:

library(data.table)
foo1 <- function(x) {if (mean(x) < 5) {return(1)} else {return(2)}}
foo2 <- function(x) {if (mean(x) < 5) {return(1)} else {return(NA)}}
DT <- data.table(ID=rep(c("A", "B"), each=5), value=1:10)
DT[, foo1(value), by=ID] #Works perfectly
     ID V1
[1,]  A  1
[2,]  B  2
DT[, foo2(value), by=ID] #Throws error
Error in `[.data.table`(DT, , foo2(value), by = ID) : 
columns of j don't evaluate to consistent types for each group: result for group 2 has column 1 type 'logical' but expecting type 'numeric'
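
The assumption about NA can be checked directly in base R. The sketch below uses base R only (no data.table), and the variable names are purely illustrative. It shows that a bare NA is logical and that c() silently coerces it when it is combined with numbers, which is probably why the type of NA rarely surfaces outside of grouped operations.

```r
# A bare NA is a logical constant.
na_is_logical <- is.logical(NA)    # TRUE
na_is_numeric <- is.numeric(NA)    # FALSE
na_type       <- typeof(NA)        # "logical"

# c() coerces its arguments to a common type, so the logical NA
# becomes a double NA when combined with a double.
combined      <- c(1, NA)
combined_type <- typeof(combined)  # "double"

# The typed constant is already a double, so no coercion is needed.
na_real_type  <- typeof(NA_real_)  # "double"

print(c(na_type = na_type,
        combined_type = combined_type,
        na_real_type = na_real_type))
```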

Josh O'Brien · Accepted Answer · 2011-10-31T23:38:15.483

11

You can fix this by specifying that your function should return an NA_real_, rather than an NA of the default type.

foo2 <- function(x) {if (mean(x) < 5) {return(1)} else {return(NA)}}
DT[, foo2(value), by=ID] #Throws error
# Error in `[.data.table`(DT, , foo2(value), by = ID) : 
# columns of j don't evaluate to consistent types for each group: 
# result for group 2 has column 1 type 'logical' but expecting type 'numeric'

foo3 <- function(x) {if (mean(x) < 5) {return(1)} else {return(NA_real_)}}
DT[, foo3(value), by=ID] #Works
#      ID V1
# [1,]  A  1
# [2,]  B NA
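
If changing the function itself is not convenient, an alternative is to coerce its result inside j, along the lines of the as.double() approach mentioned in the comments. This is a minimal sketch, assuming the same DT and foo2 as in the question; res is just an illustrative name.

```r
library(data.table)

DT <- data.table(ID = rep(c("A", "B"), each = 5), value = 1:10)

# foo2 from the question: returns a double for some groups
# and a logical NA for others.
foo2 <- function(x) {if (mean(x) < 5) {return(1)} else {return(NA)}}

# Coercing in j gives every group a double result, so the
# per-group types agree.
res <- DT[, as.numeric(foo2(value)), by = ID]
print(res)
```

Here the coercion lives at the call site rather than in the function, which may be preferable when foo2 is shared with other code.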

Incidentally, the error message that foo2() produces when it fails is nicely informative: it essentially tells you that your NA is of the wrong type. To fix the problem, you just need to use the NA constant of the right type (or class):

NAs <- list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_)
data.frame(constantName = sapply(NAs, deparse), 
           class        = sapply(NAs, class),
           type         = sapply(NAs, typeof))

#    constantName     class      type
# 1            NA   logical   logical
# 2   NA_integer_   integer   integer
# 3      NA_real_   numeric    double
# 4 NA_character_ character character
# 5   NA_complex_   complex   complex
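
When the result type is not known ahead of time, one option is to derive the missing value from the input instead of hard-coding one of the constants above. The sketch below relies on the base R behavior that indexing an atomic vector with NA_integer_ yields a single NA of that vector's type. typed_na() is a hypothetical helper name, not part of base R or data.table.

```r
# Hypothetical helper: returns a length-1 NA with the same type as x.
typed_na <- function(x) x[NA_integer_]

na_int <- typed_na(1:3)          # integer NA
na_dbl <- typed_na(c(1.5, 2.5))  # double NA
na_chr <- typed_na(c("a", "b"))  # character NA

print(sapply(list(na_int, na_dbl, na_chr), typeof))
```

This only helps when the non-missing branches of the function also return values of the input's type (for example, a function built around unique()). If another branch returns a different type, an explicit coercion is still needed.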

edited Oct 31 '11 at 23:38

answered Oct 31 '11 at 23:11

Josh O'Brien


The more I work with `R`, the more I realize how much stuff I just don't know. This `NA_real_` trick is definitely one of those things. So thanks @Josh O'Brien, great answer. – Christoph_J Oct 31 '11 at 23:33
Thanks. I added a bit more about `NA` constants to my answer since this has often been useful to me, and it's an aspect of `NA` values that is typically invisible to users. Which is just as it should be! – Josh O'Brien Oct 31 '11 at 23:42
What if you don't know the class ahead of time? Has been a big problem for me in data.table when using the `by=` argument. You ever run into a problem like this? – rbatt Nov 11 '15 at 18:54
@rbatt The solution in cases like that (if I correctly understand your question) will be to ensure that the result returned by the function applied to each `by` group is always of the same type. [Here is an example that might help you see what I mean](http://stackoverflow.com/a/12125882/980833). Scroll down in particular to the part where I wrap up the call to `median(X)` (which sometimes produces an `"integer"` and sometimes a `"numeric"` class object) in a call to `as.double()`, ensuring that the results will consistently have class `"numeric"`. – Josh O'Brien Nov 11 '15 at 19:16
@JoshO'Brien Gotcha, thanks. So would you recommend just using a `switch(class(x), double = as.numeric(...), character = as.character(...), ...` type approach? For some functions it's clear that it should be a particular type. I was using a function with `unique` at its core, so the output could be any class. I ended up doing a workaround so I didn't have to specify the type of NA returned (just allowed `unique()` to do that for me). But it still raised the question for me. I could create a new question with more detail. – rbatt Nov 11 '15 at 19:44
@rbatt Yeah, I'd suggest creating a new question with a minimal reproducible example that exhibits the behavior you're talking about. If you do that (perhaps linking to this and/or the other related answer), I'm sure you'll get a good response. – Josh O'Brien Nov 11 '15 at 19:50
@JoshO'Brien I asked a related question [Here](http://stackoverflow.com/q/34091811/2343633) – rbatt Dec 04 '15 at 15:32
