Consider this

do.call(rbind, list(data.table(x=1, b='x'),data.table(x=1, b=NA)))

returns

   x  b
1: 1  x
2: 1 NA

but

do.call(rbind, list(data.table(x=1, b=NA),data.table(x=1, b='x')))

returns

   x  b
1: 1 NA
2: 1 NA

How can I force the first behavior without reordering the contents of the list?

data.table is many times faster than data.frame in my MapReduce jobs (data.table is called ~10*3MM times across 55 nodes), so I want this to work. Regards, Saptarshi

Sapsi
  • 1
    I'm guessing this happens because `NA` is logical and `as.logical('x')=NA`, so when `rbind` decides that that column is logical (based on its first argument), it coerces subsequent items to match. `do.call(rbind, list(data.table(x=1, b=as(NA,'character')),data.table(x=1, b='x')))` works... – Frank Aug 27 '13 at 20:56
  • 3
    By the way, there is an "optimized `do.call(rbind,...)`" for data.tables called `rbindlist`. There are a few q's about it on this site, e.g., http://stackoverflow.com/questions/15673550/why-is-rbindlist-better-than-rbind/15673654#15673654 – Frank Aug 27 '13 at 20:57
  • 1
    @Frank -- Very helpful comments. I've added a reference to `rbindlist` to my answer. – Josh O'Brien Aug 27 '13 at 21:40

1 Answer

As noted by Frank, the problem is that there are (somewhat invisibly) several different types of NA. The one produced when you type NA at the command line is of class "logical", but there are also NA_integer_, NA_real_, NA_character_, and NA_complex_.
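To make the different kinds of NA visible, here is a small base R sketch that collects the class of each constant named above. The variable names `na_values` and `na_classes` are only illustrative.

```r
# The five NA constants: plain NA plus the four typed variants
na_values <- list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_)

# Look up the class of each one
na_classes <- vapply(na_values, class, character(1))

print(na_classes)
# [1] "logical"   "integer"   "numeric"   "character" "complex"
```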

In your first example, the initial data.table sets the class of column b to "character", and the NA in the second data.table is then coerced to an NA_character_. In the second example, though, the NA in the first data.table sets column b's class to "logical", and, when the same column in the second data.table is coerced to "logical", it's converted to a logical NA. (Try as.logical("x") to see why.)
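As a quick check of the coercion described here, and of the typed-NA workaround Frank mentions in the comments, the following sketch may help. It assumes you control how the individual data.tables are built, so the missing value can be declared as `NA_character_` up front; `B_typed` and `res` are illustrative names.

```r
library(data.table)

# A string that is not a recognised logical literal becomes NA when coerced
is.na(as.logical("x"))
# [1] TRUE

# Declaring the missing value as character keeps column b character,
# whichever table comes first in the list
B_typed <- list(data.table(x = 1, b = NA_character_),
                data.table(x = 1, b = "x"))
res <- do.call(rbind, B_typed)

class(res$b)
# [1] "character"
```

This avoids the problem at the source, but it only applies when the NA is written out explicitly rather than arriving from upstream data.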

That's all fairly complicated (to articulate, at least), but there is a reasonably simple solution. Just create a 1-row template data.table and prepend it to each list of data.tables you want to rbind(). It establishes the class of each column as what you want, regardless of which data.tables follow it in the list passed to rbind(), and it can be trimmed off once everything else is bound together.

library(data.table)    

## The two lists of data.tables from the OP
A <- list(data.table(x=1, b='x'),data.table(x=1, b=NA))
B <- list(data.table(x=1, b=NA),data.table(x=1, b='x'))

## A 1-row template, used to set the column types (and then removed)
DT <- data.table(x=numeric(1), b=character(1))

## Test it out
do.call(rbind, c(list(DT), A))[-1,]
#    x  b
# 1: 1  x
# 2: 1 NA
do.call(rbind, c(list(DT), B))[-1,]
#    x  b
# 1: 1 NA
# 2: 1  x

## Finally, as _also_ noted by Frank, rbindlist will likely be more efficient
rbindlist(c(list(DT), B))[-1,]
Josh O'Brien
  • 1
    Of course that would presumably slow the `rbind`ing down somewhat in all cases. On the other hand, it might not be too hard to add a second 'colClasses' argument to `rbindlist()`, allowing users to pass in either a character vector of class names or a list with elements of the desired classes. – Josh O'Brien Aug 27 '13 at 23:39
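Until something like a 'colClasses' argument exists, the template trick from the answer can be wrapped in a small helper. This is only a sketch: `rbind_with_template` is a hypothetical name, not part of data.table, and it assumes the template has exactly one row and the same columns as the tables being bound.

```r
library(data.table)

# Hypothetical helper (not a data.table function): bind a list of tables
# behind a 1-row template that fixes the column types, then drop that row.
rbind_with_template <- function(tables, template) {
  stopifnot(nrow(template) == 1L)
  rbindlist(c(list(template), tables))[-1L]
}

# Template and list B as in the answer
template <- data.table(x = numeric(1), b = character(1))
B <- list(data.table(x = 1, b = NA), data.table(x = 1, b = "x"))

out <- rbind_with_template(B, template)
class(out$b)
# [1] "character"
```

The extra row costs a little on every call, as noted above, so it is worth measuring before using this in a tight loop.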