
I just discovered this bug, only to find that some people are calling it a "feature". This makes rbindlist NOT like do.call("rbind",l) as rbind WILL respect column names. Further, there is no mention of this entirely unexpected behavior in the documentation. Is this really intentional?

Code example:

> library(data.table)
> DT1 <- data.table(a=1, b=2)
> DT2 <- data.table(b=3, a=4)
> DT1
   a b
1: 1 2
> DT2
   b a
1: 3 4

I would expect that rbind'ing these would produce columns with a = 1,4 ; b = 2,3. That is what I get with rbind.data.table and rbind.data.frame, though rbind.data.table produces a warning.

> rbind(DT1, DT2)
   a b
1: 1 2
2: 4 3
Warning message:
In data.table::.rbind.data.table(...) :
Argument 2 has names in a different order. Columns will be bound by name for consistency with base. You can drop names (by using an unnamed list) and the columns will then be joined by position, or set use.names=FALSE. Alternatively, explicitly setting use.names to TRUE will remove this warning.
> rbind(as.data.frame(DT1), as.data.frame(DT2))
  a b
1 1 2
2 4 3
> do.call('rbind', list(DT1, DT2))
   a b
1: 1 2
2: 4 3
Warning message:
In data.table::.rbind.data.table(...) :
Argument 2 has names in a different order. Columns will be bound by name for consistency with base. You can drop names (by using an unnamed list) and the columns will then be joined by position, or set use.names=FALSE. Alternatively, explicitly setting use.names to TRUE will remove this warning.

rbindlist, however, is happy to silently corrupt the data:

> rbindlist(list(DT1, DT2))
   a b
1: 1 2
2: 3 4
James
  • Have a look at this [excellent answer](http://stackoverflow.com/a/15673654/1627235). – Sven Hohenstein Feb 06 '14 at 14:32
  • `rbindlist` is optimized for speed. Matching column names would be counterproductive and I hope that the default behaviour won't change. However, feel free to submit a feature request. – Roland Feb 06 '14 at 14:33
  • Sven, I link to that in my post. It doesn't seem particularly authoritative to me. Roland, speed is useless if you are going around corrupting data. Silently at that. Further, what is the point of using a data structure with named columns if the names aren't respected? – James Feb 06 '14 at 15:23
  • `dplyr::rbind_all` should be both fast and safe. Haven't formally benchmarked against `data.table::rbindlist` though. – hadley Feb 06 '14 at 15:28
  • Thanks hadley, I've seen dplyr pop up in a few places. I haven't had a chance to play with it, but if using your package doesn't waste a month of my compute time, you may soon have a new user. – James Feb 06 '14 at 15:47
  • @CarlWitthoft that exact link is in the first sentence of my post. Obviously I've read it and found that it doesn't explain the behavior I'm seeing. – James Feb 06 '14 at 15:50
  • @Roland, I love this feature as well. Maybe an argument `match.names=TRUE` might be nice to have? – Arun Feb 06 '14 at 16:32
  • @James, yes this seems to have been missed out of the documentation. Will edit. Thanks. – Arun Feb 06 '14 at 16:34
  • @hadley, `rbind_all` by default (silently) fills columns. That is, if the two data.frames had names `a,b` and `a,c`, `rbind_all` would result in `a,b,c`. – Arun Feb 06 '14 at 16:37
  • Reasons for this being the default behavior are well articulated in that link. That said, having more options would be great and I think I've seen versions of this FR (letting `rbindlist` have similar options as `rbind` - like `fill` or `use.names`, but not necessarily same defaults) floating around, but you can certainly add another one if you don't see it in the list of FRs. – eddi Feb 06 '14 at 16:40
  • Sorry if this is so simple that it isn't worth mentioning but just do `setcolorder(DT2,colnames(DT1))` before `rbindlist(list(DT1,DT2))` – Dean MacGregor Feb 06 '14 at 17:41
  • @DeanMacGregor, yes, I've gone through and changed all my code to do: `lapply(list_of_DTs, function(x) setcolorder(x, names(list_of_DTs[[1]])))` before every call to rbindlist. However, it still doesn't make sense that this isn't the behavior. The point of working with data.frame/data.table is that you have NAMED columns of equal length. If you operate on them as if the names don't exist, then the behavior is objectively wrong from the perspective that the names are meaningful. Might as well just drop the names and only operate on indexes ever. Anything else is misleading. – James Feb 06 '14 at 19:01
  • Reopened (didn't realise the gold-badge privilege also works for that) to include updates to `rbindlist` and some benchmarks. – Arun May 20 '14 at 11:49
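
The reordering workaround discussed in the comments above can be sketched roughly as follows. This is only an illustration, assuming every table in the list has exactly the same set of column names; `list_of_DTs`, `target` and `aligned` are made-up names, and `copy()` is used so the input tables are left untouched.

```r
library(data.table)

# Hypothetical input: same columns, different order
list_of_DTs <- list(data.table(a = 1, b = 2),
                    data.table(b = 3, a = 4))

# Use the first table's column order as the target order
target <- names(list_of_DTs[[1]])

# Reorder a copy of each table; setcolorder() returns the reordered table
aligned <- lapply(list_of_DTs, function(x) setcolorder(copy(x), target))

# Binding by position is now safe because the orders agree
res <- rbindlist(aligned, use.names = FALSE)
```

Here `res$a` should hold the values that were stored under `a` in both inputs, regardless of their original column order.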

1 Answer


This feature is now implemented in commit 1266 of v1.9.3. From NEWS:

o  'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented 
   entirely in C. Closes #5249    
  -> use.names by default is FALSE for backwards compatibility (doesn't bind by 
     names by default)
  -> rbind(...) now just calls rbindlist() internally, except that 'use.names' 
     is TRUE by default, for compatibility with base (and backwards compatibility).
  -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
  -> At least one item of the input list has to have non-null column names.
  -> Duplicate columns are bound in the order of occurrence, like base.
  -> Attributes that might exist in individual items would be lost in the bound result.
  -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
  -> And incredibly fast ;).
  -> Documentation updated in much detail. Closes DR #5158.

With this, you can set use.names=TRUE to bind by names. It's set to FALSE by default for backwards compatibility. Alternatively, you can use rbind(...), where use.names is TRUE by default, again for backwards compatibility.
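
Applied to the tables from the question, the difference between the two settings looks roughly like this. It is a small sketch that passes use.names explicitly in both calls, so it does not depend on whichever default the installed data.table version uses; `by_position` and `by_name` are just illustrative names.

```r
library(data.table)

DT1 <- data.table(a = 1, b = 2)
DT2 <- data.table(b = 3, a = 4)

# Bind by position: column names are taken from the first item,
# so DT2's 'b' value lands in column 'a'
by_position <- rbindlist(list(DT1, DT2), use.names = FALSE)

# Bind by name: DT2's columns are matched to DT1's names
by_name <- rbindlist(list(DT1, DT2), use.names = TRUE)

by_position$a  # 1 3
by_name$a      # 1 4
```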

See this post for more examples and this post for benchmarks.

Examples:

1) Just set use.names=TRUE

DT1 <- data.table(x=1, y=2)
DT2 <- data.table(y=1, x=2)

rbindlist(list(DT1,DT2), use.names=TRUE, fill=FALSE)
#    x y
# 1: 1 2
# 2: 2 1

DT1 <- data.table(x=1, y=2)
DT2 <- data.table(z=2, y=1)

# errors when fill=FALSE: the column names differ, so these can't be bound without fill=TRUE
rbindlist(list(DT1, DT2), use.names=TRUE, fill=FALSE)
# Error in rbindlist(list(DT1, DT2), use.names = TRUE, fill = FALSE) : 
    # Answer requires 3 columns whereas one or more item(s) in the input 
    # list has only 2 columns. ...

2) Also binds duplicate column names in the order of occurrence:

DT1 <- data.table(x=1, x=2, y=10, y=20, y=30)
DT2 <- data.table(y=-10, x=-2, y=-20, x=-1, y=-30)

rbindlist(list(DT1,DT2), use.names=TRUE)

#     x  x   y   y   y
# 1:  1  2  10  20  30
# 2: -2 -1 -10 -20 -30

3) use fill=TRUE if you want to bind by names and fill missing columns

DT1 <- data.table(x=1, y=2)
DT2 <- data.table(y=2, z=-1)

rbindlist(list(DT1, DT2), fill=TRUE)
#     x y  z
# 1:  1 2 NA
# 2: NA 2 -1
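
The NEWS item about coercing columns to the highest type can be checked with a small sketch like the one below. The table names (`ints`, `dbls`, `chrs`) are only for illustration.

```r
library(data.table)

ints <- data.table(x = 1L)    # integer column
dbls <- data.table(x = 2.5)   # double column
chrs <- data.table(x = "a")   # character column

# integer + double is bound as double ("numeric")
num_res <- rbindlist(list(ints, dbls))
class(num_res$x)  # "numeric"

# integer + character is bound as character
chr_res <- rbindlist(list(ints, chrs))
class(chr_res$x)  # "character"
```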

HTH

Arun