Rbind with new columns and data.table

Question

I need to add many large tables to an existing table, so I use rbind with the excellent package data.table. But some of the later tables have more columns than the original one (which need to be included). Is there an equivalent of rbind.fill for data.table?

library(data.table)

aa <- c(1,2,3)
bb <- c(2,3,4)
cc <- c(3,4,5)

dt.1 <- data.table(cbind(aa, bb))
dt.2 <- data.table(cbind(aa, bb, cc))

dt.11 <- rbind(dt.1, dt.1)  # Works, but not what I need
dt.12 <- rbind(dt.1, dt.2)  # What I need, doesn't work
dt.12 <- rbind.fill(dt.1, dt.2)  # What I need, doesn't work either

I need to start rbinding before I have all tables, so no way to know what future new columns will be called. Missing data can be filled with NA.

will the columns always be in the same order (i.e does column 1 remain column 1 throughout all the tables?) — mnel, Feb 21 '13 at 23:55
Yes, there are about 20 columns that have the same names in all tables, but some tables have a few more columns (with new names) — Chris, Feb 22 '13 at 00:00

Daniel Krizian · Answer 1 · 2016-01-02T00:13:00.453

14

Since v1.9.2, data.table's rbind function gained fill argument. From ?rbind.data.table documentation:

If TRUE fills missing columns with NAs. By default FALSE. When TRUE, use.names has to be TRUE, and all items of the input list has to have non-null column names.

Thus you can do (prior to approx v1.9.6):

data.table::rbind(dt.1, dt.2, fill=TRUE) 
#    aa bb cc
# 1:  1  2 NA
# 2:  2  3 NA
# 3:  3  4 NA
# 4:  1  2  3
# 5:  2  3  4
# 6:  3  4  5

UPDATE for v1.9.6:

This now works directly:

rbind(dt.1, dt.2, fill=TRUE)
#    aa bb cc
# 1:  1  2 NA
# 2:  2  3 NA
# 3:  3  4 NA
# 4:  1  2  3
# 5:  2  3  4
# 6:  3  4  5

edited Jan 02 '16 at 00:13

answered Oct 10 '14 at 07:35

Daniel Krizian

4,586
4
38
75

2

`Error: 'rbind' is not an exported object from 'namespace:data.table'` ? – Jan 01 '16 at 08:37
`rbind` has been now been adapted to dispatch directly to a `data.table` method and works directly, see the edit above and [GitHub](https://github.com/Rdatatable/data.table/search?utf8=%E2%9C%93&q=rbind) – Daniel Krizian Jan 02 '16 at 00:09
@DanielKrizian Groovy! – Hack-R Jan 03 '16 at 02:11
@DanielKrizian I noticed that `rbind` worked by itself with `data.table` loaded but I'm still surprised it would throw that error. Guess I still don't get the namespace usage fully. For instance, `p_load` works as `pacman::p_load()` or just `p_load()` after `require(pacman)`. – Jan 03 '16 at 12:51
That's due to some custom overloading work. Studying [.onLoad on GitHub](https://github.com/Rdatatable/data.table/blob/d607425b44155916415485e263cd830828903ab2/R/onLoad.R) is very instructive. – Daniel Krizian Jan 03 '16 at 17:23

score 5 · Accepted Answer · edited Oct 01 '19 at 02:33

Here is an approach that will update the missing columns in

rbind.missing <- function(A, B) { 

  cols.A <- names(A)
  cols.B <- names(B)

  missing.A <- setdiff(cols.B,cols.A)
  # check and define missing columns in A
  if(length(missing.A) > 0L){
   # .. means "look up one level"
   class.missing.A <- lapply(B[, ..missing.A], class)
   nas.A <- lapply(class.missing.A, as, object = NA)
   A[,c(missing.A) := nas.A]
  }
  # check and define missing columns in B
  missing.B <- setdiff(names(A), cols.B)
  if(length(missing.B) > 0L){
    class.missing.B <- lapply(A[, ..missing.B], class)
    nas.B <- lapply(class.missing.B, as, object = NA)
    B[,c(missing.B) := nas.B]
  }
  # reorder so they are the same
  setcolorder(B, names(A))
  rbind(A, B)

}

rbind.missing(dt.1,dt.2)

##    aa bb cc
## 1:  1  2 NA
## 2:  2  3 NA
## 3:  3  4 NA
## 4:  1  2  3
## 5:  2  3  4
## 6:  3  4  5

This will not be efficient for many, or large data.tables, as it only works two at a time.

score 4 · Answer 3 · edited May 23 '17 at 12:02

4

The answers are awesome, but looks like, there are some functions suggested here such as plyr::rbind.fill and gtools::smartbind which seemed to work perfectly for me.

edited May 23 '17 at 12:02

Community

1
1

answered Oct 09 '15 at 19:02

KarthikS

883
1
11
17

Ricardo Saporta · Answer 4 · 2013-02-22T06:08:20.017

the basic concept is to add missing columns in both directions: from the running master table to the newTable and back the other way.

As @menl pointed out in the comments, simply assigning an NA is a problem, because that will make the whole column of class logical.

One solution is to force all columns of a single type (ie as.numeric(NA)), but that is too restrictive.

Instead, we need to analyze each new column for its class. We can then use as(NA, cc) _(cc being the class) as the vector that we will assign to a new column. We wrap this in an lapply statement on the RHS and use eval(columnName) on the LHS to assign.

We can then wrap this in a function and use S3 methods so that we can simply call

rbindFill(A, B)

Below is the function.

rbindFill.data.table <- function(master, newTable)  {
# Append newTable to master

    # assign to Master
    #-----------------#
      # identify columns missing
      colMisng     <- setdiff(names(newTable), names(master))

      # if there are no columns missing, move on to next part
      if (!identical(colMisng, character(0)))  {
           # identify class of each
            colMisng.cls <- sapply(colMisng, function(x) class(newTable[[x]]))

            # assign to each column value of NA with appropriate class 
            master[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
          }

    # assign to newTable
    #-----------------#
      # identify columns missing
      colMisng     <- setdiff(names(master), names(newTable))

      # if there are no columns missing, move on to next part
      if (!identical(colMisng, character(0)))  {
        # identify class of each
        colMisng.cls <- sapply(colMisng, function(x) class(master[[x]]))

        # assign to each column value of NA with appropriate class 
        newTable[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
      }

    # reorder columns to avoid warning about ordering
    #-----------------#
      colOrdering <- colOrderingByOtherCol(newTable, names(master))
      setcolorder(newTable,  colOrdering)

    # rbind them! 
    #-----------------#
      rbind(master, newTable)
  }

  # implement generic function
  rbindFill <- function(x, y, ...) UseMethod("rbindFill")

Example Usage:

    # Sample Data: 
    #--------------------------------------------------#
    A  <- data.table(a=1:3, b=1:3, c=1:3)
    A2 <- data.table(a=6:9, b=6:9, c=6:9)
    B  <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
    C  <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
    #--------------------------------------------------#

    # Four iterations of calling rbindFill
    master <- rbindFill(A, B)
    master <- rbindFill(master, A2)
    master <- rbindFill(master, C)

    # Results:
    master
    #      a  b c  d  m     n     f
    #  1:  1  1 1 NA NA    NA    NA
    #  2:  2  2 2 NA NA    NA    NA
    #  3:  3  3 3 NA NA    NA    NA
    #  4: NA  1 1  1  A    NA    NA
    #  5: NA  2 2  2  B    NA    NA
    #  6: NA  3 3  3  C    NA    NA
    #  7:  6  6 6 NA NA    NA    NA
    #  8:  7  7 7 NA NA    NA    NA
    #  9:  8  8 8 NA NA    NA    NA
    # 10:  9  9 9 NA NA    NA    NA
    # 11: NA NA 7 NA NA  0.86  TRUE
    # 12: NA NA 8 NA NA -1.15 FALSE
    # 13: NA NA 9 NA NA  1.10  TRUE

You need to assign by reference to the correct class (`NA` is `logical`), otherwise in the example data sets, dt.1[,cc := NA]; rbind(dt.1,dt.2). — mnel, Feb 22 '13 at 04:14
It is a difficult problem to solve while not doing a bit more preprocessing. something like `foo <- lapply(lapply(dt.1, class), as,object=NA); dt.1[,names(foo) := foo]` will work, but obviously this will assign `NA` to all columns, so it is a matter of assigning only those you want. — mnel, Feb 22 '13 at 04:46
I've mocked something up in an answer. Not a complete approach yet, but perhaps a step closer. — mnel, Feb 22 '13 at 05:05
Thanks a lot both of you for your help. I got mnel's solution to work, and realized it is was way above what I would have figured out myself. Thanks! — Chris, Feb 22 '13 at 11:28

score 2 · Answer 5 · edited May 23 '17 at 12:10

Yet another way to insert the missing columns (with the correct type and NAs) is to merge() the first data.table A with an empty data.table A2[0] which has the structure of the second data.table. This saves the possibility to introduce bugs in user functions (I know merge() is more reliable than my own code ;)). Using mnel's tables from above, do something like the code below.

Also, using rbindlist() should be much faster when dealing with data.tables.

Define the tables (same as mnel's code above):

library(data.table)
A  <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B  <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C  <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)

Insert the missing variables in table A: (note the use of A2[0]

A <- merge(x=A, y=A2[0], by=intersect(names(A),names(A2)), all=TRUE)

Insert the missing columns in table A2:

A2 <- merge(x=A[0], y=A2, by=intersect(names(A),names(A2)), all=TRUE)

Now A and A2 should have the same columns, with the same types. Set the column order to match, just in case (possibly not needed, not sure if rbindlist() binds across column names or column positions):

setcolorder(A2, names(A))
DT.ALL <- rbindlist(l=list(A,A2))
DT.ALL

Repeat for the other tables... Maybe it would be better to put this into a function rather than repeat by hand...

DT.ALL <- merge(x=DT.ALL, y=B[0], by=intersect(names(DT.ALL), names(B)), all=TRUE)
B <- merge(x=DT.ALL[0], y=B, by=intersect(names(DT.ALL), names(B)), all=TRUE)
setcolorder(B, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, B))

DT.ALL <- merge(x=DT.ALL, y=C[0], by=intersect(names(DT.ALL), names(C)), all=TRUE)
C <- merge(x=DT.ALL[0], y=C, by=intersect(names(DT.ALL), names(C)), all=TRUE)
setcolorder(C, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, C))
DT.ALL

The result looks the same as mnels' output (except for the random numbers and the column order).

PS1: The original author does not say what to do if there are matching variables -- do we really want to do a rbind() or are we thinking of a merge()?

PS2: (Since I do not have enough reputation to comment) The gist of the question seems a duplicate of this question. Also important for the benchmarking of data.table vs. plyr with large datasets.

score 0 · Answer 6 · answered Mar 23 '23 at 14:27

dplyr function bind_rows performs it automatically

aa <- c(1,2,3) bb <- c(2,3,4) cc <- c(3,4,5)

dt.1 <- data.table(cbind(aa, bb)) dt.2 <- data.table(cbind(aa, bb, cc))

(dt.11 <- bind_rows(dt.1, dt.2))

aa bb cc
1:  1  2 NA
2:  2  3 NA
3:  3  4 NA
4:  1  2  3
5:  2  3  4
6:  3  4  5

Rbind with new columns and data.table

6 Answers6

Linked

Related