14

I need to add many large tables to an existing table, so I use rbind with the excellent package data.table. But some of the later tables have more columns than the original one (which need to be included). Is there an equivalent of rbind.fill for data.table?

library(data.table)

aa <- c(1,2,3)
bb <- c(2,3,4)
cc <- c(3,4,5)

dt.1 <- data.table(cbind(aa, bb))
dt.2 <- data.table(cbind(aa, bb, cc))

dt.11 <- rbind(dt.1, dt.1)  # Works, but not what I need
dt.12 <- rbind(dt.1, dt.2)  # What I need, doesn't work
dt.12 <- rbind.fill(dt.1, dt.2)  # What I need, doesn't work either

I need to start rbinding before I have all tables, so no way to know what future new columns will be called. Missing data can be filled with NA.

Chris
  • 2,256
  • 1
  • 19
  • 41

6 Answers6

14

Since v1.9.2, data.table's rbind function gained fill argument. From ?rbind.data.table documentation:

If TRUE fills missing columns with NAs. By default FALSE. When TRUE, use.names has to be TRUE, and all items of the input list has to have non-null column names.

Thus you can do (prior to approx v1.9.6):

data.table::rbind(dt.1, dt.2, fill=TRUE) 
#    aa bb cc
# 1:  1  2 NA
# 2:  2  3 NA
# 3:  3  4 NA
# 4:  1  2  3
# 5:  2  3  4
# 6:  3  4  5

UPDATE for v1.9.6:

This now works directly:

rbind(dt.1, dt.2, fill=TRUE)
#    aa bb cc
# 1:  1  2 NA
# 2:  2  3 NA
# 3:  3  4 NA
# 4:  1  2  3
# 5:  2  3  4
# 6:  3  4  5
Daniel Krizian
  • 4,586
  • 4
  • 38
  • 75
  • 2
    `Error: 'rbind' is not an exported object from 'namespace:data.table'` ? –  Jan 01 '16 at 08:37
  • `rbind` has been now been adapted to dispatch directly to a `data.table` method and works directly, see the edit above and [GitHub](https://github.com/Rdatatable/data.table/search?utf8=%E2%9C%93&q=rbind) – Daniel Krizian Jan 02 '16 at 00:09
  • @DanielKrizian Groovy! – Hack-R Jan 03 '16 at 02:11
  • @DanielKrizian I noticed that `rbind` worked by itself with `data.table` loaded but I'm still surprised it would throw that error. Guess I still don't get the namespace usage fully. For instance, `p_load` works as `pacman::p_load()` or just `p_load()` after `require(pacman)`. –  Jan 03 '16 at 12:51
  • That's due to some custom overloading work. Studying [.onLoad on GitHub](https://github.com/Rdatatable/data.table/blob/d607425b44155916415485e263cd830828903ab2/R/onLoad.R) is very instructive. – Daniel Krizian Jan 03 '16 at 17:23
5

Here is an approach that will update the missing columns in

rbind.missing <- function(A, B) { 

  cols.A <- names(A)
  cols.B <- names(B)

  missing.A <- setdiff(cols.B,cols.A)
  # check and define missing columns in A
  if(length(missing.A) > 0L){
   # .. means "look up one level"
   class.missing.A <- lapply(B[, ..missing.A], class)
   nas.A <- lapply(class.missing.A, as, object = NA)
   A[,c(missing.A) := nas.A]
  }
  # check and define missing columns in B
  missing.B <- setdiff(names(A), cols.B)
  if(length(missing.B) > 0L){
    class.missing.B <- lapply(A[, ..missing.B], class)
    nas.B <- lapply(class.missing.B, as, object = NA)
    B[,c(missing.B) := nas.B]
  }
  # reorder so they are the same
  setcolorder(B, names(A))
  rbind(A, B)

}

rbind.missing(dt.1,dt.2)

##    aa bb cc
## 1:  1  2 NA
## 2:  2  3 NA
## 3:  3  4 NA
## 4:  1  2  3
## 5:  2  3  4
## 6:  3  4  5

This will not be efficient for many, or large data.tables, as it only works two at a time.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
mnel
  • 113,303
  • 27
  • 265
  • 254
4

The answers are awesome, but looks like, there are some functions suggested here such as plyr::rbind.fill and gtools::smartbind which seemed to work perfectly for me.

Community
  • 1
  • 1
KarthikS
  • 883
  • 1
  • 11
  • 17
2

the basic concept is to add missing columns in both directions: from the running master table to the newTable and back the other way.

As @menl pointed out in the comments, simply assigning an NA is a problem, because that will make the whole column of class logical.

One solution is to force all columns of a single type (ie as.numeric(NA)), but that is too restrictive.

Instead, we need to analyze each new column for its class. We can then use as(NA, cc) _(cc being the class) as the vector that we will assign to a new column. We wrap this in an lapply statement on the RHS and use eval(columnName) on the LHS to assign.

We can then wrap this in a function and use S3 methods so that we can simply call

rbindFill(A, B)

Below is the function.

rbindFill.data.table <- function(master, newTable)  {
# Append newTable to master

    # assign to Master
    #-----------------#
      # identify columns missing
      colMisng     <- setdiff(names(newTable), names(master))

      # if there are no columns missing, move on to next part
      if (!identical(colMisng, character(0)))  {
           # identify class of each
            colMisng.cls <- sapply(colMisng, function(x) class(newTable[[x]]))

            # assign to each column value of NA with appropriate class 
            master[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
          }

    # assign to newTable
    #-----------------#
      # identify columns missing
      colMisng     <- setdiff(names(master), names(newTable))

      # if there are no columns missing, move on to next part
      if (!identical(colMisng, character(0)))  {
        # identify class of each
        colMisng.cls <- sapply(colMisng, function(x) class(master[[x]]))

        # assign to each column value of NA with appropriate class 
        newTable[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
      }

    # reorder columns to avoid warning about ordering
    #-----------------#
      colOrdering <- colOrderingByOtherCol(newTable, names(master))
      setcolorder(newTable,  colOrdering)

    # rbind them! 
    #-----------------#
      rbind(master, newTable)
  }

  # implement generic function
  rbindFill <- function(x, y, ...) UseMethod("rbindFill")


Example Usage:

    # Sample Data: 
    #--------------------------------------------------#
    A  <- data.table(a=1:3, b=1:3, c=1:3)
    A2 <- data.table(a=6:9, b=6:9, c=6:9)
    B  <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
    C  <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
    #--------------------------------------------------#

    # Four iterations of calling rbindFill
    master <- rbindFill(A, B)
    master <- rbindFill(master, A2)
    master <- rbindFill(master, C)

    # Results:
    master
    #      a  b c  d  m     n     f
    #  1:  1  1 1 NA NA    NA    NA
    #  2:  2  2 2 NA NA    NA    NA
    #  3:  3  3 3 NA NA    NA    NA
    #  4: NA  1 1  1  A    NA    NA
    #  5: NA  2 2  2  B    NA    NA
    #  6: NA  3 3  3  C    NA    NA
    #  7:  6  6 6 NA NA    NA    NA
    #  8:  7  7 7 NA NA    NA    NA
    #  9:  8  8 8 NA NA    NA    NA
    # 10:  9  9 9 NA NA    NA    NA
    # 11: NA NA 7 NA NA  0.86  TRUE
    # 12: NA NA 8 NA NA -1.15 FALSE
    # 13: NA NA 9 NA NA  1.10  TRUE
Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
  • You need to assign by reference to the correct class (`NA` is `logical`), otherwise in the example data sets, dt.1[,cc := NA]; rbind(dt.1,dt.2). – mnel Feb 22 '13 at 04:14
  • It is a difficult problem to solve while not doing a bit more preprocessing. something like `foo <- lapply(lapply(dt.1, class), as,object=NA); dt.1[,names(foo) := foo]` will work, but obviously this will assign `NA` to all columns, so it is a matter of assigning only those you want. – mnel Feb 22 '13 at 04:46
  • I've mocked something up in an answer. Not a complete approach yet, but perhaps a step closer. – mnel Feb 22 '13 at 05:05
  • Thanks a lot both of you for your help. I got mnel's solution to work, and realized it is was way above what I would have figured out myself. Thanks! – Chris Feb 22 '13 at 11:28
2

Yet another way to insert the missing columns (with the correct type and NAs) is to merge() the first data.table A with an empty data.table A2[0] which has the structure of the second data.table. This saves the possibility to introduce bugs in user functions (I know merge() is more reliable than my own code ;)). Using mnel's tables from above, do something like the code below.

Also, using rbindlist() should be much faster when dealing with data.tables.

Define the tables (same as mnel's code above):

library(data.table)
A  <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B  <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C  <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)

Insert the missing variables in table A: (note the use of A2[0]

A <- merge(x=A, y=A2[0], by=intersect(names(A),names(A2)), all=TRUE)

Insert the missing columns in table A2:

A2 <- merge(x=A[0], y=A2, by=intersect(names(A),names(A2)), all=TRUE)

Now A and A2 should have the same columns, with the same types. Set the column order to match, just in case (possibly not needed, not sure if rbindlist() binds across column names or column positions):

setcolorder(A2, names(A))
DT.ALL <- rbindlist(l=list(A,A2))
DT.ALL

Repeat for the other tables... Maybe it would be better to put this into a function rather than repeat by hand...

DT.ALL <- merge(x=DT.ALL, y=B[0], by=intersect(names(DT.ALL), names(B)), all=TRUE)
B <- merge(x=DT.ALL[0], y=B, by=intersect(names(DT.ALL), names(B)), all=TRUE)
setcolorder(B, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, B))

DT.ALL <- merge(x=DT.ALL, y=C[0], by=intersect(names(DT.ALL), names(C)), all=TRUE)
C <- merge(x=DT.ALL[0], y=C, by=intersect(names(DT.ALL), names(C)), all=TRUE)
setcolorder(C, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, C))
DT.ALL

The result looks the same as mnels' output (except for the random numbers and the column order).

PS1: The original author does not say what to do if there are matching variables -- do we really want to do a rbind() or are we thinking of a merge()?

PS2: (Since I do not have enough reputation to comment) The gist of the question seems a duplicate of this question. Also important for the benchmarking of data.table vs. plyr with large datasets.

Community
  • 1
  • 1
Peter
  • 1,016
  • 9
  • 20
0

dplyr function bind_rows performs it automatically

aa <- c(1,2,3) bb <- c(2,3,4) cc <- c(3,4,5)

dt.1 <- data.table(cbind(aa, bb)) dt.2 <- data.table(cbind(aa, bb, cc))

(dt.11 <- bind_rows(dt.1, dt.2))

aa bb cc
1:  1  2 NA
2:  2  3 NA
3:  3  4 NA
4:  1  2  3
5:  2  3  4
6:  3  4  5