I am trying to create an ffdf data frame by combining two existing ffdf objects row-wise. The two ffdfs have different columns and different numbers of rows. I know that merge() performs only inner and left outer joins on ffdfs, and that ffdfappend() refuses to append when the columns are not identical. Does anyone have a workaround for this, either a function like smartbind() from the gtools package or some other approach?

Converting back with as.data.frame() and using smartbind() is not an option because of the size of the ffdfs.

Any help would be greatly appreciated.

Edit: As suggested, here is a reproducible example:

require(ff)
require(ffbase)

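# stringsAsFactors=TRUE keeps B and E as factors on R >= 4.0 (ff has no character vmode)
df1 <- data.frame(A=1:10, B=LETTERS[1:10], C=rnorm(10), G=1, stringsAsFactors=TRUE)
df2 <- data.frame(A=11:20, D=rnorm(10), E=letters[1:10], G=1, stringsAsFactors=TRUE)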
ffdf1 <- as.ffdf(df1) 
ffdf2 <- as.ffdf(df2)

The desired result should look something like this (produced on the data.frames, if I knew how to produce it on the ffdfs I would not be asking the question):

require(gtools)
dfcombined <- smartbind(df1, df2)
dfcombined
      A    B          C G          D    E
1:1   1    A  1.1556719 1         NA <NA>
1:2   2    B  0.3279260 1         NA <NA>
1:3   3    C  0.4067643 1         NA <NA>
1:4   4    D -0.9144717 1         NA <NA>
1:5   5    E -0.1138263 1         NA <NA>
1:6   6    F  0.8227560 1         NA <NA>
1:7   7    G  0.3394098 1         NA <NA>
1:8   8    H  1.4498439 1         NA <NA>
1:9   9    I -1.3202419 1         NA <NA>
1:10 10    J  0.2099266 1         NA <NA>
2:1  11 <NA>         NA 1 -1.5802636    a
2:2  12 <NA>         NA 1  1.2925790    b
2:3  13 <NA>         NA 1  1.3477483    c
2:4  14 <NA>         NA 1 -1.6760211    d
2:5  15 <NA>         NA 1  0.1456295    e
2:6  16 <NA>         NA 1  0.4726867    f
2:7  17 <NA>         NA 1 -1.5209117    g
2:8  18 <NA>         NA 1  0.3407136    h
2:9  19 <NA>         NA 1  1.3582868    i
2:10 20 <NA>         NA 1 -1.5083929    j

I hope this makes it clearer what I am trying to achieve.

Rkook
  • @RicardoSaporta It's not implemented for `ffbase:::merge.ffdf`. `if ((all.x == TRUE & all.y == TRUE) | (all.y == TRUE & all.x == TRUE)) { stop("merge.ffdf only allows inner joins")`. And this question could use a reproducible example. – Jake Burkhead Jan 27 '14 at 05:05
  • I am posting the following as a comment because I couldn't get it to run on a realistically sized (1E8 rows) ffdf (changing `nrow` resulted in a 'Could not allocate...' error): One trick is to first combine a small part of the two `ffdf` objects using, for example, `smartbind`. Then resize this object to fit both `ffdf1` and `ffdf2`. Copy `ffdf1` into the first part of this object and `ffdf2` into the second part. (here be code example) – Jan van der Laan Jan 27 '14 at 07:32
  • @Jan van der Laan: That sounds like a workable solution, but I cannot see the code example. – Rkook Jan 27 '14 at 08:12
  • @Rkook the code was too long to add to the comment. I have now posted it as an answer, perhaps it does run on your objects. – Jan van der Laan Jan 27 '14 at 08:29

2 Answers

The following answer doesn't seem to work on large ffdf objects (1E8 records). After initially posting part of it as a comment, I decided to post it as an answer, as the code might be a starting point for a working answer.

One trick is to first combine a small part of the two ffdf objects using, for example, smartbind. Then resize this object to fit both ffdf1 and ffdf2. Copy ffdf1 into the first part of this object and ffdf2 into the second part:

require(gtools)
dfcombined <- as.ffdf(smartbind(ffdf1[1,], ffdf2[1,]))

nrow(dfcombined) <- nrow(ffdf1) + nrow(ffdf2)

# insert ffdf1 into dfcombined
cols1a <- names(dfcombined)[names(dfcombined) %in% names(ffdf1)]
cols1b <- names(dfcombined)[!(names(dfcombined) %in% names(ffdf1))]

dfcombined[ri(1, nrow(ffdf1)), cols1a] <- ffdf1
dfcombined[ri(1, nrow(ffdf1)), cols1b] <- NA

# insert ffdf2 into dfcombined
cols2a <- names(dfcombined)[names(dfcombined) %in% names(ffdf2)]
cols2b <- names(dfcombined)[!(names(dfcombined) %in% names(ffdf2))]

dfcombined[ri(nrow(ffdf1)+1, nrow(dfcombined)), cols2a] <- ffdf2
dfcombined[ri(nrow(ffdf1)+1, nrow(dfcombined)), cols2b] <- NA

However, when testing this on realistically sized ffdf objects, the nrow(dfcombined) <- ... line generates an error:

> ffdf1 <- ffdf(
+   a = ffrandom(1E8, rnorm),
+   b = ffrandom(1E8, rnorm)
+ )
> ffdf2 <- ffdf(
+   b = ffrandom(1E8, rnorm),
+   c = ffrandom(1E8, rnorm)
+ )
> dfcombined <- as.ffdf(smartbind(ffdf1[1,], ffdf2[1,]))
> 
> nrow(dfcombined) <- nrow(ffdf1) + nrow(ffdf2)
Error: cannot allocate vector of size 762.9 Mb
Jan van der Laan
  • @Jan van der Laan: I tried this method and it worked on my dataset without the error message you describe. However, the code does not seem to produce exactly the same output as the smartbind function. When running the last line, `dfcombined[ri(nrow(ffdf1)+1, nrow(dfcombined)), cols2b] <- NA`, on the example dataset I get the following warning: – Rkook Jan 27 '14 at 20:40
  • (sorry, pressed add by accident) _Warning message: In ram2ffcode(value, fflev, vmode) : unknown factor values mapped to NA_. I am guessing this is because column D is numeric? The output then has `1.000000` rather than `NA` in rows 1-10 of `dfcombined`. It does not matter for my datasets, but others need to consider this. – Rkook Jan 27 '14 at 20:51

If you are looking for something like rbind.fill, but for ffdf objects, the function below may be what you need. It ran without memory issues on the test example Jan prepared.

require(ff)
require(ffbase)
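smartffdfbind <- function(..., clone=TRUE){
  x <- list(...)
  # Union of the column names of all inputs, in order of first appearance
  columns <- lapply(x, FUN=function(x) colnames(x))
  columns <- do.call(c, columns)
  columns <- unique(columns)
  # Add each column an input lacks as an all-NA logical ff vector
  for(element in seq_along(x)){
    missingcolumns <- setdiff(columns, colnames(x[[element]]))
    for(missingcolumn in missingcolumns){
      x[[element]][[missingcolumn]] <- ff(NA, vmode = "logical", length = nrow(x[[element]]))
    }
  }
  # Start from the first input; clone=TRUE copies it first so that
  # appending works on a copy rather than on the first input itself
  if(clone){
    result <- clone(x[[1]][columns])
  }else{
    result <- x[[1]][columns]
  }
  # Append the remaining inputs one at a time, with columns in the same order
  for (l in tail(x, -1)) {
    result <- ffdfappend(result[columns], l[columns], recode=TRUE)
  }
  result
}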

ffdf1 <- ffdf(a = ffrandom(1E8, rnorm), b = ffrandom(1E8, rnorm))
ffdf2 <- ffdf(b = ffrandom(1E8, rnorm), c = ffrandom(1E8, rnorm))

x <- smartffdfbind(ffdf1, ffdf2)
nrow(x)
[1] 200000000
class(x)
[1] "ffdf"
  • @jwijffels: Thanks a lot for this function. It worked and gave the intended results, even though it was not very fast on my data (I have a lot of columns). I think such a function would be a good addition to the package, seeing that to date only left outer and inner joins, and appending with identical columns, are possible. – Rkook Jan 28 '14 at 04:20
  • ok, added it as a feature request at https://github.com/edwindj/ffbase/issues/33. This will probably end up in ffbase soon - probably with another function name. –  Jan 28 '14 at 08:51
  • FYI. Added at https://github.com/edwindj/ffbase with function name ffdfrbind.fill –  Jan 29 '14 at 16:44
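
For readers on an ffbase version that already includes the function named in the last comment, a minimal usage sketch might look like the following. It assumes that ffdfrbind.fill is exported by the installed ffbase and accepts the ffdf objects as positional arguments; the small numeric-only data frames are made up for illustration and are not the data from the question.

```r
library(ff)
library(ffbase)

# Two small illustrative ffdfs that share columns A and G only
# (numeric columns only, to keep the sketch free of factor handling)
ffdf1 <- as.ffdf(data.frame(A = 1:10,  C = rnorm(10), G = 1))
ffdf2 <- as.ffdf(data.frame(A = 11:20, D = rnorm(10), G = 1))

# Assumption: ffdfrbind.fill() stacks the inputs row-wise and fills
# columns that are missing from an input with NA
combined <- ffdfrbind.fill(ffdf1, ffdf2)

dim(combined)    # rows of both inputs, union of their columns
names(combined)
combined[1:3, ]  # indexing an ffdf returns an ordinary data.frame
```

Checking a few rows from each half of the result, as in the last line, is a cheap way to confirm that the columns missing from one input really come back as NA before running this on the full-size objects.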