
I am trying to merge two fairly large - but not ridiculously so (360,000 x 4 and 57,000 x 4) - datasets by one common ID. I have tried a regular merge(), merge.data.table(), and sqldf(). Every time I run out of memory (cannot allocate vector of size...). Is there any solution to this? Or is R a bad tool for merging data? head() is given below (I am trying to merge on STUDENT.NAME):

  ID10    STUDENT.NAME   FATHER.NAME MOTHER.NAME
1    1     DEEKSHITH J       JAYANNA      SWARNA
2    4    MANIKANTHA D       DEVARAJ     MANJULA
3    5        NAGESH T   THIMMAIAH N    SHIVAMMA
4    6    NIZAMUDDIN R NOOR MOHAMMED        BIBI
5    7 PRABHU YELLAPPA      YELLAPPA    MALLAMMA
6    8    SADDAM PASHA   NISAR AHMED     ZAREENA
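
For concreteness, the attempted calls presumably looked something like this (df1 and df2 stand in for the two datasets; the exact arguments are an assumption):

library(data.table)

m1 <- merge(df1, df2, by = "STUDENT.NAME", all = TRUE)    # base merge()
m2 <- merge(as.data.table(df1), as.data.table(df2),
            by = "STUDENT.NAME", all = TRUE)              # merge.data.table()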
user702432
  • If you provide the structure of the data (e.g., `head(dataframe)`), maybe you can get a better answer. – kohske Sep 16 '11 at 07:18
  • Please, provide more information about how you're using merge. If you're using it in a wrong way, then you may run out of memory. – Manoel Galdino Sep 16 '11 at 07:20
  • Thanks for posting the output of `head`. What's the format of the other data frame (the same?) What columns do you expect in the result? – NPE Sep 16 '11 at 07:27
  • It's exactly the same, except the row-dimension is different. I should have multiple matches and a bunch of non-matches. – user702432 Sep 16 '11 at 08:24
  • What do you want to happen in the case of multiple matches? And in the case of no matches? – Andrie Sep 16 '11 at 08:57
  • Nothing. I just want the entire merged dataset with multiple matches and [empty] nonmatches. – user702432 Sep 16 '11 at 09:09
  • What the heck are you doing? If I merge 2 dataframes of 300,000 x 4 I use about 0.15 Gb of memory using merge. You efficiently merge two datasets by making sure your workspace is not completely cluttered with leftovers from previous analyses. Or by buying a computer with more than 1Gb of memory. Or updating to the latest R version. Or a combination of all three. – Joris Meys Sep 16 '11 at 10:05
  • Joris, that is only true if there are exact matches. Please read up on the R documentation. The merge() function in R is notoriously memory-hungry. – user702432 Sep 16 '11 at 11:48
  • @user702432 : I have 1 million rows in each, 0.2 million non-matches and about 0.1 million double, triple or quadruple matches. Still, I only use 0.6 Gb of memory for that. So my question is again: what the heck are you doing? PS. I did read the documentation. – Joris Meys Sep 16 '11 at 12:13
  • Please post the results of `str(...)` for each of your `data.frame`s, otherwise we're not going to be able to help you. – Andrie Sep 16 '11 at 12:27

3 Answers


From the nature of your problem, it is bound to be that you're doing a many-to-many merge, where each student occurs many times in each data frame. You might want to check how many times. If a student occurs twice in each data frame, that student alone produces 4 rows; if a student occurs 10 times, the merge adds 100 rows. So first check how many rows you'll get. This is the function I use for that:

count.rows <- function(x, y, v, all = FALSE){
    # Count the rows a merge of x and y on column v would produce.
    tx <- table(x[[v]])                                    # occurrences per key in x
    ty <- table(y[[v]])                                    # occurrences per key in y
    val <- names(tx)[match(names(tx), names(ty), 0L) > 0L] # keys present in both
    cts <- rbind(tx[match(val, names(tx))],
                 ty[match(val, names(ty))])
    colnames(cts) <- val
    # each common key contributes (count in x) * (count in y) rows
    sum(apply(cts, 2, prod, na.rm = all), na.rm = TRUE)
}
count.rows(DF1,DF2,"STUDENT.NAME")
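
For example, with a tiny made-up pair of data frames (hypothetical data, just to show the arithmetic):

x <- data.frame(STUDENT.NAME = c("a", "a", "b"))
y <- data.frame(STUDENT.NAME = c("a", "a", "a", "c"))
count.rows(x, y, "STUDENT.NAME")   # "a" occurs 2 times in x and 3 in y: 2 * 3 = 6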

If you did what you asked me to do (read the R documentation), you'd see that the complexity depends on the length of the result. This is not due to the merge algorithm itself, but to the binding of all the intermediate results together. If you really want a less memory-hungry solution, it is especially that binding you need to get rid of. The following algorithm does that for you. I wrote it out explicitly so you can follow the logic, and it can be optimized. Mind you that it does not give exactly the same result: it copies all columns of both data frames. So you might want to adapt it a little.

mymerge <- function(x, y, v){
    ix <- match(v, names(x))
    iy <- match(v, names(y))

    # sort both key columns so that equal keys form contiguous blocks
    xx <- x[, ix]
    yy <- y[, iy]
    ox <- order(xx)
    oy <- order(yy)
    xx <- xx[ox]
    yy <- yy[oy]

    nx <- length(xx)
    ny <- length(yy)

    # keys occurring in both data frames, and the resulting number of rows
    val <- unique(xx)
    val <- val[match(val, yy, 0L) > 0L]
    cts <- cbind(table(xx)[val], table(yy)[val])
    dimr <- sum(apply(cts, 1, prod), na.rm = TRUE)

    # preallocate the row indices of the result: this avoids the repeated
    # binding of intermediate results that makes merge() memory-hungry
    idx <- vector("numeric", dimr)
    idy <- vector("numeric", dimr)
    # start positions of each key's block in the sorted key vectors
    ndx <- embed(c(which(!duplicated(xx)), nx + 1), 2)[unique(xx) %in% val, , drop = FALSE]
    ndy <- embed(c(which(!duplicated(yy)), ny + 1), 2)[unique(yy) %in% val, , drop = FALSE]

    count <- 1
    for(i in 1:nrow(ndx)){
        lx <- abs(diff(ndx[i, ]))   # block size of this key in x
        ly <- abs(diff(ndy[i, ]))   # block size of this key in y
        ll <- lx * ly               # rows this key contributes

        # all combinations of the matching rows (a small Cartesian product)
        idx[count:(count + ll - 1)] <- rep(ndx[i, 2]:(ndx[i, 1] - 1), ly)
        idy[count:(count + ll - 1)] <- rep(ndy[i, 2]:(ndy[i, 1] - 1), each = lx)
        count <- count + ll
    }
    # build the result in one pass from the index vectors
    x <- x[ox[idx], ]
    names(y) <- paste("y.", names(y), sep = "")
    x[names(y)] <- y[oy[idy], ]
    rownames(x) <- 1:nrow(x)
    x
}

Some testing code so you can see that it works:

DF1 <- data.frame(
    ID = 1:10,
    STUDENT.NAME=letters[1:10],
    SCORE = 1:10
)
id <- c(3,11,4,6,6,12,1,4,7,10,5,3)
DF2 <- data.frame(
    ID = id,
    STUDENT.NAME=letters[id],
    SCORE = 1:12
)

mymerge(DF1,DF2,"STUDENT.NAME")
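
As a quick sanity check, the row count should agree with base merge() and with count.rows() from above:

nrow(mymerge(DF1, DF2, "STUDENT.NAME"))      # 10 rows
nrow(merge(DF1, DF2, by = "STUDENT.NAME"))   # the same 10 rows
count.rows(DF1, DF2, "STUDENT.NAME")         # predicts 10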

Doing the same with two data frames of 0.5 million rows and 4 columns, with up to 10 matches per student name, it returns a data frame with 5.8 million rows and 8 columns, and gives the following picture of memory use:

[Figure: memory use over time for the test run, with the merge() and mymerge() calls highlighted.]

The yellow box is the merge call, the green box is the mymerge call. Memory ranges from 2.3 Gb to 3.74 Gb, so the merge call uses about 1.45 Gb and mymerge a bit over 0.8 Gb. Still no "out of memory" errors... The testing code for this is below:

# 120,000 random four-letter names: DF1 draws from the first 50,000;
# 20,000 names seen only in DF2 are injected further down as non-matches
Names <- sapply(
      replicate(120000,sample(letters,4,TRUE),simplify=FALSE),
      paste,collapse="")

DF1 <- data.frame(
    ID10 = 1:500000,
    STUDENT.NAME = sample(Names[1:50000],500000,TRUE),
    FATHER.NAME = sample(letters,500000,TRUE),
    SCORE1 = rnorm(500000),
    stringsAsFactors=FALSE
)

# resample DF1's names with replacement so most students match several times
id <- sample(500000,replace=TRUE)
DF2 <- data.frame(
    ID20 = DF1$ID10,
    STUDENT.NAME = DF1$STUDENT.NAME[id],
    SCORE = rnorm(500000),
    SCORE2= rnorm(500000),
    stringsAsFactors=FALSE
)
# overwrite 20,000 of DF2's names with ones absent from DF1 (non-matches)
id2 <- sample(500000,20000)
DF2$STUDENT.NAME[id2] <- sample(Names[100001:120000],20000,TRUE)

gc()
system.time(X <- merge(DF1,DF2,"STUDENT.NAME"))
Sys.sleep(1)
gc()
Sys.sleep(1)
rm(X)
gc()
Sys.sleep(3)
system.time(X <- mymerge(DF1,DF2,"STUDENT.NAME"))
Sys.sleep(1)
gc()
rm(X)
gc()
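
If you have no memory profiler at hand, a rough alternative (a sketch, not part of the test above) is to use gc()'s high-water mark:

gc(reset = TRUE)                         # reset the "max used" statistics
X <- mymerge(DF1, DF2, "STUDENT.NAME")
gc()                                     # "max used" shows the peak during the call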
Joris Meys

Have you tried the data.table package? It is more memory-efficient and can be many times faster. But, as others have noted, this question provides no code, so it is possible you are simply using merge incorrectly. A minimal sketch of the keyed data.table approach is below.
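
This sketch assumes df1 and df2 are the two data frames from the question (an assumption, since no code was posted):

library(data.table)

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
setkey(dt1, STUDENT.NAME)   # sort once by the join column
setkey(dt2, STUDENT.NAME)

# dispatches to merge.data.table; all = TRUE keeps non-matches as NA rows
res <- merge(dt1, dt2, by = "STUDENT.NAME", all = TRUE)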

Matt Dowle

I agree with the other commenters who say this question is somewhat lacking in its description (it provides neither the code used nor a complete description of the data), but I also wonder if it hasn't already been answered by one of these links:

R: how to rbind two huge data-frames without running out of memory

A citation offered by @G. Grothendieck (who should probably be given a knighthood for his many contributions to R's functionality), especially the part regarding the use of an external file: http://code.google.com/p/sqldf/#Example_6._File_Input. A sketch of that approach is below.
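
A rough sketch of that file-backed idea, again assuming df1 and df2 are the question's data frames (the key is renamed first, since a "." in a column name is awkward inside SQL):

library(sqldf)

# rename the join column to something SQL-friendly
names(df1)[names(df1) == "STUDENT.NAME"] <- "STUDENT_NAME"
names(df2)[names(df2) == "STUDENT.NAME"] <- "STUDENT_NAME"

# dbname = tempfile() makes sqldf use an on-disk SQLite database instead
# of an in-memory one, so the join itself needs little R memory
res <- sqldf("SELECT * FROM df1 LEFT JOIN df2 USING (STUDENT_NAME)",
             dbname = tempfile())

With identically named columns in both data frames, you may want to spell out the column list instead of SELECT *.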

And one final thought: after saving your work, shutting down your computer, restarting with only R running, and loading only your datasets, try a cbind(... match(...)) maneuver like this:

cbind(df1, df2[match(df1$STUDENT.NAME, df2$STUDENT.NAME), ])

It won't have the same bells and whistles as merge: match returns only the position of the first match in df2 for each name in df1, and non-matches come back as rows of NA. But it should be fairly memory-efficient, and it will succeed if the problem is just fragmented memory in your current session. These are not partial matches; if that was your expectation, you should have indicated as much. Names are notoriously messy when they come from independent sources.

IRTFM