How can I perform full outer joins of large data sets in R?

Question

I am trying to do data analysis in R on a group of medium sized datasets. One of the analyses I need to do requires me to do a full outer join amongst around 24-48 files, each of with has about 60 columns and up to 450,000 lines. So I've been running into memory issues a lot.

I thought at ffbase or sqldf would help, but apparently full outer join is not possible with either of them.

Is there a workaround? A package I haven't found yet?

Additionally, how is it you think that sqldf cannot do full outer joins? sqlite is slightly tricky, in that you have to do the left/right joins and then union them, but sqldf supports other db backends as well... — joran, Jun 06 '13 at 14:38
If you write in sqldf with a outer join statement, it hits a warning that says explicitly outer joins are not supported yet. — Drew75, Jun 06 '13 at 15:04
The message says that RIGHT and FULL outer joins aren't supported. There's one remaining, and it's all you need. Not to mention the fact that you still have other db backends available to you besides SQLite. — joran, Jun 06 '13 at 15:19
But with only 3-4 GB of RAM, I'm suddenly skeptical that you'll have much luck doing outer joins on 25+ tables each with hundreds of thousands of rows, no matter what the method is you use. — joran, Jun 06 '13 at 15:20
The RAM is not necessarily a constraint with a data base as long as you do it out of memory. With sqldf specify `sqldf(..., dbname = tempfile())` . — G. Grothendieck, Jun 06 '13 at 16:18
The tempfile() should help with the memory problem, which is why I want to use sqldf. I'll give it a shot with LEFT OUTER JOIN and see if that's possible. — Drew75, Jun 07 '13 at 07:54
This works with LEFT OUTER JOIN, but I should perhaps post a new question. For a second task I still need FULL OUTER JOIN on the same files... — Drew75, Jun 07 '13 at 08:38

score 8 · Accepted Answer · edited May 23 '17 at 11:54

8

Here is a simple example that illustrates how to do outer joins of several datasets:

library(sqldf)
dat1 <- data.frame(x = 1:5,y = letters[1:5])
dat2 <- data.frame(w = 3:8,z = letters[3:8])
> 
> sqldf("select * from dat1 left outer join dat2 on dat1.x = dat2.w UNION 
+       select * from dat2 left outer join dat1 on dat1.x = dat2.w")
  x y  w    z
1 1 a NA <NA>
2 2 b NA <NA>
3 3 c  3    c
4 4 d  4    d
5 5 e  5    e
6 6 f NA <NA>
7 7 g NA <NA>
8 8 h NA <NA>

There it is, a full outer join using sqldf and SQLite as a backend.

As I also mentioned, sqldf support more back ends than SQLite. A single Google search reveals that full outer joins are accomplished the exact same way in MySQL. I am less familiar with postgres but this question sure suggests that full outer joins are possible there as well.

edited May 23 '17 at 11:54

Community

1
1

answered Jun 08 '13 at 20:01

joran

169,992
32
429
468

I did the above but somehow I am getting twice the no. of rows as opposed to merge(x = dat1, y = dat2, by.x = "x", by.y = "y", all = T). I have no idea why! – Nikhil Vidhani Dec 01 '17 at 15:56
@NikhilVidhani I'm not sure what you mean. That code you wrote returns an error, because `by.y = "y"` should be `by.y = "w"`, in which case the output is the same as above. – joran Dec 01 '17 at 16:08
This works only if there are no duplicate rows in either `dat1` or `dat2`, which is the case here. `UNION` returns unique rows and would remove the duplicates. – Waldi Apr 24 '21 at 16:41
@Waldi An edit to the answer simply adding a note that one can use either UNION or UNION ALL depending on whether one wants to preserve the duplicates would be more helpful than a comment saying the answer "doesn't work", which is technically not true. – joran Apr 27 '21 at 18:12
@Joran, before posting my comment I upvoted your answer because I found it useful and relevant in this case. However, I still think that my comment is helpful because if you use UNION ALL, you'll get the double amount of **common rows** which is probably not what is wanted, and makes it different than FULL OUTER JOIN. – Waldi Apr 27 '21 at 18:24
To demonstrate this, you can just run above code with UNION ALL => 11 rows , but expected is 8 – Waldi Apr 27 '21 at 18:30
@Waldi Either way, I would prefer that you simply improve the answer. If it weren't the accepted answer, frankly, I would just delete it to avoid being bothered. – joran Apr 27 '21 at 18:39

score 4 · Answer 2 · answered Oct 15 '14 at 13:46

4

Without sqldf, here is a smart and simple solution :

merge(a, b, by = "col", all = T)

FX

answered Oct 15 '14 at 13:46

FX Jollois

41
1

score 0 · Answer 3 · edited May 23 '17 at 12:17

If you are using ffbase, you can get to your desired result of a full outer join if you combine expand.ffgrid with merge.ffdf. expand.ffgrid is like expand.grid but works with ff vectors so it will not overblow your RAM and merge.ffdf allows to merge with another ffdf without overblowing your RAM and storing data on disk. An example below.

require(ffbase)
x <- ffseq(1, 10000)
y <- ff(factor(LETTERS))
allcombinations <- expand.ffgrid(x, y)
addme <- data.frame(Var1 = c(1, 2), Var2 = c("A","B"), measure = rnorm(2))
addme <- as.ffdf(addme)
myffdf <- merge(allcombinations, addme, by.x=c("Var1","Var2"), by.y=c("Var1","Var2"),  all.x=TRUE)
myffdf[1:10,]

Next, look at delete rows ff package on how to subset that resulting myffdf.

Do have a look at ?ffbase::expand.ffgrid and ?ffbase::merge.ffdf

score 0 · Answer 4 · answered Nov 29 '13 at 04:27

This might work (note: the key column must be the first column in every dataset).

library(ff)
library(ffbase)

fullouterjoin <- function(ffdf1, ffdf2){

    # do a left outer join
    leftjoin <- merge(ffdf1, ffdf2, by = "key", all.x = TRUE)

    # do a right outer join (it's just a left outer join with the objects swapped)
    rightjoin <- merge(ffdf2, ffdf1, by = "key", all.x = TRUE)

    # swap the column orders (make ffd1 columns first and ffd2 columns later)
    srightjoin <- rightjoin[c(names(ffdf1), names(ffdf2)[2:length(ffdf2)])]

    # stack left outer join on top of the (swapped) right outer join
    stacked <- rbind(leftjoin, srightjoin)

    # remove duplicate rows
    uniques <- unique(stacked)

    # that's it
    return(uniques)
}

usage:

newffdf <- fullouterjoin(some_ffdf, another_ffdf)

I'm not saying it's fast, but it might overcome the memory barrier.

I'm working with much larger data sets. I've tried your method and I get "Error in `row.names<-.data.frame`(`*tmp*`, value = rownam[as.integer(i2)]) : duplicate 'row.names' are not allowed" upon executing the uniques <- unique(stacked) part. Do you have any idea where is this coming from? — Nikolay Nenov, Jul 02 '14 at 12:51

How can I perform full outer joins of large data sets in R?

4 Answers4