3

I am trying to combine the results of 5 models into a single dataframe for presentation purposes using dplyr and left_join. Each model result exists in its own dataframe (dat1 through dat5 for demonstration purposes).

*This is the result of a home-brewed likelihood function, so no summary methods available to exploit through things like mtable in the memisc package or the options available in stargazer.

label1 <- paste0("var", 1:10)
beta1 <- 1:10
se1 <- 1:10*.01
p1 <- 1:10*.005

dat1 <- data.frame(label = label1
                   ,beta = beta1
                   ,se = se1
                   ,p = p1)


label2 <- paste0("var", 1:4)
beta2 <- 1:4
se2 <- 1:4*.01
p2 <- 1:4*.005

dat2 <- data.frame(label = label2
                   ,beta = beta2
                   ,se = se2
                   ,p = p2)


label3 <- paste0("var", 1:3)
beta3 <- 1:3
se3 <- 1:3*.01
p3 <- 1:3*.005

dat3 <- data.frame(label = label3
                   ,beta = beta3
                   ,se = se3
                   ,p = p3)

label4 <- paste0("var", 1:2)
beta4 <- 1:2
se4 <- 1:2*.01
p4 <- 1:2*.005

dat4 <- data.frame(label = label4
                   ,beta = beta4
                   ,se = se4
                   ,p = p4)

label5 <- paste0("var", 1)
beta5 <- 1
se5 <- 1*.01
p5 <- 1*.005

dat5 <- data.frame(label = label5
                   ,beta = beta5
                   ,se = se5
                   ,p = p5)

In regular SQL, I would expect a LEFT JOIN function to behave as it does in sqldf as shown below.

sqldf(
"
select * 
from dat1
  left join dat2
    on dat1.label = dat2.label
  left join dat3
    on dat1.label = dat3.label
  left join dat4
    on dat1.label = dat4.label
"
)

#label beta   se     p label beta   se     p label beta   se     p label beta   se     p
#1   var1    1 0.01 0.005  var1    1 0.01 0.005  var1    1 0.01 0.005  var1    1 0.01 0.005
#2   var2    2 0.02 0.010  var2    2 0.02 0.010  var2    2 0.02 0.010  var2    2 0.02 0.010
#3   var3    3 0.03 0.015  var3    3 0.03 0.015  var3    3 0.03 0.015  <NA>   NA   NA    NA
#4   var4    4 0.04 0.020  var4    4 0.04 0.020  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#5   var5    5 0.05 0.025  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#6   var6    6 0.06 0.030  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#7   var7    7 0.07 0.035  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#8   var8    8 0.08 0.040  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#9   var9    9 0.09 0.045  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#10 var10   10 0.10 0.050  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA

Ignoring the repeated column headers from regular SQL, I can replicate the same thing in dplyr as shown below.

dat1 %>% left_join(dat2,  by = c("label" = "label")) %>%
  left_join(dat3,  by = c("label" = "label")) %>%
  left_join(dat4,  by = c("label" = "label"))

#label beta.x se.x   p.x beta.y se.y   p.y beta.x se.x   p.x beta.y se.y   p.y
#1   var1      1 0.01 0.005      1 0.01 0.005      1 0.01 0.005      1 0.01 0.005
#2   var2      2 0.02 0.010      2 0.02 0.010      2 0.02 0.010      2 0.02 0.010
#3   var3      3 0.03 0.015      3 0.03 0.015      3 0.03 0.015     NA   NA    NA
#4   var4      4 0.04 0.020      4 0.04 0.020     NA   NA    NA     NA   NA    NA
#5   var5      5 0.05 0.025     NA   NA    NA     NA   NA    NA     NA   NA    NA
#6   var6      6 0.06 0.030     NA   NA    NA     NA   NA    NA     NA   NA    NA
#7   var7      7 0.07 0.035     NA   NA    NA     NA   NA    NA     NA   NA    NA
#8   var8      8 0.08 0.040     NA   NA    NA     NA   NA    NA     NA   NA    NA
#9   var9      9 0.09 0.045     NA   NA    NA     NA   NA    NA     NA   NA    NA
#10 var10     10 0.10 0.050     NA   NA    NA     NA   NA    NA     NA   NA    NA

In regular SQL, I can add a 5th table to the mix and get the expected result.

sqldf(
  "
select * 
from dat1
  left join dat2
    on dat1.label = dat2.label
  left join dat3
    on dat1.label = dat3.label
  left join dat4
    on dat1.label = dat4.label
  left join dat5
    on dat1.label = dat5.label
"
)

#label beta   se     p label beta   se     p label beta   se     p label beta   se     p label beta   se     p
#1   var1    1 0.01 0.005  var1    1 0.01 0.005  var1    1 0.01 0.005  var1    1 0.01 0.005  var1    1 0.01 0.005
#2   var2    2 0.02 0.010  var2    2 0.02 0.010  var2    2 0.02 0.010  var2    2 0.02 0.010  <NA>   NA   NA    NA
#3   var3    3 0.03 0.015  var3    3 0.03 0.015  var3    3 0.03 0.015  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#4   var4    4 0.04 0.020  var4    4 0.04 0.020  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#5   var5    5 0.05 0.025  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#6   var6    6 0.06 0.030  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#7   var7    7 0.07 0.035  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#8   var8    8 0.08 0.040  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#9   var9    9 0.09 0.045  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#10 var10   10 0.10 0.050  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA

In dplyr, although I appear to join dat5, I end up dropping dat3 and dat4, and repeating the results of dat1 and dat2.

dat1 %>% left_join(dat2,  by = c("label" = "label")) %>%
  left_join(dat3,  by = c("label" = "label")) %>%
  left_join(dat4,  by = c("label" = "label")) %>%
  left_join(dat5,  by = c("label" = "label")) 

#label beta.x se.x   p.x beta.y se.y   p.y beta.x se.x   p.x beta.y se.y   p.y beta   se     p
#1   var1      1 0.01 0.005      1 0.01 0.005      1 0.01 0.005      1 0.01 0.005    1 0.01 0.005
#2   var2      2 0.02 0.010      2 0.02 0.010      2 0.02 0.010      2 0.02 0.010   NA   NA    NA
#3   var3      3 0.03 0.015      3 0.03 0.015      3 0.03 0.015      3 0.03 0.015   NA   NA    NA
#4   var4      4 0.04 0.020      4 0.04 0.020      4 0.04 0.020      4 0.04 0.020   NA   NA    NA
#5   var5      5 0.05 0.025     NA   NA    NA      5 0.05 0.025     NA   NA    NA   NA   NA    NA
#6   var6      6 0.06 0.030     NA   NA    NA      6 0.06 0.030     NA   NA    NA   NA   NA    NA
#7   var7      7 0.07 0.035     NA   NA    NA      7 0.07 0.035     NA   NA    NA   NA   NA    NA
#8   var8      8 0.08 0.040     NA   NA    NA      8 0.08 0.040     NA   NA    NA   NA   NA    NA
#9   var9      9 0.09 0.045     NA   NA    NA      9 0.09 0.045     NA   NA    NA   NA   NA    NA
#10 var10     10 0.10 0.050     NA   NA    NA     10 0.10 0.050     NA   NA    NA   NA   NA    NA

Am I porting the join to dat5 in dplyr properly?

Is it possible to execute this many joins in dplyr?

EDIT1: I believe this is distinct from (How to perform multiple left joins using dplyr in R). Reduce appeared to solve the problem outlined there.

In my case, Reduce produces the same result shown in my last code chunk above.

EDIT2: To be clear, I am not concerned with the multiple left-join syntax. I am trying to determine why greater than 4 joins do not behave as they do in "regular" SQL.

EDIT 3: While I had initially accepted an answer from @akrun below, I realized now that the following output:

lst <- lapply(mget(paste0("dat", 1:5)), transform, label2 = label)
suppressWarnings( Reduce(function(...) left_join(..., by = "label"), lst))

#label beta.x se.x   p.x label2.x beta.y se.y   p.y label2.y beta.x se.x   p.x label2.x beta.y se.y   p.y label2.y beta   se     p label2
#1   var1      1 0.01 0.005     var1      1 0.01 0.005     var1      1 0.01 0.005     var1      1 0.01 0.005     var1    1 0.01 0.005   var1
#2   var2      2 0.02 0.010     var2      2 0.02 0.010     var2      2 0.02 0.010     var2      2 0.02 0.010     var2   NA   NA    NA   <NA>
#  3   var3      3 0.03 0.015     var3      3 0.03 0.015     var3      3 0.03 0.015     var3      3 0.03 0.015     var3   NA   NA    NA   <NA>
#  4   var4      4 0.04 0.020     var4      4 0.04 0.020     var4      4 0.04 0.020     var4      4 0.04 0.020     var4   NA   NA    NA   <NA>
#  5   var5      5 0.05 0.025     var5     NA   NA    NA     <NA>      5 0.05 0.025     var5     NA   NA    NA     <NA>   NA   NA    NA   <NA>
#  6   var6      6 0.06 0.030     var6     NA   NA    NA     <NA>      6 0.06 0.030     var6     NA   NA    NA     <NA>   NA   NA    NA   <NA>
#  7   var7      7 0.07 0.035     var7     NA   NA    NA     <NA>      7 0.07 0.035     var7     NA   NA    NA     <NA>   NA   NA    NA   <NA>
#  8   var8      8 0.08 0.040     var8     NA   NA    NA     <NA>      8 0.08 0.040     var8     NA   NA    NA     <NA>   NA   NA    NA   <NA>
#  9   var9      9 0.09 0.045     var9     NA   NA    NA     <NA>      9 0.09 0.045     var9     NA   NA    NA     <NA>   NA   NA    NA   <NA>
#  10 var10     10 0.10 0.050    var10     NA   NA    NA     <NA>     10 0.10 0.050    var10     NA   NA    NA     <NA>   NA   NA    NA   <NA>

Is still distinct from

sqldf(
  "
select * 
from dat1
  left join dat2
    on dat1.label = dat2.label
  left join dat3
    on dat1.label = dat3.label
  left join dat4
    on dat1.label = dat4.label
  left join dat5
    on dat1.label = dat5.label
"
)

#label beta   se     p label beta   se     p label beta   se     p label beta   se     p label beta   se     p
#1   var1    1 0.01 0.005  var1    1 0.01 0.005  var1    1 0.01 0.005  var1    1 0.01 0.005  var1    1 0.01 0.005
#2   var2    2 0.02 0.010  var2    2 0.02 0.010  var2    2 0.02 0.010  var2    2 0.02 0.010  <NA>   NA   NA    NA
#3   var3    3 0.03 0.015  var3    3 0.03 0.015  var3    3 0.03 0.015  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#4   var4    4 0.04 0.020  var4    4 0.04 0.020  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#5   var5    5 0.05 0.025  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#6   var6    6 0.06 0.030  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#7   var7    7 0.07 0.035  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#8   var8    8 0.08 0.040  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#9   var9    9 0.09 0.045  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#10 var10   10 0.10 0.050  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA

I am still losing dat3 and dat4...

EDIT 4: Not sure why, but my confusion below may have been partly a version issue. @akrun's final answer works on dplyr_0.4.3 and R version 3.3.0 (apparently Linux (Ubuntu) and PC).

EDIT 5: I think @akrun and I are both running the dev version of dplyr (which is why I first had problems running @akrun's solution on my windows machine which is using production dplyr.) As noted here this problem has been addressed in the most recent dev version of dplyr.

Community
  • 1
  • 1
joemienko
  • 2,220
  • 18
  • 27
  • Possible duplicate of [How to perform multiple left joins using dplyr in R](http://stackoverflow.com/questions/32066402/how-to-perform-multiple-left-joins-using-dplyr-in-r) – Steven Beaupré Jun 10 '16 at 16:56
  • Among the sql code blocks, which one is the expected output? – akrun Jun 10 '16 at 17:09
  • @steven-beaupré...See comments to the answer below. I am not interested in how to avoid the multiple join syntax. I want to figure out why 4+ joins are behaving as shown. – joemienko Jun 10 '16 at 17:10
  • @akrun, the second-to-last sql chunk is my desired end-state. – joemienko Jun 10 '16 at 17:10
  • I guess the `Reduce` gets the same output as the first block of `sql` – akrun Jun 10 '16 at 17:10
  • @akrun, indeed it does. It certainly reduces the left join syntax (as noted in the alleged duplicate), but it still does not behave as I would expect. – joemienko Jun 10 '16 at 17:11
  • Can you check the updated output – akrun Jun 10 '16 at 17:15
  • @akrun, that does it! – joemienko Jun 10 '16 at 17:19
  • @akrun, I spoke too soon! your revised answer does give me a second label column, but it still drops `dat3` and `dat4` (that's the most important part - the rest is superfluous). – joemienko Jun 10 '16 at 17:35
  • Seems like the duplicate column names that aren't being joined on are messing up `dplyr`. Maybe a bug, but a workaround might be to append an data frame identifier to all the column names (except label). I think the joins would work well then, and then the column names of the end result could be modified back to whatever you want. – Gregor Thomas Jun 10 '16 at 17:54
  • @gregor, not sure if this is a bug or not, but I agree that the join functions of dplyr are trying to ensure that the resulting dataframe has unique column headings. In general, I think this is desirable, although it might be nice to be able to ignore (i.e. `unique_columns = FALSE`). This could result in behavior that is similar to left joins in vanilla sql. – joemienko Jun 10 '16 at 18:04

1 Answers1

1

We can use Reduce with left_join after placing the datasets in a list.

Reduce(function(...) left_join(..., by = "label"), mget(paste0("dat", 1:5)))

If we need the label columns, we can create one more column

lst <- lapply(mget(paste0("dat", 1:5)), transform, label2 = label)
lst[[1]]["label2"] <- NULL
res1 <- suppressWarnings( Reduce(function(...) left_join(..., by = "label"), lst))
res1
#   label beta.x se.x   p.x beta.y se.y   p.y label2.x beta.x.x se.x.x p.x.x label2.y beta.y.y se.y.y p.y.y label2.x.x beta   se     p
#1   var1      1 0.01 0.005      1 0.01 0.005     var1        1   0.01 0.005     var1        1   0.01 0.005       var1    1 0.01 0.005
#2   var2      2 0.02 0.010      2 0.02 0.010     var2        2   0.02 0.010     var2        2   0.02 0.010       var2   NA   NA    NA
#3   var3      3 0.03 0.015      3 0.03 0.015     var3        3   0.03 0.015     var3       NA     NA    NA       <NA>   NA   NA    NA
#4   var4      4 0.04 0.020      4 0.04 0.020     var4       NA     NA    NA     <NA>       NA     NA    NA       <NA>   NA   NA    NA
#5   var5      5 0.05 0.025     NA   NA    NA     <NA>       NA     NA    NA     <NA>       NA     NA    NA       <NA>   NA   NA    NA
#6   var6      6 0.06 0.030     NA   NA    NA     <NA>       NA     NA    NA     <NA>       NA     NA    NA       <NA>   NA   NA    NA
#7   var7      7 0.07 0.035     NA   NA    NA     <NA>       NA     NA    NA     <NA>       NA     NA    NA       <NA>   NA   NA    NA
#8   var8      8 0.08 0.040     NA   NA    NA     <NA>       NA     NA    NA     <NA>       NA     NA    NA       <NA>   NA   NA    NA
#9   var9      9 0.09 0.045     NA   NA    NA     <NA>       NA     NA    NA     <NA>       NA     NA    NA       <NA>   NA   NA    NA
#10 var10     10 0.10 0.050     NA   NA    NA     <NA>       NA     NA    NA     <NA>       NA     NA    NA       <NA>   NA   NA    NA
#   label2.y.y
#1        var1
#2        <NA>
#3        <NA>
#4        <NA>
#5        <NA>
#6        <NA>
#7        <NA>
#8        <NA>
#9        <NA>
#10       <NA>

Here is the OP's output from the second sqldf code block

res2
#   label beta   se     p label beta   se     p label beta   se     p label beta   se     p label beta   se     p
#1   var1    1 0.01 0.005  var1    1 0.01 0.005  var1    1 0.01 0.005  var1    1 0.01 0.005  var1    1 0.01 0.005
#2   var2    2 0.02 0.010  var2    2 0.02 0.010  var2    2 0.02 0.010  var2    2 0.02 0.010  <NA>   NA   NA    NA
#3   var3    3 0.03 0.015  var3    3 0.03 0.015  var3    3 0.03 0.015  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#4   var4    4 0.04 0.020  var4    4 0.04 0.020  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#5   var5    5 0.05 0.025  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#6   var6    6 0.06 0.030  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#7   var7    7 0.07 0.035  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#8   var8    8 0.08 0.040  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#9   var9    9 0.09 0.045  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA
#10 var10   10 0.10 0.050  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA  <NA>   NA   NA    NA


dim(res1)
#[1] 10 20
dim(res2)
#[1] 10 20
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Reduce produces the same output given in my last codechunk above...it drops dat3 and dat4. – joemienko Jun 10 '16 at 17:05
  • @joemienko I thought you want to avoid multiple `left_join` syntax – akrun Jun 10 '16 at 17:06
  • my apologies. I don't mind the multiple left join syntax. I am trying to determine why the result behaves as it does. – joemienko Jun 10 '16 at 17:08
  • your revised answer does work - it seems like it would make sense for a user to be able to specify the joins explicitly, however. – joemienko Jun 10 '16 at 17:23
  • @joemienko As we are using the 'label' in the `by` column, it is not available to be returned as another column. So we need to create a copy of that column. – akrun Jun 10 '16 at 17:27
  • I spoke too soon! your revised answer does give me a second label column, but it still drops `dat3` and `dat4` (that's the most important part - the rest is superfluous). – joemienko Jun 10 '16 at 17:34
  • @joemienko I looked at the output. It only gives a single extra column as it also provides `label2` for the first dataset. which we can remove or avoid creating. Regarding the `dat3` and `dat4` part, I compared both the `sqldf` (2nd output) with the new result. Couldn't find any noticeable difference. – akrun Jun 10 '16 at 17:38
  • my apologies again, I had wanted to point you to "second-to-last" output. I have edited the post again to make my desired output more clear. – joemienko Jun 10 '16 at 17:47
  • @joemienko The dimensions are the same for the result. So, I am not sure where exactly it is losing. I guess the order of the columns are a bit different if you look at it closely. – akrun Jun 10 '16 at 17:51
  • You are getting the results that I want - I am just unable to reproduce on my end. – joemienko Jun 10 '16 at 17:53
  • @joemienko I am not sure why you are not getting the same results. I use `R 3.3.0` and `dplyr_0.4.3.9001` – akrun Jun 10 '16 at 17:54
  • dplyr_0.4.3 and R version 3.2.5 here - although I can't imagine that the R version would be causing me difficulty...I will try on a different box and see what I come up with – joemienko Jun 10 '16 at 17:59
  • your solution works on my linux box - I'll call that good enough :) – joemienko Jun 10 '16 at 18:01