I am trying to combine the results of 5 models into a single dataframe for presentation purposes using dplyr
and left_join
. Each model result exists in its own dataframe (dat1 through dat5 for demonstration purposes).
*This is the result of a home-brewed likelihood function, so no summary methods available to exploit through things like mtable
in the memisc
package or the options available in stargazer
.
label1 <- paste0("var", 1:10)
beta1 <- 1:10
se1 <- 1:10*.01
p1 <- 1:10*.005
dat1 <- data.frame(label = label1
,beta = beta1
,se = se1
,p = p1)
label2 <- paste0("var", 1:4)
beta2 <- 1:4
se2 <- 1:4*.01
p2 <- 1:4*.005
dat2 <- data.frame(label = label2
,beta = beta2
,se = se2
,p = p2)
label3 <- paste0("var", 1:3)
beta3 <- 1:3
se3 <- 1:3*.01
p3 <- 1:3*.005
dat3 <- data.frame(label = label3
,beta = beta3
,se = se3
,p = p3)
label4 <- paste0("var", 1:2)
beta4 <- 1:2
se4 <- 1:2*.01
p4 <- 1:2*.005
dat4 <- data.frame(label = label4
,beta = beta4
,se = se4
,p = p4)
label5 <- paste0("var", 1)
beta5 <- 1
se5 <- 1*.01
p5 <- 1*.005
dat5 <- data.frame(label = label5
,beta = beta5
,se = se5
,p = p5)
In regular SQL, I would expect a LEFT JOIN function to behave as it does in sqldf
as shown below.
sqldf(
"
select *
from dat1
left join dat2
on dat1.label = dat2.label
left join dat3
on dat1.label = dat3.label
left join dat4
on dat1.label = dat4.label
"
)
#label beta se p label beta se p label beta se p label beta se p
#1 var1 1 0.01 0.005 var1 1 0.01 0.005 var1 1 0.01 0.005 var1 1 0.01 0.005
#2 var2 2 0.02 0.010 var2 2 0.02 0.010 var2 2 0.02 0.010 var2 2 0.02 0.010
#3 var3 3 0.03 0.015 var3 3 0.03 0.015 var3 3 0.03 0.015 <NA> NA NA NA
#4 var4 4 0.04 0.020 var4 4 0.04 0.020 <NA> NA NA NA <NA> NA NA NA
#5 var5 5 0.05 0.025 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#6 var6 6 0.06 0.030 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#7 var7 7 0.07 0.035 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#8 var8 8 0.08 0.040 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#9 var9 9 0.09 0.045 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#10 var10 10 0.10 0.050 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
Ignoring the repeated column headers from regular SQL, I can replicate the same thing in dplyr
as shown below.
dat1 %>% left_join(dat2, by = c("label" = "label")) %>%
left_join(dat3, by = c("label" = "label")) %>%
left_join(dat4, by = c("label" = "label"))
#label beta.x se.x p.x beta.y se.y p.y beta.x se.x p.x beta.y se.y p.y
#1 var1 1 0.01 0.005 1 0.01 0.005 1 0.01 0.005 1 0.01 0.005
#2 var2 2 0.02 0.010 2 0.02 0.010 2 0.02 0.010 2 0.02 0.010
#3 var3 3 0.03 0.015 3 0.03 0.015 3 0.03 0.015 NA NA NA
#4 var4 4 0.04 0.020 4 0.04 0.020 NA NA NA NA NA NA
#5 var5 5 0.05 0.025 NA NA NA NA NA NA NA NA NA
#6 var6 6 0.06 0.030 NA NA NA NA NA NA NA NA NA
#7 var7 7 0.07 0.035 NA NA NA NA NA NA NA NA NA
#8 var8 8 0.08 0.040 NA NA NA NA NA NA NA NA NA
#9 var9 9 0.09 0.045 NA NA NA NA NA NA NA NA NA
#10 var10 10 0.10 0.050 NA NA NA NA NA NA NA NA NA
In regular SQL, I can add a 5th table to the mix and get the expected result.
sqldf(
"
select *
from dat1
left join dat2
on dat1.label = dat2.label
left join dat3
on dat1.label = dat3.label
left join dat4
on dat1.label = dat4.label
left join dat5
on dat1.label = dat5.label
"
)
#label beta se p label beta se p label beta se p label beta se p label beta se p
#1 var1 1 0.01 0.005 var1 1 0.01 0.005 var1 1 0.01 0.005 var1 1 0.01 0.005 var1 1 0.01 0.005
#2 var2 2 0.02 0.010 var2 2 0.02 0.010 var2 2 0.02 0.010 var2 2 0.02 0.010 <NA> NA NA NA
#3 var3 3 0.03 0.015 var3 3 0.03 0.015 var3 3 0.03 0.015 <NA> NA NA NA <NA> NA NA NA
#4 var4 4 0.04 0.020 var4 4 0.04 0.020 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#5 var5 5 0.05 0.025 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#6 var6 6 0.06 0.030 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#7 var7 7 0.07 0.035 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#8 var8 8 0.08 0.040 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#9 var9 9 0.09 0.045 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#10 var10 10 0.10 0.050 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
In dplyr
, although I appear to join dat5
, I end up dropping dat3
and dat4
, and repeating the results of dat1
and dat2
.
dat1 %>% left_join(dat2, by = c("label" = "label")) %>%
left_join(dat3, by = c("label" = "label")) %>%
left_join(dat4, by = c("label" = "label")) %>%
left_join(dat5, by = c("label" = "label"))
#label beta.x se.x p.x beta.y se.y p.y beta.x se.x p.x beta.y se.y p.y beta se p
#1 var1 1 0.01 0.005 1 0.01 0.005 1 0.01 0.005 1 0.01 0.005 1 0.01 0.005
#2 var2 2 0.02 0.010 2 0.02 0.010 2 0.02 0.010 2 0.02 0.010 NA NA NA
#3 var3 3 0.03 0.015 3 0.03 0.015 3 0.03 0.015 3 0.03 0.015 NA NA NA
#4 var4 4 0.04 0.020 4 0.04 0.020 4 0.04 0.020 4 0.04 0.020 NA NA NA
#5 var5 5 0.05 0.025 NA NA NA 5 0.05 0.025 NA NA NA NA NA NA
#6 var6 6 0.06 0.030 NA NA NA 6 0.06 0.030 NA NA NA NA NA NA
#7 var7 7 0.07 0.035 NA NA NA 7 0.07 0.035 NA NA NA NA NA NA
#8 var8 8 0.08 0.040 NA NA NA 8 0.08 0.040 NA NA NA NA NA NA
#9 var9 9 0.09 0.045 NA NA NA 9 0.09 0.045 NA NA NA NA NA NA
#10 var10 10 0.10 0.050 NA NA NA 10 0.10 0.050 NA NA NA NA NA NA
Am I porting the join to dat5
in dplyr
properly?
Is it possible to execute this many joins in dplyr
?
EDIT1: I believe this is distinct from (How to perform multiple left joins using dplyr in R). Reduce
appeared to solve the problem outlined there.
In my case, Reduce
produces the same result shown in my last code chunk above.
EDIT2: To be clear, I am not concerned with the multiple left-join syntax. I am trying to determine why greater than 4 joins do not behave as they do in "regular" SQL.
EDIT 3: While I had initially accepted an answer from @akrun below, I realized now that the following output:
lst <- lapply(mget(paste0("dat", 1:5)), transform, label2 = label)
suppressWarnings( Reduce(function(...) left_join(..., by = "label"), lst))
#label beta.x se.x p.x label2.x beta.y se.y p.y label2.y beta.x se.x p.x label2.x beta.y se.y p.y label2.y beta se p label2
#1 var1 1 0.01 0.005 var1 1 0.01 0.005 var1 1 0.01 0.005 var1 1 0.01 0.005 var1 1 0.01 0.005 var1
#2 var2 2 0.02 0.010 var2 2 0.02 0.010 var2 2 0.02 0.010 var2 2 0.02 0.010 var2 NA NA NA <NA>
# 3 var3 3 0.03 0.015 var3 3 0.03 0.015 var3 3 0.03 0.015 var3 3 0.03 0.015 var3 NA NA NA <NA>
# 4 var4 4 0.04 0.020 var4 4 0.04 0.020 var4 4 0.04 0.020 var4 4 0.04 0.020 var4 NA NA NA <NA>
# 5 var5 5 0.05 0.025 var5 NA NA NA <NA> 5 0.05 0.025 var5 NA NA NA <NA> NA NA NA <NA>
# 6 var6 6 0.06 0.030 var6 NA NA NA <NA> 6 0.06 0.030 var6 NA NA NA <NA> NA NA NA <NA>
# 7 var7 7 0.07 0.035 var7 NA NA NA <NA> 7 0.07 0.035 var7 NA NA NA <NA> NA NA NA <NA>
# 8 var8 8 0.08 0.040 var8 NA NA NA <NA> 8 0.08 0.040 var8 NA NA NA <NA> NA NA NA <NA>
# 9 var9 9 0.09 0.045 var9 NA NA NA <NA> 9 0.09 0.045 var9 NA NA NA <NA> NA NA NA <NA>
# 10 var10 10 0.10 0.050 var10 NA NA NA <NA> 10 0.10 0.050 var10 NA NA NA <NA> NA NA NA <NA>
Is still distinct from
sqldf(
"
select *
from dat1
left join dat2
on dat1.label = dat2.label
left join dat3
on dat1.label = dat3.label
left join dat4
on dat1.label = dat4.label
left join dat5
on dat1.label = dat5.label
"
)
#label beta se p label beta se p label beta se p label beta se p label beta se p
#1 var1 1 0.01 0.005 var1 1 0.01 0.005 var1 1 0.01 0.005 var1 1 0.01 0.005 var1 1 0.01 0.005
#2 var2 2 0.02 0.010 var2 2 0.02 0.010 var2 2 0.02 0.010 var2 2 0.02 0.010 <NA> NA NA NA
#3 var3 3 0.03 0.015 var3 3 0.03 0.015 var3 3 0.03 0.015 <NA> NA NA NA <NA> NA NA NA
#4 var4 4 0.04 0.020 var4 4 0.04 0.020 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#5 var5 5 0.05 0.025 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#6 var6 6 0.06 0.030 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#7 var7 7 0.07 0.035 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#8 var8 8 0.08 0.040 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#9 var9 9 0.09 0.045 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
#10 var10 10 0.10 0.050 <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA <NA> NA NA NA
I am still losing dat3
and dat4
...
EDIT 4: Not sure why, but my confusion below may have been partly a version issue. @akrun's final answer works on dplyr_0.4.3
and R version 3.3.0
(apparently Linux (Ubuntu) and PC).
EDIT 5: I think @akrun and I are both running the dev version of dplyr
(which is why I first had problems running @akrun's solution on my windows machine which is using production dplyr
.) As noted here this problem has been addressed in the most recent dev version of dplyr
.