How to Full Join several datasets in dplyr (ncol do not match)

Question

I have three dataframes (d1, d2, d3), where ncol and nrow do not match across datasets.

I want to combine these datasets by common columns, while preserving all unique columns and rows.

I believe this means I need a full join, which I've attempted with the example code below.

However, when I try to join by keys common to any two or more datasets, I get an error message: it appears that the column common to d2 and d3 throws an error, even though other keys common to only two datasets are ok.

How should I join these three datasets?

d1 <- data.frame(Y = "Y1", 
        N=1:3, 
        C= c(1, 3, 2),
        D= c(3, 1, 4),
        E= c(0, 1, 1),
        Q= c(2, 0, 3)) 

d2 <- data.frame(Y = "Y2", 
        N=1:3, 
        E= c(0, 1, 1),
        H= c(3, 1, 4),
        S= c(2, 2, 0)) 


d3 <- data.frame(Y = "Y3", 
        N=1:4, 
        C= c(4, 2, 1, 3),
        E= c(3, 1, 4, 2),
        H= c(1, 3, 2, 1), 
        U= c(3, 3, 1, 1))

join_all <- full_join(d1, d2, d3,
              by = c("Y", "N", "C", "E", "H"))

#Error: Join columns must be present in data.
#x Problem with `H`.

Just for example's sake, this is what I want the join result to look like:

> desired_df <- data.frame(
+   Y= c("Y1", "Y1", "Y1", "Y2", "Y2", "Y2", "Y3", "Y3", "Y3", "Y3"), 
+   N= c(1, 2, 3, 1, 2, 3, 1, 2, 3, 4),
+   C= c(1, 3, 2, "NA", "NA", "NA", 4, 2, 1, 3), 
+   D= c(3, 1, 4, "NA", "NA", "NA", "NA", "NA", "NA", "NA"), 
+   E= c(0, 1, 1, 0, 1, 1, 3, 1, 4, 2),
+   H= c("NA","NA","NA", 3, 1, 4, 1, 3, 2, 1),
+   Q= c(2, 0, 3, "NA", "NA", "NA", "NA", "NA", "NA", "NA"), 
+   S= c("NA", "NA", "NA", 2, 2, 0, "NA", "NA", "NA", "NA"), 
+   U= c("NA", "NA", "NA", "NA", "NA", "NA", 3, 3, 1, 1))

> desired_df
    Y N  C  D E  H  Q  S  U
1  Y1 1  1  3 0 NA  2 NA NA
2  Y1 2  3  1 1 NA  0 NA NA
3  Y1 3  2  4 1 NA  3 NA NA
4  Y2 1 NA NA 0  3 NA  2 NA
5  Y2 2 NA NA 1  1 NA  2 NA
6  Y2 3 NA NA 1  4 NA  0 NA
7  Y3 1  4 NA 3  1 NA NA  3
8  Y3 2  2 NA 1  3 NA NA  3
9  Y3 3  1 NA 4  2 NA NA  1
10 Y3 4  3 NA 2  1 NA NA  1

Finally: once joined, how would I turn the NAs into 0?

Thanks for your help!

score 4 · Accepted Answer · answered May 04 '21 at 05:58

4

You can put the dataframes in a list, perform the join and replace NA with 0.

library(tidyverse)
lst(d1, d2, d3) %>% reduce(full_join) %>% replace(is.na(.), 0)

#    Y N C D E Q H S U
#1  Y1 1 1 3 0 2 0 0 0
#2  Y1 2 3 1 1 0 0 0 0
#3  Y1 3 2 4 1 3 0 0 0
#4  Y2 1 0 0 0 0 3 2 0
#5  Y2 2 0 0 1 0 1 2 0
#6  Y2 3 0 0 1 0 4 0 0
#7  Y3 1 4 0 3 0 1 0 3
#8  Y3 2 2 0 1 0 3 0 3
#9  Y3 3 1 0 4 0 2 0 1
#10 Y3 4 3 0 2 0 1 0 1

In base R :

result <- Reduce(function(x, y) merge(x, y, all = TRUE), list(d1, d2, d3))
result[is.na(result)] <- 0
result

answered May 04 '21 at 05:58

Ronak Shah

377,200
20
156
213

Upvote but the OP wants the `NA`'s, the original data sets have zeros in them. – Rui Barradas May 04 '21 at 06:02
They mentioned at the end `Finally: once joined, how would I turn the NAs into 0?`. So I think they are trying to turn `NA`'s to 0 for the end result? – Ronak Shah May 04 '21 at 06:04
Fantastic, thank you @RonakShah! Such a relief to see it resolved so easily! – slane May 04 '21 at 06:12
Ah! Thank you @RonakShah for the advice on accepting the response! – slane May 04 '21 at 06:51
@RonakShah OK, I hadn't noticed that last sentence, I had only looked at the expected output. – Rui Barradas May 04 '21 at 07:43

TarJae · Answer 2 · 2021-05-04T17:25:33.083

2

Use bind_rows

dplyr::bind_rows(d1, d2, d3)

Output:

    Y N  C  D E  Q  H  S  U
1  Y1 1  1  3 0  2 NA NA NA
2  Y1 2  3  1 1  0 NA NA NA
3  Y1 3  2  4 1  3 NA NA NA
4  Y2 1 NA NA 0 NA  3  2 NA
5  Y2 2 NA NA 1 NA  1  2 NA
6  Y2 3 NA NA 1 NA  4  0 NA
7  Y3 1  4 NA 3 NA  1 NA  3
8  Y3 2  2 NA 1 NA  3 NA  3
9  Y3 3  1 NA 4 NA  2 NA  1
10 Y3 4  3 NA 2 NA  1 NA  1

edited May 04 '21 at 17:25

answered May 04 '21 at 08:33

TarJae

72,363
6
19
66

1

Wow! I wasn't aware of this differentiation between `bind_rows` and `rbind` till date. Thanks for sharing. upvoted. – AnilGoyal May 04 '21 at 08:58

score 1 · Answer 3 · answered May 04 '21 at 16:14

1

We can use rbindlist

library(data.table)
rbindlist(list(d1, d2, d3), fill = TRUE)

answered May 04 '21 at 16:14

akrun

874,273
37
540
662

How to Full Join several datasets in dplyr (ncol do not match)

3 Answers3