1

I try to combine two data.tables in R based on a common ID but varying columns and I also want to drop duplicate ID rows. My approach would be:

dt1                       dt2
ID X1 Y1 Z1               ID X2 Y2 Z2        
1  a1 a2 a3               1  A1 A2 A3
2  b1 b2 b3               2  B1 NA B3
3  c1 c2 NA               3  C1 C2 C3
4  d1 d2 d3               5  E1 E2 E3
6  f1 f2 f3               6  F1 F2 F3 

Using rbind(dt1, dt2, fill = TRUE) gives me:

dt_merged
ID X1 Y1 Z1 X2 Y2 Z2
1  a1 a2 a3 NA NA NA
1  NA NA NA A1 A2 A3
2  b1 b2 b3 NA NA NA
2  NA NA NA B1 NA B3
3  c1 c2 NA NA NA NA
3  NA NA NA C1 C2 C3
4  d1 d2 d3 NA NA NA
5  NA NA NA E1 E2 E3
6  f1 f2 f3 NA NA NA
6  NA NA NA F1 F2 F3

My problem is now that I don´t know how to merge the duplicate row IDs and fill in the NAs with the corresponding data from the duplicate ID rows. My desired output data.table would be:

ID X1 Y1 Z1 X2 Y2 Z2
1  a1 a2 a3 A1 A2 A3
2  b1 b2 b3 B1 NA B3
3  c1 c2 NA C1 C2 C3
4  d1 d2 d3 NA NA NA
5  NA NA NA E1 E2 E3
6  f1 f2 f3 F1 F2 F3

I hope my stated description is good enough to give you an overview of my problem. Any kind of help would be higly appreciated by me and excuse me for my foolish question but data.table wrangling gives me sometimes a very hard time.

Mr.Morgan
  • 31
  • 1
  • 6

1 Answers1

3

Simply do a full join. It is very simple with the dplyr package. (or the data.table package)

library(dplyr)

dt1 <- data.frame("ID" = c(1,2,3,4,6),
              "X1" = c("a1", "b1", "c1", "d1", "f1"),
              "Y1" = c("a2", "b2", "c2", "d2", "f2"),
              "Z1" = c("a3", "b3", NA, "d3", "f3")
              )                      


dt2 <- data.frame("ID" = c(1,2,3,5,6),
              "X2" = c("A1", "B1", "C1", "E1", "F1"),
              "Y2" = c("A2", NA, "C2", "E2", "F2"),
              "Z2" = c("A3", "B3", "C3", "E3", "F3")
              )                      
dt3 <- full_join(x = dt1, y = dt2, by = "ID") %>%
arrange(ID)

dt4 <- merge(dt1, dt2, by = "ID", all = TRUE)

dt3
dt4 

Updated: If you ever need to join more tables (as per OP's comment), just chain them:

dt5 <- data.frame("ID" = c(1,3,4,5,7),
              "X3" = c("A1", "C1", "D1", "E1","G1"),
              "Y3" = c(NA, "C2", "D2", "E2", "G2"),
              "Z3" = c("A3","C3", "D3", "E3", NA)
)                      


dt6 <- full_join(x = dt1, y = dt2, by = "ID") %>%
  full_join( x = ., y = dt5, by = "ID") %>%
  arrange(ID)

dt6
The Lyrist
  • 434
  • 3
  • 13
  • Okay, thank you very much!!! That works just perfect. But let me ask you one quick follow up because I want to learn some more about r data wrangling techniques: What can I do if I have not only 2 tables but for example 15 different tables. "Merge" and "Joins" seem only to take two tables. This would be great to know for the future! – Mr.Morgan Jun 20 '18 at 00:34
  • You're welcome. If you ever need to join more tables, just chain them. I have updated the answer to illustrate how to do that. – The Lyrist Jun 20 '18 at 00:47
  • 2
    @TheLyrist An alternative to chaining that I like is `purrr`'s `reduce`: `purrr::reduce(list(df1, df2, df3, df4, df5), left_join, by = "Var")`. – hmhensen Jun 20 '18 at 00:50
  • @trijamms Thanks! Glad that I am learning something new. – The Lyrist Jun 20 '18 at 00:52
  • yes thank you all. Very good advices!! (but I think there should be a "full_join" not "left_join"). – Mr.Morgan Jun 20 '18 at 01:18