Merge different datasets based on condition in R data.table

Question

Suppose I have a major dataset called "main" and three other datasets called "df1", "df2" and "df3".

"main" includes: year, colA, colB, colC
"df1" includes: year, colD
"df2" includes: year, colE
"df3" includes: year, colF

I want to merge colD, colE, colF of df1, df2, df3 to "main" using the year column to rename that merged column to "merged_value".

The logic is like this:

If colA == "Category1", then merge df1 to "main"
If colA == "Category2", then merge df2 to "main"
If colA == "Category3", then merge df3 to "main"

The only way I could think of now is to break the main into 3 subsets and do the merge (shown below), but are there any simpler approach to this?

main_sub1 = main[colA=="Category1"][df1, on="year"]
main_sub2 = main[colA=="Category2"][df2, on="year"]
main_sub3 = main[colA=="Category3"][df3, on="year"]
main_comb = rbind(main_sub1, main_sub2, main_sub3)

Please make a small reproducible example to illustrate the problem. 2 or 3 rows from each of `main`, `df1`, and `df2` should be plenty (choose rows that illustrate the problem well). — Gregor Thomas, Aug 09 '22 at 19:39
I updated my question with my current approach. I think it'll help illustrate the problem better. Thanks — Jiamei, Aug 09 '22 at 21:47
You updated your question with the approach but we have no sample data. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info — r2evans, Aug 10 '22 at 11:46

score 0 · Answer 1 · answered Aug 10 '22 at 11:55

Try adding colA and its "Category#" to each of the respective df#s, merging on year and colA, coalescing the merged_value:

main[df1[, colA := "Category1"], merged_value := i.colD, on = .(year, colA)]
main[df2[, colA := "Category2"], merged_value := fcoalesce(i.colE, merged_value), on = .(year, colA)]
main[df3[, colA := "Category3"], merged_value := fcoalesce(i.colF, merged_value), on = .(year, colA)]
main
#     year      colA  colB  colC merged_value
#    <int>    <char> <int> <int>        <int>
# 1:  2001 Category1     1     1           11
# 2:  2002 Category2     2     2           12
# 3:  2003 Category3     3     3           13
# 4:  2004 Category4     4     4           NA

Data

main <- data.table(year=2001:2004, colA=paste0("Category",1:4), colB=1:4, colC=1:4)
df1 <- data.table(year=2001:2004, colD=11:14)
df2 <- data.table(year=2001:2004, colE=11:14)
df3 <- data.table(year=2001:2004, colF=11:14)

score 0 · Answer 2 · answered Aug 10 '22 at 14:38

Here is another approach, similar to @r2evans (and using @r2evans data), but utilizing a single merge. Basically rowbind df1, df2, and df3, leveraging the id param

setnames(df1,"colD","val")
rbindlist(setNames(list(df1,df2,df3), paste0("Category",1:3)), id="colA")[main, on=.(colA,year)]

Output:

        colA  year   val  colB  colC
      <char> <int> <int> <int> <int>
1: Category1  2001    11     1     1
2: Category2  2002    12     2     2
3: Category3  2003    13     3     3
4: Category4  2004    NA     4     4

Merge different datasets based on condition in R data.table

2 Answers2