R - Merge two datatables and remove duplicates from older file?

Question

I have two databases - old one and update one. Both have same structures, with unique ID. If record changes - there's new record with same ID and new data. So after rbind(m1,m2) I have duplicated records. I can't just remove duplicated ID's, since the data could be updated. There's no way to tell the difference which record is new, beside it being in old file or update file.

How can I merge two tables, and if there's row with duplicated ID, leave the one from newer file?

I know I could add column to both and just ifelse() this, but I'm looking for something more elegant, preferably oneliner.

Please share a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) including a small example of your data used. — Martin Gal, Jul 01 '20 at 12:08

Wimpel · Answer 1 · 2020-07-01T12:28:55.820

hard to give the correct answer without sample data.. but here is an approach that you can adjust to your data..

#sample data
library( data.table )
dt1 <- data.table( id = 2:3, value = c(2,4))
dt2 <- data.table( id = 1:2, value = c(2,6))
#dt1
#    id value
# 1:  2     2
# 2:  3     4
#dt2
#    id value
# 1:  1     2
# 2:  2     6

#rowbind...
DT <- rbindlist( list(dt1,dt2), use.names = TRUE )
#    id value
# 1:  2     2
# 2:  3     4
# 3:  1     2
# 4:  2     6

#deselect duplicated id from the buttom up
# assuming the last file in the list contains the updated values
DT[ !duplicated(id, fromLast = TRUE), ]
#    id value
# 1:  3     4
# 2:  1     2
# 3:  2     6

louish · Accepted Answer · 2020-07-01T12:23:25.050

1

Say you have:

old <- data.frame(id = c(1,2,3,4,5), val = c(21,22,23,24,25))
new <- data.frame(id = c(1,4), val = c(21,27))

so the record with id 4 has changed in the new dataset and 1 is a pure duplicate.

You can use dplyr::anti_join to find old records not in the new dataset and then just use rbind to add the new ones on.

joined <- rbind(anti_join(old,new, by = "id"),new)

edited Jul 01 '20 at 12:23

answered Jul 01 '20 at 12:11

louish

99
5

score 1 · Answer 3 · answered Jul 01 '20 at 12:19

You could use dplyr:

df_new %>%
  full_join(df_old, by="id") %>%
  transmute(id = id, value = coalesce(value.x, value.y))

returns

   id      value
1   1 0.03432355
2   2 0.28396359
3   3 0.01121692
4   4 0.57214035
5   5 0.67337745
6   6 0.67637187
7   7 0.69178855
8   8 0.83953140
9   9 0.55350251
10 10 0.27050363
11 11 0.28181032
12 12 0.84292569

given

df_new <- structure(list(id = 1:10, value = c(0.0343235526233912, 0.283963593421504, 
0.011216921498999, 0.572140350239351, 0.673377452883869, 0.676371874753386, 
0.691788548836485, 0.839531400706619, 0.553502510068938, 0.270503633422777
)), class = "data.frame", row.names = c(NA, -10L))

df_old <- structure(list(id = c(1, 4, 5, 3, 7, 9, 11, 12), value = c(0.111697669373825, 
0.389851713553071, 0.252179590053856, 0.91874519130215, 0.504653975600377, 
0.616259852424264, 0.281810319051147, 0.842925694771111)), class = "data.frame", row.names = c(NA, 
-8L))

R - Merge two datatables and remove duplicates from older file?

3 Answers3