-1

I have two data tables. I'm merging them in 4 common columns (Main key=col1). However, many rows in dt1 have NA in the 3 subkeys (col2, col3, col4). When merging, they keep NA instead of taking the existing value from the row in dt2 that they were matched with.

Example:

setkey(dt1, letter, number)
setkey(dt2, letter, number)
      dt1                     dt2
letter  number  size      letter  number  color
  a        10   big          a     10    blue
  b        NA   small        b     20    orange
  c        30   big          c     30    yellow
  d        40   big          d     40    red

dt_merged <- merge(dt1, dt2, all=TRUE)

      dt_merged
letter  number  size   color
     a     10   big    blue
     b     NA   small  orange
     c     30   big    yellow
     d     40   big    red

How can I condition the merge to take values from dt2 for these 3 columns (ex. col2,col3,col4) whenever dt1 has NA?

Edit: Added size column, since it previously seemed unnecessary to merge the DTs with the given values.

Aarón Gzz
  • 99
  • 6
  • Can you make your example reproducible? by using dput() and pasting in the output? – s_baldur Jul 09 '20 at 09:33
  • @sindri_baldur sorry, I don't quite understand what you mean. Do you mean to post the actual code and not the letter/number example? – Aarón Gzz Jul 09 '20 at 09:38
  • The idea is that you give us code that recreates dt1 and dt2. Here are some ideas: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – s_baldur Jul 09 '20 at 09:41
  • 1
    Ric S went ahead and did it for me in a separate answer. Sorry, I'll keep that in mind for any further questions I post here. – Aarón Gzz Jul 09 '20 at 13:03

2 Answers2

0

For this sample dataset, it is possible to use coalesce from the dplyr package to solve the issues when dt1 has NAs. If you have more columns (you mentioned col2, col3, col4) you should probably coalesce them all between the two datasets with the same rationale.

library(dplyr)

dt1 %>% 
  mutate(number = coalesce(number, dt2$number)) %>% 
  inner_join(dt2, by = c("letter", "number"))
#   letter number  size  color
# 1      a     10   big   blue
# 2      b     20 small orange
# 3      c     30   big yellow
# 4      d     40   big    red

Data

dt1 <- structure(list(letter = c("a", "b", "c", "d"), number = c(10L, 
NA, 30L, 40L), size = c("big", "small", "big", "big")), class = "data.frame", row.names = c(NA, -4L))

dt2 <- structure(list(letter = c("a", "b", "c", "d"), number = c(10L, 
20L, 30L, 40L), color = c("blue", "orange", "yellow", "red")), class = "data.frame", row.names = c(NA, -4L))
Ric S
  • 9,073
  • 3
  • 25
  • 51
  • I'm trying to coalesce the values but dt1 has integer and dt2 double. After using as.numeric for dt2, I still get an error message regarding the lengths (dt1 has length 151, dt2 has length 5). How can I get it going? – Aarón Gzz Jul 09 '20 at 10:52
  • `coalesce` doesn't make you recycle its inputs if they are different in length.. Then I'm afraid my solution won't work on your real data. I wrote it as from your sample data I expected the tables to have equal size – Ric S Jul 09 '20 at 10:56
  • I'm kind of stumped and don't know what else to try. Everything I attempt (mutate, copy values, coalesce, etc) throws an error because the length of values is different. How can this be? For example, both rows contain "0" and are type integer, but apparently different length and therefore won't be replaced. I don't understand. – Aarón Gzz Jul 09 '20 at 12:58
0

Here is an option:

cols <- paste0("col", 2L:4L)
#look up the missing values from dt2 first before merging
dt1[is.na(col2), (cols) := dt2[.SD, on=.(col1), mget(paste0("x.", cols))]]
merge(dt1, dt2, all=TRUE)

output:

   col1 col2 col3 col4  size  color
1:    a   10   10   10   big   blue
2:    b   20   20   20 small orange
3:    c   30   30   30   big yellow
4:    d   40   40   40   big    red

data:

dt1 <- data.table(col1 = c("a", "b", "c", "d"), 
    col2 = c(10L, NA, 30L, 40L), 
    col3 = c(10L, NA, 30L, 40L), 
    col4 = c(10L, NA, 30L, 40L), 
    size = c("big", "small", "big", "big"))

dt2 <- data.table(col1 = c("a", "b", "c", "d"), 
    col2 = c(10L, 20L, 30L, 40L), 
    col3 = c(10L, 20L, 30L, 40L), 
    col4 = c(10L, 20L, 30L, 40L), 
    color = c("blue", "orange", "yellow", "red"))
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • Could you explain this line of code?: ```dt1[is.na(col2), (cols) := dt2[.SD, on=.(col1), mget(paste0("x.", cols))]]``` I'm having trouble understanding and implementing it. What purpose does .SD and mget serve? – Aarón Gzz Jul 09 '20 at 14:55