I have the following two data.tables:
library(data.table)
dt1 = data.table(index_column = c(12, 17, 29, 34, 46), column1 = c("dog", "cat", "bird", "elephant", "bird"), column2 = c(482, 391, 567, 182, 121))
dt2 = data.table(index_column = c(17, 29, 46), column1 = c("cat", "penguin", "bird"))
> dt1
   index_column  column1 column2
1:           12      dog     482
2:           17      cat     391
3:           29     bird     567
4:           34 elephant     182
5:           46     bird     121
> dt2
   index_column column1
1:           17     cat
2:           29 penguin
3:           46    bird
Upon merging these two data.tables solely on the shared index_column with
merged = merge(dt1, dt2, by = "index_column", all = TRUE)
the resulting data.table is:
   index_column column1.x column2 column1.y
1:           12       dog     482        NA
2:           17       cat     391       cat
3:           29      bird     567   penguin
4:           34  elephant     182      bird
5:           46      bird     121      bird
I am interested in replacing the values of column1.x with the values of column1.y wherever column1.y is not NA. Most of the values will be the same, but those that differ (e.g. bird/penguin) should be replaced.
One could do this with a vectorised base R assignment, e.g.
replace_idx = !is.na(merged$column1.y) & merged$column1.x != merged$column1.y
merged$column1.x[replace_idx] = merged$column1.y[replace_idx]
merged$column1.y = NULL
My concern is that this isn't a very data.table-style solution, and it will not scale well if the data.table has millions of rows.
How would one conditionally replace the values of one column based on another column in an R data.table? Is it more efficient to simply replace one with the other, ignoring the NAs?
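For completeness, here is a minimal sketch of the kind of data.table approach I have in mind (untested on large data; it assumes the merged table above, and the second variant assumes the merge step itself can be skipped):
# Update column1.x by reference only where column1.y is present,
# then drop the helper column.
merged[!is.na(column1.y), column1.x := column1.y]
merged[, column1.y := NULL]
# Alternatively, avoid the merge entirely with an update join:
# for rows of dt1 matched by dt2 on index_column, overwrite
# column1 with dt2's value (i.column1 refers to dt2's column1).
dt1[dt2, on = "index_column", column1 := i.column1]
My understanding is that := updates by reference rather than copying the whole table, which is why I expect this to scale better, but I would appreciate confirmation.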