I have the following two data.tables:
library(data.table)
dt1 = data.table(index_column = c(12, 17, 29, 34, 46), column1 = c("dog", "cat", "bird", "elephant", "bird"), column2 = c(482, 391, 567, 182, 121))
dt2 = data.table(index_column = c(17, 29, 46), column1 = c("cat", "penguin", "bird"))
> dt1
   index_column  column1 column2
1:           12      dog     482
2:           17      cat     391
3:           29     bird     567
4:           34 elephant     182
5:           46     bird     121
> dt2
   index_column column1
1:           17     cat
2:           29 penguin
3:           46    bird
Upon merging these two data.tables solely on the shared index_column with
merged = merge(dt1, dt2, by = "index_column", all = TRUE)
the resulting data.table is:
   index_column column1.x column2 column1.y
1:           12       dog     482        NA
2:           17       cat     391       cat
3:           29      bird     567   penguin
4:           34  elephant     182      bird
5:           46      bird     121      bird
I am interested in replacing the values of column1.x with the values of column1.y wherever column1.y is not NA. Most of the values will be the same, but those that differ (e.g. bird/penguin) should be replaced.
One could do this with a vectorised base R assignment, e.g.
replace_idx = !is.na(merged$column1.y) & merged$column1.x != merged$column1.y
merged$column1.x[replace_idx] = merged$column1.y[replace_idx]
merged$column1.y = NULL
My concern is that this isn't a very data.table-style solution, and it will not scale well if the data.table has millions of rows.
How would one conditionally replace the values of one column based on another column in an R data.table? Is it more efficient to simply replace one with the other, ignoring the NAs?
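For completeness, here is a minimal sketch of the kind of data.table approach I have in mind (untested on large data; it assumes the merged table above, and the second variant assumes the merge step itself can be skipped):
# Update column1.x by reference only where column1.y is present,
# then drop the helper column.
merged[!is.na(column1.y), column1.x := column1.y]
merged[, column1.y := NULL]
# Alternatively, avoid the merge entirely with an update join:
# for rows of dt1 matched by dt2 on index_column, overwrite
# column1 with dt2's value (i.column1 refers to dt2's column1).
dt1[dt2, on = "index_column", column1 := i.column1]
My understanding is that := updates by reference rather than copying the whole table, which is why I expect this to scale better, but I would appreciate confirmation.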