This is similar to Update values in data.table with values from another data.table and R data.table replacing an index of values from another data.table, except in my situation the number of variables is very large so I do not want to list them explicitly.
What I have is a large data.table
(let's call it dt_original
) and a smaller data.table
(let's call it dt_newdata
) whose IDs are a subset of the first and it has only some of the variables of the first. I would like to update the values in dt_original
with the values from dt_newdata
. For an added twist, I only want to update the values conditionally - in this case, only if the values in dt_newdata
are larger than the corresponding values in dt_original
.
For a reproducible example, here are the data. In the real world the tables are much larger:
library(data.table)
set.seed(0)
## This data.table with 20 rows and many variables is the existing data set
dt_original <- data.table(id = 1:20)
setkey(dt_original, id)
for(i in 2015:2017) {
varA <- paste0('varA_', i)
varB <- paste0('varB_', i)
varC <- paste0('varC_', i)
dt_original[, (varA) := rnorm(20)]
dt_original[, (varB) := rnorm(20)]
dt_original[, (varC) := rnorm(20)]
}
## This table with a strict subset of IDs from dt_original and only a part of
## the variables is our potential replacement data
dt_newdata <- data.table(id = sample(1:20, 3))
setkey(dt_newdata, id)
newdata_vars <- sample(names(dt_original)[-1], 4)
for(var in newdata_vars) {
dt_newdata[, (var) := rnorm(3)]
}
Here is a way of doing it using a loop and pmax
, but there has to be a better way, right?
for(var in newdata_vars) {
k <- pmax(dt_newdata[, (var), with = FALSE], dt_original[id %in% dt_newdata$id, (var), with = FALSE])
dt_original[id %in% dt_newdata$id, (var) := k, with = FALSE]
}
It seems like there should be a way using join syntax, and maybe the prefix i.
and/or .SD
or something like that, but nothing I've tried comes close enough to warrant repeating here.