
This is similar to Update values in data.table with values from another data.table and R data.table replacing an index of values from another data.table, except in my situation the number of variables is very large so I do not want to list them explicitly.

What I have is a large data.table (let's call it dt_original) and a smaller data.table (let's call it dt_newdata) whose IDs are a subset of the first's and which contains only some of its variables. I would like to update the values in dt_original with the values from dt_newdata. For an added twist, I only want to update the values conditionally - in this case, only if the values in dt_newdata are larger than the corresponding values in dt_original.

For a reproducible example, here are the data. In the real world the tables are much larger:

library(data.table)
set.seed(0)

## This data.table with 20 rows and many variables is the existing data set
dt_original <- data.table(id = 1:20)
setkey(dt_original, id)

for(i in 2015:2017) {
  varA <- paste0('varA_', i)
  varB <- paste0('varB_', i)
  varC <- paste0('varC_', i)

  dt_original[, (varA) := rnorm(20)]
  dt_original[, (varB) := rnorm(20)]
  dt_original[, (varC) := rnorm(20)]
}

## This table with a strict subset of IDs from dt_original and only a part of
## the variables is our potential replacement data
dt_newdata <- data.table(id = sample(1:20, 3))
setkey(dt_newdata, id)

newdata_vars <- sample(names(dt_original)[-1], 4)

for(var in newdata_vars) {
  dt_newdata[, (var) := rnorm(3)]
}

Here is a way of doing it using a loop and pmax, but there has to be a better way, right?

for(var in newdata_vars) {
  k <- pmax(dt_newdata[[var]], dt_original[id %in% dt_newdata$id][[var]])
  dt_original[id %in% dt_newdata$id, (var) := k]
}

It seems like there should be a way using join syntax, and maybe the `i.` prefix and/or `.SD` or something like that, but nothing I've tried comes close enough to warrant repeating here.

Brian Stamper
  • Yeah, your pmax looks a lot like `dt_original[dt_newdata, on=.(id), pmax(x.varB, i.varB)]`, the only problem being the insistence on passing "varB" as a character... Probably solvable by putting your data in long format instead of wide. – Frank Mar 01 '17 at 16:59
  • Please use `set.seed` to make this reproducible – akrun Mar 01 '17 at 17:00
  • Added `set.seed(0)`, thanks. – Brian Stamper Mar 01 '17 at 17:04
  • Think of `varB` as being a heterogeneous blend of several hundred variable names which is an arbitrary subset of the variables in `dt_original`. This example might make it look a little too regular. – Brian Stamper Mar 01 '17 at 17:09
  • Apologies for confusion, I made edits to the example to make the previous point more clear, now `newdata_vars` is more clearly an arbitrary subset of the variables. (Previously I just used the varB_yyyy variables). – Brian Stamper Mar 01 '17 at 17:27
  • 1
    It doesn't matter whether the names are heterogeneous. They're apparently all numeric (judging by your use of pmax), which means they'd work very well together in a single column in a long-form data set. That is `melt(dt_original, id="id")`. From there, you'd probably also want to split the var name into component parts...(variable and year). If you're interested in pursuing this, I'd recommend checking out Hadley's article on it: https://www.jstatsoft.org/article/view/v059i10 – Frank Mar 01 '17 at 18:23
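Frank's long-format suggestion could be sketched roughly like this. This is a hypothetical illustration with tiny toy tables, not code from the thread, and it assumes every value column is numeric:

```r
library(data.table)
set.seed(0)

## Toy stand-ins for the two tables (hypothetical names and sizes)
dt_original <- data.table(id = 1:4, varA_2015 = rnorm(4), varB_2015 = rnorm(4))
dt_newdata  <- data.table(id = c(2L, 4L), varB_2015 = c(5, -5))

## Reshape both to long form so all value columns share one "value" column
long_orig <- melt(dt_original, id.vars = "id", variable.factor = FALSE)
long_new  <- melt(dt_newdata,  id.vars = "id", variable.factor = FALSE)

## Update join: keep the larger of the old and new values per (id, variable)
long_orig[long_new, value := pmax(value, i.value), on = .(id, variable)]

## Back to wide format
dt_updated <- dcast(long_orig, id ~ variable)
```

In long form the variable names live in a single column, so no loop over column names is needed; the `variable` column could then be split further into its component parts (variable and year) if that helps downstream work.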

1 Answer


This code should work in the current format given your criteria.

dt_original[dt_newdata, names(dt_newdata) := Map(pmax, mget(names(dt_newdata)), dt_newdata)]

It joins on the IDs that match between the data.tables and then performs an assignment with `:=`. Because the right-hand side of `:=` needs to be a list, I use `Map` to run `pmax` pairwise over the columns, matched by the names of dt_newdata. Note that it is necessary that all names of dt_newdata are also present in dt_original.

Following Frank's comment, you can drop the id column from both the assignment target and the `Map` arguments using `[-1]`, since the IDs are join keys and don't need to be computed. Skipping the id column avoids one pass of `pmax` and also preserves the key on id. Thanks to @brian-stamper for pointing out the key preservation in the comments.

dt_original[dt_newdata,
            names(dt_newdata)[-1] := Map(pmax,
                                         mget(names(dt_newdata)[-1]),
                                         dt_newdata[, .SD, .SDcols=-1])]

Note that the use of `[-1]` assumes that the ID variable is located in the first position of dt_newdata. If it is elsewhere, you could change the index manually or use `grep`.
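To see the conditional update and the key preservation in action, here is a minimal sketch with a hypothetical five-row table and a single value column `x`:

```r
library(data.table)

## Hypothetical data: dt_newdata offers replacements for ids 2 and 4
dt_original <- data.table(id = 1:5, x = c(1, 5, 2, 8, 3))
setkey(dt_original, id)
dt_newdata <- data.table(id = c(2L, 4L), x = c(10, 1))
setkey(dt_newdata, id)

## Update join, taking the larger value and skipping the id column
dt_original[dt_newdata,
            names(dt_newdata)[-1] := Map(pmax,
                                         mget(names(dt_newdata)[-1]),
                                         dt_newdata[, .SD, .SDcols = -1])]

dt_original$x    # id 2 takes the new 10; id 4 keeps 8 because 8 > 1
key(dt_original) # still "id"
```

Only the rows of dt_original whose ids appear in dt_newdata are touched, and each of those keeps its original value whenever it is larger than the replacement.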

lmo
    Yeah, looks alright. Dunno if you want it on *all* columns, though; might need to exclude `id`. – Frank Mar 01 '17 at 18:50
  • I think you might have had it right before the edit - I'm trying it both ways and with the `[-1]` it looks like `pmax` is trying to compare all rows of `dt_original` against `dt_newdata`. – Brian Stamper Mar 02 '17 at 15:24
  • OK. I returned the original code and then added working code for ignoring the id column. – lmo Mar 02 '17 at 15:45
  • 1
    Yes, that's it, most excellent. Interesting side note - if you don't skip the ID column this will wipe out the fact that it is a key, so you have to `setkey` again on the result (if you want keyed data). The `[-1]` version preserves the key. – Brian Stamper Mar 02 '17 at 16:04