5

I have searched for the canonical way to do what I'm trying but I seem to have little luck getting something working that is fast and elegant. In short, I have a large table with multiple value columns and want to multiply each by a corresponding factor from a lookup table. I cannot figure out how to dynamically pass in which columns I want multiplied by the lookup values, or how to refer to the lookup values in general outside of basic expressions.

Here is my example, I have it set up with 3 million rows with 10 value columns, this doesn't take too long and is somewhat representative of the data size (this will be implemented as part of a much larger loop, hence the emphasis on performance). There is also a lookup table with 6 levels and some assorted multipliers for our value_1:value_10 columns.

library(data.table)

setsize <- 3000000
value_num <- 10
factors <- c("factor_a", "factor_b", "factor_c", "factor_d", "factor_e", "factor_f")
random <- data.table(replicate(10, sample(factors, size = setsize,  replace = T))
                     , replicate(10, rnorm(setsize, mean = 700, sd = 50)))
lookup <- data.table("V1" = factors, replicate(10, seq(.90, 1.5, length.out = length(factors))))
wps <- paste("value", c(1:10), sep = "_")
names(random)[11:20] <- wps
names(lookup)[2:11] <- wps
setkeyv(random, "V1")
setkeyv(lookup, "V1")

Solution 1: It is fairly quick but I can't figure out how to generically refer to the i-columns like i.value_1 so I can pass them into a loop or better yet apply them all at once.

f <- function() {
  random[lookup, value_1 := value_1 * i.value_1, by = .EACHI]
  random[lookup, value_2 := value_2 * i.value_2, by = .EACHI]
  random[lookup, value_3 := value_3 * i.value_3, by = .EACHI]
  random[lookup, value_4 := value_4 * i.value_4, by = .EACHI]
  random[lookup, value_5 := value_5 * i.value_5, by = .EACHI]
  random[lookup, value_6 := value_6 * i.value_6, by = .EACHI]
  random[lookup, value_7 := value_7 * i.value_7, by = .EACHI]
  random[lookup, value_8 := value_8 * i.value_8, by = .EACHI]
  random[lookup, value_9 := value_9 * i.value_9, by = .EACHI]
  random[lookup, value_10 := value_10 * i.value_10, by = .EACHI]
}

system.time(f())

   user  system elapsed 
  0.184   0.000   0.181 

Solution 2: After I could not get solution 1 to be generic, I tried a set() based approach. However despite allowing me to specify the targeted value columns in the character vector wps, it is actually much much slower than the above. I know I am using it wrong but am unsure how to improve it to remove all the [.data.table overhead.

idx_groups <- random[,.(rowstart = min(.I), rowend = max(.I)), by = key(random)][lookup]
system.time(
for (i in 1:nrow(idx_groups)){
  rows <- idx_groups[["rowstart"]][i]:idx_groups[["rowend"]][i]
  for (j in wps) {
    set(random, i=rows, j=j, value= random[rows][[j]] * idx_groups[[j]][i])
  }  
})

   user  system elapsed 
  3.940   0.024   3.967 

Any advice on how to better structure these operations would be appreciated.

Edit: I'm very frustrated with myself for failing to try this obvious solution before posting this question:

system.time(
for (col in wps){
  random[lookup, (col) := list(get(col) * get(paste0("i.", col))), by = .EACHI, with = F]
})

   user  system elapsed 
  1.600   0.048   1.652 

which seems to do what I want with relative speed. However it is still 10x slower than the first solution above (I'm sure due to the repeated get()) so I'm still open to advice.

Edit 2: Replacing get() with eval(parse(text=col)) seems to have done the trick.

system.time(
for (col in wps){
  random[lookup, (col) := list(eval(parse(text=col)) * eval(parse(text=paste0("i.", col)))), by = .EACHI, with = F]
})
   user  system elapsed 
  0.184   0.000   0.185 

Edit 3: Several good working answers have been provided. Rafael's solution is probably best in the general case, though I will note that I could squeeze a few more milliseconds out of the call construction recommended by Jangorecki in exchange for a rather intimidating looking helper function. I've marked it as answered, thanks for the help everyone.

etrippler
  • 91
  • 1
  • 7
  • I would guess using `mget` instead of `eval(parse(...))` should achieve the the same result (but did not test it). If you have answered your question yourself please post the correct solution as "answer" (not as edit). THX :-) – R Yoda May 24 '16 at 05:59
  • 1
    This question can be helpful: https://stackoverflow.com/questions/30468455/dynamically-build-call-for-lookup-multiple-columns - you can try the latest solution from there, I believe it should be the most efficient, as avoids parsing and materialization fields from `get`. Feel free to self answer if you find faster/better solution then currently provided. – jangorecki May 24 '16 at 08:41
  • R Yoda, unfortunately `mget` doesn't work for me like `eval(parse(...))` does. Jangorecki, thanks for the link! I think your last solution there is the fastest I have seen and the ability to inspect the J expression makes it a little more intuitive as to what is going on. I will post my version of it as the answer to this question. – etrippler May 24 '16 at 14:11
  • Are the column names to be used for the calculations predefined at "design time" (= when you program your R code) or is it a requirement to support an arbitrary number of columns in the calculation by using some column selection logic (like a regular expression and numbering)? I am asking this because it affects the possible solution/answer... – R Yoda May 24 '16 at 17:41
  • The solution needs to support an arbitrary subset of predefined column names. So if we have `value_1:value_10` it may need to apply to all 10, 5 random, or a single one based on what is passed into the function. But we will know they are all called `value_1:value_10` in advance. – etrippler May 24 '16 at 19:22

3 Answers3

4

You can also use lapply:

cols <- noquote(paste0("value_",1:10))

random[lookup, (cols) := lapply (cols, function(x)  get(x) * get(paste0("i.", x))), by = .EACHI ]

In case your dataset is too big and you want to see a progress bar of your operation, you can use pblapply:

library(pbapply)

random[lookup, (cols) := pblapply(cols, function(x)  get(x) * get(paste0("i.", x))), by = .EACHI ]
rafa.pereira
  • 13,251
  • 6
  • 71
  • 109
  • This is only a few milliseconds slower than the call construction and avoids the need for a helper function, so thank you for the answer. I'm curious why you included `with = F` though, it seems like it's not needed and doesn't effect the timing? – etrippler May 24 '16 at 19:44
  • @etrippler, I'm glad to help. You're correct, there is no need to use `with = F`. [Here is an explanation on when to use it](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro.html) – rafa.pereira May 24 '16 at 19:52
  • 1
    Using `Map` here is much nicer: `(cols) := Map(\`*\`, mget(cols), mget(icols))` where `icols = paste0("i.", cols)`. – Arun May 24 '16 at 20:30
3

This is about 2x slower than text parsing/call construction, but is a bit more readable:

random[lookup, (wps) := Map('*', mget(wps), mget(paste0('i.', wps))), by = .EACHI]
eddi
  • 49,088
  • 6
  • 104
  • 155
2

Thanks to jangorecki for pointing out his answer here, which dynamically builds the J expression using a helper function and then evaluates all at once. It avoids the overhead of parsing/get and seems to be the fastest solution I am going to get. I also like the ability to manually specify the function being called (some instances I might want / instead of *) and to inspect the J expression before it is evaluated.

batch.lookup = function(x) {
  as.call(list(as.name(":="),x
               ,as.call(c(
                 list(as.name("list")),
                 sapply(x, function(x) call("*", as.name(x), as.name(paste0("i.",x))), simplify=FALSE)
               ))
  ))
}

print(batch.lookup(wps))

`:=`(c("value_1", "value_2", "value_3", "value_4", "value_5", 
"value_6", "value_7", "value_8", "value_9", "value_10"), list(value_1 = value_1 * 
    i.value_1, value_2 = value_2 * i.value_2, value_3 = value_3 * 
    i.value_3, value_4 = value_4 * i.value_4, value_5 = value_5 * 
    i.value_5, value_6 = value_6 * i.value_6, value_7 = value_7 * 
    i.value_7, value_8 = value_8 * i.value_8, value_9 = value_9 * 
    i.value_9, value_10 = value_10 * i.value_10))

system.time(
  random[lookup, eval(batch.lookup(wps)), by = .EACHI])

   user  system elapsed 
   0.14    0.04    0.18
etrippler
  • 91
  • 1
  • 7