
I would like to wrap mget() in a simple function so that it returns an unnamed list, and use this function within data.table j.

I've printed the names in the environment of the expression passed to data.table j. I found that this environment differs depending on whether j calls my function or calls unname(mget()) directly. I have experimented with inherits, but used inherits=F here to be strict about where the relevant variables are looked up.

This approach works:

library(data.table); library(purrr) 
# a list of functions the user can access
functionDictionary <- list(
  sum = sum,
  weighted_sum = function(x,y) sum(x)/sum(y)
)

grouping_vars <- c('cyl', 'vs')

# user defines here which calculations they wish to make with which 
# columns
userList <- list(
  reactive = list(names = c('my_var1', 'my_var2'),
                  calculations = list(
                    sum = c('hp'),
                    weighted_sum=c('hp', 'mpg')
                  ))
)

mtcars <- data.table(mtcars)
mtcars[,
         {
           env <- environment() # get env in datatable j
           print('grouping')
           print(names(env))
           functionList <- 
             map2(names(userList[['reactive']]$calculations), 
                  userList[['reactive']]$calculations,        
                    ~ do.call(functionDictionary[[.x]],   
                              unname(mget(.y, envir=env, 
                                          inherits=F)))
             )
           functionList # last expression in `{` is returned
         }
         , 
         by=grouping_vars
         ]


However, wrapping mget() in a simple function fails to find 'hp'; indeed, 'hp' is not listed in the environment of the expression passed to data.table j.


mget_unnamed <- function(x,...) unname(mget(x, inherits=F, ...))

mtcars[,
         {
           env <- environment() # get env in datatable j
           print('grouping')
           print(names(env))
           functionList <- 
             map2(names(userList[['reactive']]$calculations), 
                  userList[['reactive']]$calculations,        
                    ~ do.call(functionDictionary[[.x]],   
                             mget_unnamed(.y, envir=env))
             )
           functionList # last expression in `{` is returned
         }
         , 
         by=grouping_vars
         ]

The error is: "Error: value for ‘hp’ not found."

Hayden Y.
  • My assumption since I'm not familiar with `data.table`'s internals: if `data.table` can explicitly "see" that `mget` is used in your code, it will expose all variables from the table. Your `mget_unnamed` function "hides" `mget`, so `data.table` only exposes what it believes the code still uses based on what it sees. Other R functions work similarly, [for example `nls`](https://stackoverflow.com/questions/56704424/how-do-estimation-commands-find-variable-names-in-formulas-in-r). – Alexis Aug 16 '19 at 21:23
  • I don't see a library call to the 'data.table' package. Also `map2` is from a non-base package so that call throws an error in a clean session. – IRTFM Aug 16 '19 at 22:22
  • @42 Thanks. I've added the library call to those packages in the code. – Christopher Aug 16 '19 at 22:35
  • @Alexis You are correct. Using get in the body of the function passed to data.table j above unmasks 'hp'. e.g. { env <- environment(); N <- get('.N'); ... } – Christopher Aug 16 '19 at 22:51
  • Yeah, get/mget has side effects. https://github.com/Rdatatable/data.table/issues/3044#issuecomment-420722940 I try to avoid using it (eg, using quote/substitute to construct a call with the desired column names in it) – Frank Aug 16 '19 at 22:51
  • @Frank It looks like you were credited here for the use of substitute instead of get(): https://stackoverflow.com/questions/48234064/getx-does-not-work-in-r-data-table-when-x-is-also-a-column-in-the-data-table. I imagine you would do something similar? It also appears that ~do.call(functionDictionary[[.x]], lapply(.y, function(z) eval(as.symbol(z)))) will work here without the need to declare environments. Any recommendations against using eval(as.symbol()) as opposed to quote/substitute? – Christopher Aug 16 '19 at 23:10
  • @ChristopherCampbell Okay, I've thought more about it and posted an answer. If the eval(as.symbol()) way works for you, you might as well use it, but I expect you'll eventually find it useful to know the quote/substitute/eval way as well. (Btw, I use as.name below, but it is exactly the same as as.symbol, so there is no reason to choose one over the other.) – Frank Aug 17 '19 at 17:06
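
The behavior described in the comments above can be checked without data.table. This is a minimal sketch, assuming (as the comments and the linked issue suggest) that data.table scans the unevaluated j expression for names such as get and mget before deciding which columns to expose. Base R's all.vars() with functions = TRUE performs that kind of scan. The names j_direct and j_wrapped are made up for illustration.

```r
# The two j expressions from the question, captured unevaluated
j_direct  <- quote(unname(mget(cols, envir = env, inherits = FALSE)))
j_wrapped <- quote(mget_unnamed(cols, envir = env))

# List every name in each expression, including function names
direct_names  <- all.vars(j_direct,  functions = TRUE)
wrapped_names <- all.vars(j_wrapped, functions = TRUE)

"mget" %in% direct_names   # TRUE:  mget is visible in the expression
"mget" %in% wrapped_names  # FALSE: the wrapper hides it
```

If the assumption holds, only the first expression signals that arbitrary columns may be needed, which would explain why 'hp' is missing in the wrapped version.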

1 Answer


Here's one way:

ff = function(d, g, uL, dict = functionDictionary){
  r    = uL$reactive
  nms  = r$names
  fns  = names(r$calculations)
  cols = r$calculations

  exprs = lapply(setNames(seq_along(nms), nms), function(ii){
    fx = substitute(dict[[f]], list(f=fns[[ii]]))
    cx = lapply(cols[[ii]], as.name)
    as.call(c(fx, cx))
  })
  cat("The expressions:\n"); print(exprs)

  call = as.call(c(as.name("list"), exprs))
  cat("The call:\n"); print(call)

  d[, eval(call), by=g]
} 

Usage:

ff(mtcars, grouping_vars, userList)

The expressions:
$my_var1
dict[["sum"]](hp)

$my_var2
dict[["weighted_sum"]](hp, mpg)

The call:
list(my_var1 = dict[["sum"]](hp), my_var2 = dict[["weighted_sum"]](hp, 
    mpg))
   cyl vs my_var1   my_var2
1:   6  0     395  6.401945
2:   4  1     818  3.060232
3:   6  1     461  6.026144
4:   8  0    2929 13.855251
5:   4  0      91  3.500000

Comment. The map2 function from purrr has non-standard evaluation (NSE) of its own (with ~, .x and .y, as seen in the OP) in addition to data.table's NSE, so things might get messy even if you find a workaround for a particular case (such as eval(as.symbol(z)), which the OP mentions works here).
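
For reference, here is a minimal sketch of the eval(as.symbol(z)) workaround mentioned above, written with base lapply instead of map2 to keep purrr's NSE out of the picture. It assumes data.table exposes all columns to j when it sees eval in the expression; dt, cols and ratio are names made up for illustration.

```r
library(data.table)

dt   <- as.data.table(mtcars)
cols <- c("hp", "mpg")   # column names supplied as strings

res <- dt[, {
  # Turn each string into a symbol and evaluate it; the lookup reaches
  # the group's columns through the enclosing j environment
  args <- lapply(cols, function(z) eval(as.symbol(z)))
  list(ratio = do.call(function(x, y) sum(x) / sum(y), args))
}, by = cyl]

print(res)
```

This relies on where the anonymous function is defined, so moving it outside of j (for example into a helper like mget_unnamed) may break the lookup again.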

I find the base R tools (like quote and substitute) generalize to more of my use cases; and eval is the standard approach to meta-programming with data.table and will allow use of its various optimizations. If those optimizations are important for your use case, you might want to look into changing the functionDictionary interface, since with verbose=TRUE we can see that only the second call below gets "GForce" optimization:

mtcars[, functionDictionary[["sum"]](hp), by=cyl, verbose=TRUE]
# ...
# lapply optimization is on, j unchanged as 'functionDictionary[["sum"]](hp)'
# GForce is on, left j unchanged
# ...
mtcars[, sum(hp), by=cyl, verbose=TRUE]
# ...
# lapply optimization is on, j unchanged as 'sum(hp)'
# GForce optimized j to 'gsum(hp)'
# ...
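
One possible way to change the interface, as a sketch: store function names as strings and splice them into the call as symbols, so that j reads sum(hp) rather than functionDictionary[["sum"]](hp). The helper build_j and the layout of spec are hypothetical. Whether GForce actually applies depends on the function and the data.table version, so check the verbose=TRUE output.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Hypothetical spec: output name -> function name and column names, as strings
spec <- list(
  total_hp = list(fn = "sum",  cols = "hp"),
  mean_mpg = list(fn = "mean", cols = "mpg")
)

# Build list(total_hp = sum(hp), mean_mpg = mean(mpg)) as an unevaluated call
build_j <- function(spec) {
  exprs <- lapply(spec, function(s) {
    as.call(c(as.name(s$fn), lapply(s$cols, as.name)))
  })
  as.call(c(as.name("list"), exprs))
}

j_call <- build_j(spec)
print(j_call)

# eval() at the top level of j lets data.table see the constructed call
res <- dt[, eval(j_call), by = cyl, verbose = TRUE]
```

The trade-off is that only functions reachable by name can be used this way; custom functions such as weighted_sum would need to be defined as named objects first.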
Frank
  • Thanks! I learned a lot from this post. I'll accept your answer once my edit is approved. It seems like the best part of quote/substitute is that you can produce calls which are very similar to the 'typical' `data.table` syntax. My use case is to template shiny app production, so optimization is a concern, and so the _comment_ is useful as well. – Christopher Aug 17 '19 at 18:38