Problem

I'm trying to write a function that filters rows of a disk.frame object using regular expressions, but I run into problems with how my search string is evaluated in the filtering call. My idea was to pass a regular expression as a string through a function argument (e.g. storm_name) and then use that argument in the filter. For the filtering itself I use the %like% operator from {data.table}.

My problem is that storm_name gets evaluated inside the disk.frame. Since storm_name exists only in the function environment and not in the disk.frame object, I get the following error:

Error in .checkTypos(e, names_x) : 
  Object 'storm_name' not found amongst name, year, month, day, hour and 8 more

I already tried to evaluate the storm_name object in the parent frame using eval(storm_name, env = parent.env()), but that didn't help either. Interestingly, this problem only occurs with {disk.frame} objects, not with {data.table} objects.

For now I found a solution using {dplyr} instead. However, I would be grateful for any ideas on how this problem could be solved with {data.table}.

Reproducible Example

# Load packages
library(data.table)
library(disk.frame)

# Create data table and diskframe object of storm data
storms_df <- as.disk.frame(storms)
storms_dt <- as.data.table(storms)

# Create search function
grep_storm_name <- function(dfr, storm_name){
  
  dfr[name %like% storm_name]
  
}

# Check function with data.table object
grep_storm_name(storms_dt, "^A")

# Check function with diskframe object
grep_storm_name(storms_df, "^A")

Session Info

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_Sweden.1252  LC_CTYPE=English_Sweden.1252    LC_MONETARY=English_Sweden.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Sweden.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] disk.frame_0.5.0  purrr_0.3.4       dplyr_1.0.7       data.table_1.14.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7            benchmarkmeData_1.0.4 pryr_0.1.4            pillar_1.6.4         
 [5] compiler_4.1.0        iterators_1.0.13      tools_4.1.0           digest_0.6.27        
 [9] bit_4.0.4             jsonlite_1.7.2        lifecycle_1.0.1       tibble_3.1.6         
[13] lattice_0.20-44       pkgconfig_2.0.3       rlang_0.4.12          Matrix_1.3-3         
[17] foreach_1.5.1         rstudioapi_0.13       DBI_1.1.1             parallel_4.1.0       
[21] bigassertr_0.1.4      bigreadr_0.2.4        httr_1.4.2            stringr_1.4.0        
[25] globals_0.14.0        generics_0.1.1        fs_1.5.0              vctrs_0.3.8          
[29] bit64_4.0.5           grid_4.1.0            tidyselect_1.1.1      glue_1.6.0           
[33] listenv_0.8.0         R6_2.5.1              future.apply_1.7.0    parallelly_1.25.0    
[37] fansi_1.0.0           magrittr_2.0.1        codetools_0.2-18      ellipsis_0.3.2       
[41] fst_0.9.4             assertthat_0.2.1      future_1.21.0         benchmarkme_1.0.7    
[45] utf8_1.2.2            stringi_1.7.6         doParallel_1.0.16     crayon_1.4.2 

2 Answers

While I don't know the exact cause of this, it has to do with environments, search path, etc. For instance, these work:

storms_df[name %like% "^A"]

nm <- "^A"
storms_df[name %like% nm]

grep1 <- function(dfr, storm_name) { dfr[name %like% "^A"]; }
grep1(storms_df)

But this does not:

grep2 <- function(dfr, storm_name) { dfr[name %like% storm_name]; }
grep2(storms_df, "^A")
# Error in .checkTypos(e, names_x) : 
#   Object 'storm_name' not found amongst name, year, month, day, hour and 8 more

We can work around this with eval(substitute(..)).

grep3 <- function(dfr, storm_name) { 
  eval(substitute(dfr[name %like% storm_name], list(storm_name = storm_name)))
}
grep3(storms_df, "^A")
#        name  year month   day  hour   lat  long              status category  wind pressure ts_diameter hu_diameter
#      <char> <num> <num> <int> <num> <num> <num>              <char>    <ord> <int>    <int>       <num>       <num>
#   1:    Amy  1975     6    27     0  27.5 -79.0 tropical depression       -1    25     1013          NA          NA
#   2:    Amy  1975     6    27     6  28.5 -79.0 tropical depression       -1    25     1013          NA          NA
#   3:    Amy  1975     6    27    12  29.5 -79.0 tropical depression       -1    25     1013          NA          NA
# ...

(and grep3(storms_dt, "^A") works too)

This works by replacing the symbol storm_name inside the [-expression with its value, the literal string. Because the replacement happens on the unevaluated expression, no lookup is needed later: nothing has to search this or any inherited environment to find storm_name.

If you check it manually:

debug(grep3)
grep3(storms_df, "^A")
# debugging in: grep3(storms_df, "^A")
# debug at #1: {
#     eval(substitute(dfr[name %like% storm_name], list(storm_name = storm_name)))
# }
# Browse[2]> 
substitute(dfr[name %like% storm_name], list(storm_name = storm_name))
# dfr[name %like% "^A"]
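
The same substitution can also be written with base R's bquote(), which evaluates anything wrapped in .() and splices the value into the otherwise unevaluated expression. The sketch below only builds and compares the two expressions, so it runs without data.table or disk.frame. build_subst() and build_bquote() are hypothetical helper names, and the bquote() form is an assumption that has not been tested against a disk.frame here.

```r
# Hypothetical helpers: each returns the unevaluated call; nothing is looked up yet.
build_subst <- function(dfr, storm_name) {
  # Replace the symbol storm_name with the value passed in
  substitute(dfr[name %like% storm_name], list(storm_name = storm_name))
}
build_bquote <- function(dfr, storm_name) {
  # .(storm_name) is evaluated in this function's frame and spliced in
  bquote(dfr[name %like% .(storm_name)])
}

e1 <- build_subst(NULL, "^A")
e2 <- build_bquote(NULL, "^A")
print(e1)
print(e2)
```

Wrapping either expression in eval() inside the function, as grep3 does, would then run the filter with the literal pattern already in place.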

I think it's something to do with how disk.frame is affecting the environment within [ and the calling/parent environments. Interestingly (to me), you can see that the search path for variables is not empty, it's just not what we would expect:

grep2 <- function(dfr, storm_name) { dfr[name %like% storm_name]; }
grep2(storms_df, "^A")
# Error in .checkTypos(e, names_x) : 
#   Object 'storm_name' not found amongst name, year, month, day, hour and 8 more

### but let's pre-define `storm_name` outside of the function,
### then re-define the function (no change)
storm_name <- "^A"
grep2 <- function(dfr, storm_name) { dfr[name %like% storm_name]; }
head(grep2(storms_df, "^A"), 2)
#      name  year month   day  hour   lat  long              status category  wind pressure ts_diameter hu_diameter
#    <char> <num> <num> <int> <num> <num> <num>              <char>    <ord> <int>    <int>       <num>       <num>
# 1:    Amy  1975     6    27     0  27.5   -79 tropical depression       -1    25     1013          NA          NA
# 2:    Amy  1975     6    27     6  28.5   -79 tropical depression       -1    25     1013          NA          NA

This seems to work, but it is using the external (global) storm_name instead of the function argument. In the next call, the names still start with A even though the argument was changed to "^B".

head(grep2(storms_df, "^B"), 2)
#      name  year month   day  hour   lat  long              status category  wind pressure ts_diameter hu_diameter
#    <char> <num> <num> <int> <num> <num> <num>              <char>    <ord> <int>    <int>       <num>       <num>
# 1:    Amy  1975     6    27     0  27.5   -79 tropical depression       -1    25     1013          NA          NA
# 2:    Amy  1975     6    27     6  28.5   -79 tropical depression       -1    25     1013          NA          NA

Frankly, I don't understand enough of disk.frame's internals to know if this is a bug or a necessity due to what it must do for non-standard data.table-like evaluation of a not-totally-in-memory dataset.
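
The symptom itself (argument not found, global value picked up instead) can be reproduced in base R whenever a captured expression is evaluated in an environment that does not inherit from the caller's frame. The following is a toy analog under that assumption only; it is not disk.frame's actual implementation, and eval_elsewhere(), worker_env and wrapper() are made-up names.

```r
# Toy analog only: NOT how disk.frame is implemented.
# eval_elsewhere() captures its argument unevaluated and evaluates it in a
# fixed environment instead of the caller's frame.
eval_elsewhere <- function(expr, env) {
  eval(substitute(expr), env)
}

# Stands in for "somewhere else"; it can see base functions but not wrapper()'s frame.
worker_env <- new.env(parent = baseenv())

wrapper <- function(pattern) {
  eval_elsewhere(grepl(pattern, "Amy"), worker_env)
}

# `pattern` lives only in wrapper()'s frame, so the lookup fails.
res1 <- tryCatch(wrapper("^A"), error = function(e) conditionMessage(e))
print(res1)

# Define `pattern` where the expression is evaluated: now that value is used,
# and the argument passed to wrapper() is silently ignored.
assign("pattern", "^B", envir = worker_env)
res2 <- wrapper("^A")
print(res2)
```

In this sketch the second call matches against "^B" rather than the "^A" that was passed in, which mirrors the grep2 behavior shown above.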


If you're concerned with performance (fair question), the eval(substitute(..)) method does not appear to suffer much:

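# Assumed setup (not shown in the original run): objects the expressions refer to
dfr <- storms_df
storm_name <- "^A"
bench::mark(
  raw = dfr[name %like% "^A"],
  subst = eval(substitute(dfr[name %like% storm_name], list(storm_name = storm_name))),
  iterations = 1000
)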
# # A tibble: 2 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                  memory               time               gc                  
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                  <list>               <list>             <list>              
# 1 raw          12.9ms   16.8ms      55.2    1.69MB     3.97   933    67      16.9s <data.table [990 x 13]> <Rprofmem [669 x 3]> <bench_tm [1,000]> <tibble [1,000 x 3]>
# 2 subst        12.8ms   15.8ms      60.5    1.69MB     3.25   949    51      15.7s <data.table [990 x 13]> <Rprofmem [669 x 3]> <bench_tm [1,000]> <tibble [1,000 x 3]>

In repeated benchmarks I've actually seen subst come out slightly faster, which suggests the difference is mostly noise rather than a cost of eval(substitute(..)). The gap shown here (55.2 vs 60.5 `itr/sec`) is the largest I've seen; a repeat just now gave 57.1 and 57.5, so I don't think performance degradation is a concern.

r2evans
  • Thank you for this workaround and your extensive explanation! It's quite interesting to see that `storm_name` is found when defined outside of the function call. That definitely seems like odd behavior, but I'm also not sure whether this is a bug or a side effect of some intended part of `{disk.frame}`. It would be interesting to see if someone with knowledge of `{disk.frame}`'s internals comes across this question. – Joshua Entrop Jan 21 '22 at 10:04
  • Instead of waiting, you might do better (faster resolution) if you post this as a bug on their repo. It's rarely a guarantee that a package maintainer sees (and replies to) SO questions about their package (with some notable exceptions). – r2evans Jan 21 '22 at 14:57
  • Yes, that is a good idea. I also posted this as an [issue](https://github.com/xiaodaigh/disk.frame/issues/369) on the `{disk.frame}` GitHub repository, if someone is interested. – Joshua Entrop Jan 24 '22 at 12:40

This now works as of disk.frame v0.6.
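
For code that has to run on both older and newer installations, one option is to check the installed version before deciding whether the eval(substitute(..)) workaround is needed. This is a minimal sketch: needs_workaround() is a hypothetical helper, and reading "v0.6" as "0.6.0 or later" is an assumption.

```r
# Hypothetical helper. Assumption: "v0.6" means version 0.6.0 or later.
# The default argument is only evaluated when no version is supplied,
# so the function can be tried without disk.frame installed.
needs_workaround <- function(version = utils::packageVersion("disk.frame")) {
  version < "0.6.0"
}

old <- needs_workaround(package_version("0.5.0"))
new <- needs_workaround(package_version("0.6.0"))
print(c(old = old, new = new))
```

Calling needs_workaround() with no argument would look up the installed disk.frame version.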

r2evans
xiaodai