
I wonder if there is a way to prevent arrow from pulling data into R by default when it cannot find a suitable binding.

That is, instead of emitting the warning shown below and pulling data into R, I would like arrow to throw an error.

Is there an option I can tweak to get this behavior?

I know there is a list of active bindings I can consult in the arrow documentation. However, I would like the error behavior described above, so that I can iterate and experiment quickly without slipping into long computations outside the arrow framework.

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(stringr)

sessionInfo()
#> R version 4.1.3 (2022-03-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur/Monterey 10.16
#> 
#> Matrix products: default
#> BLAS:   /opt/R/4.1.3/Resources/lib/libRblas.0.dylib
#> LAPACK: /opt/R/4.1.3/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] stringr_1.4.1 dplyr_1.0.10  arrow_10.0.0 
#> 
#> loaded via a namespace (and not attached):
#>  [1] pillar_1.8.1      compiler_4.1.3    highr_0.9         R.methodsS3_1.8.2
#>  [5] R.utils_2.12.0    tools_4.1.3       digest_0.6.29     bit_4.0.4        
#>  [9] evaluate_0.16     lifecycle_1.0.2   tibble_3.1.8      R.cache_0.16.0   
#> [13] pkgconfig_2.0.3   rlang_1.0.6       reprex_2.0.2      DBI_1.1.3        
#> [17] cli_3.4.1         rstudioapi_0.14   yaml_2.3.5        xfun_0.32        
#> [21] fastmap_1.1.0     withr_2.5.0       styler_1.7.0      knitr_1.40       
#> [25] generics_0.1.3    fs_1.5.2          vctrs_0.4.2       bit64_4.0.5      
#> [29] tidyselect_1.1.2  glue_1.6.2        R6_2.5.1          fansi_1.0.3      
#> [33] rmarkdown_2.16    purrr_0.3.4       magrittr_2.0.3    htmltools_0.5.3  
#> [37] assertthat_0.2.1  utf8_1.2.2        stringi_1.7.8     R.oo_1.25.0

df <- tibble(
  date = c("28-Aug-21", "11-Mar-19")
)

df <- arrow::arrow_table(df)

df %>% 
  mutate(date = str_remove(date, "\\d{2}$"))
#> Warning: Expression str_remove(date, "\\d{2}$") not supported in Arrow; pulling
#> data into R
#> # A tibble: 2 × 1
#>   date   
#>   <chr>  
#> 1 28-Aug-
#> 2 11-Mar-

Created on 2022-11-08 with reprex v2.0.2

Many thanks for considering my request.

andreranza
  • I haven't seen that `pulling data` message. Can you make your question reproducible? (https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info) – r2evans Nov 08 '22 at 12:14
  • Thanks @r2evans! I have attached the reprex. – andreranza Nov 08 '22 at 12:47
  • If I read [this](https://arrow.apache.org/docs/r/articles/dataset.html#querying-the-dataset) correctly, it should throw an error, which is the behavior I want. However, I've read in other docs that "pulling" into R is also a possible outcome. – andreranza Nov 08 '22 at 12:55
  • Very nice reprex, btw, a great edit! – r2evans Nov 08 '22 at 13:00

1 Answer


The error is thrown when dealing with an on-disk object:

arrow::write_parquet(tibble(date = c("28-Aug-21", "11-Mar-19")), "~/StackOverflow/df.parquet")
ds <- arrow::open_dataset("~/StackOverflow/df.parquet")
ds %>%
  mutate(date = stringr::str_remove(date, "\\d{2}$"))
# Error: Expression stringr::str_remove(date, "\\d{2}$") not supported in Arrow
# Call collect() first to pull data into R.

I haven't read much about in-memory arrow objects such as you've defined, but I would suspect the rationale is that if it is already in memory, then the penalty of "pulling data into R" is not a concern.

Incidentally, str_remove (and I suspect most/all of stringr) is not supported by arrow. Correction: much of stringr is supported, but not str_remove. You have a few options to keep the expression "lazy"; use either of:

df %>%
  mutate(date = str_replace(date, "\\d{2}$", ""))
df %>%
  mutate(date = sub("\\d{2}$", "", date))

Both work on in-memory and on-disk objects.
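
If you want an error for in-memory tables as well, one possible workaround is to promote that specific warning to an error with base R's condition handling. The following is only a minimal sketch: `error_on_pull()` is a hypothetical helper (not part of arrow), and it assumes the fallback warning always contains the phrase "pulling data into R", as in the message shown in the question. The `fake_query()` stand-in is also hypothetical; it just emits the same kind of warning so the sketch runs without arrow installed.

```r
# Hypothetical helper (not part of arrow): evaluates `expr` and turns arrow's
# "pulling data into R" warning into an error. Other warnings pass through.
error_on_pull <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      msg <- conditionMessage(w)
      # Assumption: the fallback warning contains this phrase, as in the
      # message shown in the question (arrow 10.0.0).
      if (grepl("pulling data into R", msg, fixed = TRUE)) {
        stop(msg, call. = FALSE)
      }
    }
  )
}

# Hypothetical stand-in for an unsupported arrow query: it signals a warning
# with the same wording and then returns a value computed "in R".
fake_query <- function() {
  warning("Expression f(x) not supported in Arrow; pulling data into R",
          call. = FALSE)
  "result computed in R"
}

# The wrapper stops before the fallback result is returned.
res <- tryCatch(error_on_pull(fake_query()), error = function(e) e)
inherits(res, "error")
```

With the table from the question, the call would look like `error_on_pull(df %>% mutate(date = str_remove(date, "\\d{2}$")))`. A blunter alternative is base R's `options(warn = 2)`, which turns every warning in the session into an error, not only arrow's.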

r2evans
  • Thanks! I can see that `str_remove()` is not supported yet [here](https://arrow.apache.org/docs/r/reference/acero.html). So that was intentional because I would have desired an error there rather than a warning. I'll try with on disk objects as you suggested. Thanks again. – andreranza Nov 08 '22 at 13:02
  • Re: "I haven't read much about in-memory arrow objects such as you've defined, but I would suspect the rationale is that if it is already in memory, then the penalty of "pulling data into R" is not a concern." Yes, that's why. As for "`str_remove` (and I suspect most/all of `stringr`) is not supported by arrow," actually, most of stringr is supported by arrow. And `str_remove` would be easy to add--the docs even tell you how: https://stringr.tidyverse.org/reference/str_remove.html. If either of you were interested in submitting a pull request to add it, we'd be delighted to add it. – Neal Richardson Nov 08 '22 at 13:47
  • https://issues.apache.org/jira/browse/ARROW-14832 is the issue for adding str_remove, for the record. – Neal Richardson Nov 08 '22 at 13:52
  • Thanks for the verification (1) and correction (2), @NealRichardson. – r2evans Nov 08 '22 at 13:55