4

I would like to filter a SQL database based whether a regular expression appears within any column. I would like to specify the regex as a variable; however it is read as a literal string. I am having trouble getting the regex in as a variable. Thank you for your help!

Resources I've consulted:

Note: I had trouble making a reprex using the mtcars dataset, following https://www.tidyverse.org/blog/2018/01/dbplyr-1-2/. I get the error: "Error: str_detect() is not available in this SQL variant". I cannot share a reprex using my actual SQL database. As such, below is a pseudo-reprex.

library(dplyr)
library(stringr)

# Variable with regex (either lower- or uppercase "m")
my_string <- "(?i)m"

# WITHOUT SQL DATABASE ----------------------------------------------------

# This runs
mtcars %>% 
  tibble::rownames_to_column() %>% 
  filter(str_length(rowname) > 5)

# This runs with STRING
mtcars %>% 
  tibble::rownames_to_column() %>% 
  filter(str_detect(rowname, "(?i)m"))

# This runs with VARIABLE
mtcars %>% 
  tibble::rownames_to_column() %>% 
  filter(str_detect(rowname, my_string))

# WITH SQL DATABASE -------------------------------------------------------

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, tibble::rownames_to_column(mtcars), "mtcars")

# This runs
tbl(con, "mtcars") %>% 
  filter(str_length(rowname) > 5)

# This *should* run with STRING -- pretend it does ;)
tbl(con, "mtcars") %>%
  filter(str_detect(rowname, "M"))

# This does NOT run with VARIABLE
tbl(con, "mtcars") %>%
  filter(str_detect(rowname, my_string))
maia-sh
  • 537
  • 4
  • 14
  • 2
    I'm voting to reopen this; it was closed as a duplicate of a question about using regex in a variable, but that's not the problem here. The OP is able to create & use their regex pattern dynamically—the problem comes with using it with a sqlite database. If that method doesn't apply to sqlite or sql in general, that's not clear from the duplicate target – camille Feb 21 '22 at 15:36
  • Thanks @camille. The issue is indeed with the interplay of stringr/dbplyr/sql and not using the regex in a variable - since that works just fine in a local database. Appreciate any ideas! I'm wondering whether I can force eval of the string with metaprogramming: https://adv-r.hadley.nz/metaprogramming.html – maia-sh Feb 21 '22 at 19:50

2 Answers2

3

This might depend a lot on the flavour of SQL you are using. This issue mentions a translation for str_detect and also provides an alternative.

Testing for SQL Server:

library(dbplyr)
library(dplyr)
library(stringr)

data(mtcars)
df_sim = tbl_lazy(mtcars, con = simulate_mssql())
my_string <- "(?i)m"

df_sim %>%
  filter(str_detect(my_string, gear)) %>%
  show_query()
# Error: str_detect() is not available in this SQL variant

df_sim %>%
  filter(gear %like% my_string) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`gear` like '(?i)m')

So it appears str_detect can not translate to SQL Server. But you can use %like% as a work around.

Testing for MySQL:

library(dbplyr)
library(dplyr)
library(stringr)

data(mtcars)
df_sim = tbl_lazy(mtcars, con = simulate_mysql()) # changed to mysql
my_string <- "(?i)m"

df_sim %>%
  filter(str_detect(my_string, gear)) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE ('(?i)m' REGEXP `gear`)

df_sim %>%
  filter(gear %like% my_string) %>%
  show_query()
# <SQL>
# SELECT *
# FROM `df`
# WHERE (`gear` like '(?i)m')

So it appears str_detect can be translated correctly for MySQL.

In every case my_string is translated into the query.

A couple of other things to check:

  • Updating the version of dbplyr. Older versions do not have all the translations in newer versions.
  • Try a column other than rownames. Data frames in R can have rownames but tables in SQL can only have columns.
Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
  • 1
    Thanks! I'm using a postgresql db and now have it working with `filter(gear %ILIKE% my_string)`. However, I actually need to filter across multiple dynamically selected columns: `filter(if_any(everything(), ~. %ILIKE% my_string))`. This then reverts to using "my_string" literally. Any suggestions? – maia-sh Feb 22 '22 at 15:19
  • Take a look at [this](https://stackoverflow.com/a/59867547/7742981) answer which uses a loop. And at [this](https://stackoverflow.com/questions/61789717/dplyr-case-when-with-dynamic-number-of-cases) question where multiple cases are built as strings and then passed into a dplyr function. – Simon.S.A. Feb 22 '22 at 19:51
3

With the help of a colleague, I have a solution to force evaluation of a variable with regex in string::str_detect:

tbl(con, "mtcars") %>%
  filter(str_detect(rowname, {{my_string}}))
maia-sh
  • 537
  • 4
  • 14