0

I'm writing code for a simple text miner that encapsulates a few regex's and whos goal is to essentially tick off a certain indicator if that regex is identified (simple enough). Problem is a few of these regex's include special characters ([,],/) that have to be escaped and therefore require a double backslash infront of them to work (I'm using impala as my db, not sure if regex syntax vary between db's). For example (from my R code):

mutate(indicator = case_when(
       sql("(body) RLIKE '(confirmation[ :\\[\\]review]*n[^\\/y])'") ~ "NULL",
       sql("(body) RLIKE '(confirmation[ :\\[\\]review]*y[^\\/n])|(other similar regex)'") ~ "Y", ...,
       TRUE ~ "NULL")) %>% ....

The framework of what I need is the word "confirmation", followed by any of the characters ( ,:,[,],r,e,v,i,e,w), preceded by the letter "y", and not followed by (/,n). This regex works for what I need (if I query in sql). The problem arises when I try the same query in R using dbplyr (code snipet shown below), it seemingly doesn't identify the same substrings I just saw it do in sql. Investigating further (using the show_query argument), I noticed that the dbplyr query removes by default one of the \ in my escape argument:

....
                    CASE
WHEN ((body) RLIKE '(confirmation[ :\[\]review]*n[^\/n])') THEN 'NULL'
WHEN ((body) RLIKE '(confirmation[ :\[\]review]*y[^\/n])|(more similar expressions) THEN 'Y'
WHEN .......
ELSE 'NULL'
END AS `coverage_indicator`, ....

I'm not an expert in regex so I don't know the consequences of this but there are some backslashes in the text I'm mining that cannot be part of those unallowed characters (they're inevitably there). Are there any ways to avoid this disappearing \ or workarounds that allow me to achieve what I need out of the regex?

Note: The entire query is quite long and there is are lot of parts unimportant to this question, If need be I can include more.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
A03
  • 33
  • 5
  • 2
    But generally in R's regex, you need to double-escape backslashes as a backslash is an escape character in both regex **and** in R. So I would either doubling your backslashes in R, `"(body) RLIKE '(confirmation[ :\\\\[\\\\]review]*n[^\\\\/y])'")` or using raw strings `r"{(body) RLIKE '(confirmation[ :\\[\\]review]*n[^\\/y])'}")` – Gregor Thomas Aug 16 '23 at 03:55
  • If you need more help, please provide a small sample input and desired output as well as a working version of the code to test. As to *"The entire query is quite long and there is are lot of parts unimportant to this question, If need be I can include more"* - we certainly don't need more to the query, just some sample data! – Gregor Thomas Aug 16 '23 at 03:56
  • @GregorThomas This quadruple backslash worked perfectly (the sql query came up with a double backslash). Thx! – A03 Aug 16 '23 at 13:13

0 Answers0