31

I want to build a regular expression by substituting in some strings to search for, so these strings need to be escaped before I put them in the regex. That way the pattern still works if a searched-for string contains regex metacharacters.

Some languages have functions that will do this for you (e.g. Python's re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?

For example (made up function):

x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
Community
Corvus

5 Answers

31

I've written an R version of Perl's quotemeta function:

library(stringr)
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}

I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.

Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:

This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:

$pattern =~ s/(\W)/\\$1/g;

As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):

Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.

which reinforces my point that this solution is only guaranteed for PCRE.
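For readers without stringr, the same substitution can be sketched with base R's gsub. This is an illustrative variant rather than part of the answer above: the name quotemeta_base is made up here, and perl = TRUE is passed throughout because the escaping is only guaranteed for PCRE.

```r
# Hypothetical base-R variant of quotemeta(); no packages required.
quotemeta_base <- function(string) {
  # Prefix every non-word character with a backslash (PCRE semantics).
  gsub("(\\W)", "\\\\\\1", string, perl = TRUE)
}

x <- "foo[bar]"
pattern <- quotemeta_base(x)
pattern                                            # "foo\\[bar\\]"

# Unescaped, "[bar]" is a character class, so "foob" matches by accident.
grepl(x, "foob", perl = TRUE)                      # TRUE
# Escaped, only the literal text matches.
grepl(pattern, "foob", perl = TRUE)                # FALSE
grepl(pattern, "see foo[bar] here", perl = TRUE)   # TRUE
```

The escaped string can then be pasted into a larger pattern, as long as that pattern is also matched with perl = TRUE.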

Ryan C. Thompson
  • Ryan, maybe I didn't understand the correct usage of your function, but it fails when I try to escape a regex for removing whitespace: `quotemeta('\s+')`. How can I manage it? – patL Mar 21 '22 at 10:12
18

Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':

gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
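To try that definition without installing Hmisc, the gsub call can be wrapped in a small local helper. This is only a sketch: the name escape_regex_sketch is hypothetical, and the body is simply the expression quoted above.

```r
# Hypothetical local copy of the definition quoted above.
escape_regex_sketch <- function(string) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
}

escaped <- escape_regex_sketch(c("foo[bar]", "a.b*c"))
escaped                          # "foo\\[bar\\]" "a\\.b\\*c"

# The escaped pattern matches the literal text only.
grepl(escaped[1], "foo[bar]")    # TRUE
grepl(escaped[1], "foob")        # FALSE
```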

My previous answer:

I'm not sure if there is a built-in function, but you could write one to do what you want. This creates a vector of the patterns you want to replace and a vector of their replacements, then loops through them, making each replacement in turn.

re.escape <- function(strings){
    vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)", 
              "\\{", "\\}", "\\^", "\\$","\\*", 
              "\\+", "\\?", "\\.", "\\|")
    replace.vals <- paste0("\\\\", vals)
    for(i in seq_along(vals)){
        strings <- gsub(vals[i], replace.vals[i], strings)
    }
    strings
}

Some output

> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"  
Dason
  • This is not a good solution. You would have to include every single special regexp character in `vals`, which could get difficult. – Ryan C. Thompson Feb 12 '13 at 20:23
  • @RyanThompson Sure - but it's a start. And the list of special characters is finite so it's not a terribly huge burden. I'm not saying this is an optimal solution - just that it's one possibility. Also note that your method might escape characters that aren't typically considered regex characters so that might be considered 'bad' as well. – Dason Feb 12 '13 at 21:54
  • 1
    My method might escape some characters that don't need to be escaped, but doing so won't hurt, since for PCREs *any* non-alphanumeric character is taken as a literal when prefixed by a backslash, even if the backslash is not needed. – Ryan C. Thompson Feb 12 '13 at 22:12
  • The other fatal flaw in this method is that it successively applies its escapes rather than all at once, so changes made in one pass can get garbled by the next pass. – Ken Williams Sep 02 '16 at 15:45
  • @KenWilliams Can you give an example of where the given answer would fail because of the issue you're bringing up? – Dason Sep 02 '16 at 15:57
  • 2
    You're right, I think it works as expected. I didn't look close enough to notice that the backslash is the first replacement in the list, and since the backslash is also the only character added by the `gsub()`, you'll never insert a character and then act on the insertion. – Ken Williams Sep 02 '16 at 17:58
  • what worked well for me was `gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)`, Ryan's answer did okay but gave many false positives (random example `¤`) – stevec Mar 17 '20 at 12:52
5

An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
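A minimal sketch of that idea follows. The helper name quote_literal is made up for illustration, and perl = TRUE is used here as an assumption so that the PCRE engine, which documents \Q...\E quoting, handles the pattern. One caveat: if the input itself contains \E, the quoting ends early, so this approach suits strings you control.

```r
# Hypothetical helper: wrap the string in \Q ... \E so the regex engine
# treats everything in between as literal text.
quote_literal <- function(string) {
  paste0("\\Q", string, "\\E")
}

x <- "foo[bar]"
pattern <- quote_literal(x)
pattern                                            # "\\Qfoo[bar]\\E"

grepl(pattern, "see foo[bar] here", perl = TRUE)   # TRUE
grepl(pattern, "foob", perl = TRUE)                # FALSE
```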

Kalin
Paul Lemmens
2

Use the rex package

These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:

library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")

But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:

x = "foo[bar]"
y = rex(start, x, end)

Now y is ^foo\[bar\]$ and will only match the exact string contained in x.

Ryan C. Thompson
1

According to ?regex:

The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).

Therefore, using a capture group, (\\W), we can detect each non-word character and escape it with the \\1 backreference syntax:

> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"

Or, equivalently, replace "(\\W)" with "([^[:alnum:]_])".
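The POSIX-class variant can be sketched as a small function. The name escape_nonword is hypothetical. The final match uses perl = TRUE as a precaution, since escaping characters such as spaces is only guaranteed to be harmless under PCRE.

```r
# Hypothetical helper using the POSIX class instead of \W.
escape_nonword <- function(string) {
  gsub("([^[:alnum:]_])", "\\\\\\1", string)
}

x <- "foo[bar] (v1.0)"
escape_nonword(x)    # "foo\\[bar\\]\\ \\(v1\\.0\\)"

# For this ASCII input both spellings give the same result.
identical(escape_nonword(x), gsub("(\\W)", "\\\\\\1", x))              # TRUE

grepl(escape_nonword(x), "release foo[bar] (v1.0) notes", perl = TRUE)  # TRUE
```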

antonio