
I want to match a regular expression special character, \^$.?*|+()[{. I tried:

x <- "a[b"
grepl("[", x)
## Error: invalid regular expression '[', reason 'Missing ']''

(Equivalently stringr::str_detect(x, "[") or stringi::stri_detect_regex(x, "[").)

Doubling the character to escape it doesn't work:

grepl("[[", x)
## Error: invalid regular expression '[[', reason 'Missing ']''

Neither does using a backslash:

grepl("\[", x)
## Error: '\[' is an unrecognized escape in character string starting ""\["

How do I match special characters?


Some special cases of this question are old and well written enough that it would be cheeky to close them as duplicates of this one:
Escaped Periods In R Regular Expressions
How to escape a question mark in R?
escaping pipe ("|") in a regex

k-dubs
Richie Cotton

3 Answers


Escape with a double backslash

R treats backslashes as escape characters in character constants, and so do regular expressions. Hence the need for two backslashes when supplying a pattern as a character argument: the first one isn't actually a character in the string, but rather makes the second one into a literal backslash, which the regex engine then sees. You can see how escapes are processed using cat.

y <- "double quote: \", tab: \t, newline: \n, unicode point: \u20AC"
print(y)
## [1] "double quote: \", tab: \t, newline: \n, unicode point: €"
cat(y)
## double quote: ", tab:    , newline: 
## , unicode point: €

Further reading: Escaping a backslash with a backslash in R produces 2 backslashes in a string, not 1

To use special characters in a regular expression the simplest method is usually to escape them with a backslash, but as noted above, the backslash itself needs to be escaped.

grepl("\\[", "a[b")
## [1] TRUE
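
To make the two layers of escaping concrete, the sketch below checks how many characters the pattern really contains and then, assuming R 4.0.0 or later, writes the same pattern as a raw string so that only the regex-level backslash has to be typed. The variable names are made up for illustration.

```r
# The R string "\\[" holds two characters: a backslash and a bracket.
pattern <- "\\["
n_chars <- nchar(pattern)
n_chars
## [1] 2

# Raw strings (R >= 4.0.0) skip the string-level escaping.
raw_pattern <- r"(\[)"
same <- identical(pattern, raw_pattern)
same
## [1] TRUE

found <- grepl(raw_pattern, "a[b")
found
## [1] TRUE
```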

To match backslashes, you need to double escape, resulting in four backslashes.

grepl("\\\\", c("a\\b", "a\nb"))
## [1]  TRUE FALSE
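
A common practical case is replacing backslashes with forward slashes. A small sketch with a made-up Windows-style path:

```r
# Once parsed by R, this string contains two single backslashes.
path <- "C:\\temp\\file.txt"

# Four backslashes in the pattern match one literal backslash.
fixed_path <- gsub("\\\\", "/", path)
fixed_path
## [1] "C:/temp/file.txt"
```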

The rebus package contains constants for each of the special characters to save you mistyping slashes.

library(rebus)
OPEN_BRACKET
## [1] "\\["
BACKSLASH
## [1] "\\\\"

For more examples see:

?SpecialCharacters

Your problem can be solved this way:

library(rebus)
grepl(OPEN_BRACKET, "a[b")

Form a character class

You can also wrap the special characters in square brackets to form a character class.

grepl("[?]", "a?b")
## [1] TRUE

Two of the special characters have special meaning inside character classes: \ and ^.

Backslash still needs to be escaped even if it is inside a character class.

grepl("[\\\\]", c("a\\b", "a\nb"))
## [1]  TRUE FALSE

Caret only needs to be escaped if it is directly after the opening square bracket.

grepl("[ ^]", "a^b")  # matches spaces as well.
## [1] TRUE
grepl("[\\^]", "a^b") 
## [1] TRUE
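
The position rule for the caret can be checked directly. In this sketch (test strings are made up) the caret is literal when it comes later in the class and negates the class when it comes first:

```r
# A caret that is not first in the class is an ordinary character.
caret_later <- grepl("[?^]", c("a^b", "a?b", "ab"))
caret_later
## [1]  TRUE  TRUE FALSE

# A leading caret negates the class: "any character except a".
caret_first <- grepl("[^a]", c("aaa", "a^"))
caret_first
## [1] FALSE  TRUE
```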

rebus also lets you form a character class.

char_class("?")
## <regex> [?]

Use a pre-existing character class

If you want to match all punctuation, you can use the [:punct:] character class.

grepl("[[:punct:]]", c("//", "[", "(", "{", "?", "^", "$"))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

stringi maps this to the Unicode General Category for punctuation, so its behaviour is slightly different.

library(stringi)
stri_detect_regex(c("//", "[", "(", "{", "?", "^", "$"), "[[:punct:]]")
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

You can also use the cross-platform syntax for accessing a Unicode General Category.

stri_detect_regex(c("//", "[", "(", "{", "?", "^", "$"), "\\p{P}")
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

Use \Q \E escapes

Placing characters between \\Q and \\E makes the regular expression engine treat them literally rather than as regular expressions.

grepl("\\Q.\\E", "a.b")
## [1] TRUE
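
The same idea extends to longer literals held in a variable. This is a minimal sketch, assuming the literal text (here the made-up value "c++") does not itself contain \E, and using perl = TRUE so that the PCRE engine handles the quoting:

```r
needle <- "c++"

# Wrap the literal text in \Q ... \E so that + loses its special meaning.
quoted <- paste0("\\Q", needle, "\\E")
quoted_hits <- grepl(quoted, c("c++", "cc", "c+"), perl = TRUE)
quoted_hits
## [1]  TRUE FALSE FALSE
```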

rebus lets you write literal blocks of regular expressions.

literal(".")
## <regex> \Q.\E

Don't use regular expressions

Regular expressions are not always the answer. If you want to match a fixed string then you can do, for example:

grepl("[", "a[b", fixed = TRUE)
stringr::str_detect("a[b", stringr::fixed("["))
stringi::stri_detect_fixed("a[b", "[")
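
Fixed matching also works for replacement and for strings with several special characters. A short base R sketch with made-up inputs:

```r
# With fixed = TRUE the pattern "c++" is plain text, not a regex.
fixed_hits <- grepl("c++", c("c++", "cc"), fixed = TRUE)
fixed_hits
## [1]  TRUE FALSE

# The dot is replaced literally instead of matching any character.
replaced <- sub(".", "!", "a.b", fixed = TRUE)
replaced
## [1] "a!b"
```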
epo3
Richie Cotton

I think the easiest way to match characters like

\^$.?*|+()[

is to use R's built-in character classes. Consider the following, which cleans column headers from a data file that could contain spaces and punctuation characters:

library(stringr)
colnames(order_table) <- str_replace_all(colnames(order_table),"[:punct:]|[:space:]","")

This approach lets us string character classes together to match punctuation characters in addition to whitespace characters, which you would otherwise have to escape with \\ to detect. Note that stringr is built on stringi, so, as the stringi example in the first answer shows, its [:punct:] class does not match ^ or $. You can learn more about character classes in the cheatsheet below, and you can also type ?regexp for more information.

https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
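
For comparison, here is a base R sketch of the same header clean-up. The column names are made up and stand in for colnames(order_table). In base R the POSIX classes must sit inside a bracket expression, so both are combined in one pair of outer brackets:

```r
headers <- c("order id", "unit.price ($)", "qty?")

# Remove every punctuation or whitespace character in one pass.
clean_headers <- gsub("[[:punct:][:space:]]", "", headers)
clean_headers
## [1] "orderid"   "unitprice" "qty"
```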

petergensler

If you have a vector of values containing special regex metacharacters and you need to build an alternation from it, you can escape the values automatically with a helper such as:

regex.escape <- function(string) {
    gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}
x <- c("a[b", "c++", "d()e")
regex <- paste(regex.escape(x), collapse="|")
## => a\[b|c\+\+|d\(\)e
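
As a quick check of how the escaped alternation behaves, the sketch below repeats the helper so that it runs on its own and tests the resulting pattern against a few made-up strings:

```r
regex.escape <- function(string) {
    gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}
x <- c("a[b", "c++", "d()e")
regex <- paste(regex.escape(x), collapse = "|")

# "c+" is not in x, so it must not match once + is escaped.
hits <- grepl(regex, c("xa[bx", "c+", "d()e"))
hits
## [1]  TRUE FALSE  TRUE
```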

Keep in mind that if you use an extracting base R regex function such as regmatches/gregexpr/regexec, the default TRE regex flavor, being a POSIX regex engine, always returns the longest match (i.e. all alternatives are tried and the longest match is returned).
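
The difference between the two engines can be seen with a pair of overlapping alternatives. This is a small sketch with a made-up pattern; only the PCRE result is annotated, and the default-engine result is left for you to inspect:

```r
text <- "abcd"
pattern <- "ab|abcd"

# Default (TRE) engine: described above as returning the longest match.
tre_match <- regmatches(text, regexpr(pattern, text))
tre_match

# PCRE engine: the first alternative that succeeds wins.
pcre_match <- regmatches(text, regexpr(pattern, text, perl = TRUE))
pcre_match
## [1] "ab"
```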

If you use base R regex functions with perl=TRUE or stringr/stringi ICU regex functions, you should read the paragraphs below.

Note that when the alternation you build has nothing on either side of it, you will most probably also want to sort the values by length in descending order first. Regular expression engines search for matches from left to right, and user-defined lists tend to contain items that can match at the same location in the string (values in the vector may start with the same characters), so longer matches may be lost; see Remember That The Regex Engine Is Eager:

sort.by.length.desc <- function (v) v[order( -nchar(v)) ]

So, in case you have x <- c("a[b", "c++", "d()e", "d()ee"), you can just use

x <- c("a[b", "c++", "d()e", "d()ee")
regex <- paste(regex.escape(sort.by.length.desc(x)), collapse="|")
# => d\(\)ee|d\(\)e|a\[b|c\+\+

Note that d\(\)ee precedes d\(\)e.
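
To see why the order matters with perl=TRUE, this sketch repeats both helpers so that it runs on its own and compares the unsorted and sorted alternations on a made-up input:

```r
regex.escape <- function(string) {
    gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}
sort.by.length.desc <- function(v) v[order(-nchar(v))]

x <- c("d()e", "d()ee")
text <- "d()ee"
unsorted <- paste(regex.escape(x), collapse = "|")
sorted <- paste(regex.escape(sort.by.length.desc(x)), collapse = "|")

# Shorter alternative first: PCRE stops at it and drops the final "e".
unsorted_match <- regmatches(text, regexpr(unsorted, text, perl = TRUE))
unsorted_match
## [1] "d()e"

# Longer alternative first: the whole value is matched.
sorted_match <- regmatches(text, regexpr(sorted, text, perl = TRUE))
sorted_match
## [1] "d()ee"
```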

Using a group of alternatives in the middle/start/end of a longer regex

You need to group the alternatives, using a non-capturing group if you do not need to access the group value, or a capturing group if you do. Example using unambiguous word boundaries:

x <- c("a[b", "c++", "d()e", "d()ee")
text <- "aaaa[b,abc++,d()e,d()ee"
regex <- paste0("(?!\\B\\w)(?:", paste(regex.escape(sort.by.length.desc(x)), collapse="|"), ")(?<!\\w\\B)")
## -> (?!\B\w)(?:d\(\)ee|d\(\)e|a\[b|c\+\+)(?<!\w\B) 
unlist(regmatches(text,gregexpr(regex, text, perl=TRUE)))
## => [1] "d()e"  "d()ee"

The pattern now looks like (?!\B\w)(?: + your alternatives + )(?<!\w\B), where the alternatives are placed inside a non-capturing group ((?:d\(\)ee|d\(\)e|a\[b|c\+\+)), the (?!\B\w) part requires a word boundary if the next character is a word character, and the (?<!\w\B) part requires a word boundary if the character immediately on the left is a word character.

Wiktor Stribiżew