Is there any reason why regex patterns have to be represented by strings in R (or, more specifically, stringr)?

This question is not about why I have to escape a dot to match a literal dot with regex, but rather why I have to escape it twice because the pattern is written as a string in R.

For instance, if I want to convert a string "a.b" into "a,b", I need to match a literal dot, which is \. in regex parlance. However, because the regex pattern is entered as a string in R, I need to add one more layer of escaping, giving us "\\.". Is there any reason why regex patterns cannot be entered directly, e.g. as regex(\.)? Perhaps it will all become second nature to me soon; as a beginner I find this slightly confusing.

dufei
  • I wouldn't say a literal dot is . in regex parlance. You also have to escape dots in python, for example. Every regex scheme has special characters that do something in the regex language that have to be escaped if you want to search for their literal value, such as ., *, etc. That's just a part of regular expressions – duckmayr Apr 29 '19 at 09:34
  • In regex `.` is a special character meaning any character, and you need to escape all special characters if you want their literal meaning, hence `\.`. Other examples are `+` and `*`, which you also need to escape for their literal meaning. – Pushpesh Kumar Rajwanshi Apr 29 '19 at 09:36
  • Possible duplicate of [Regular Expression to match a dot](https://stackoverflow.com/questions/13989640/regular-expression-to-match-a-dot) – Pushpesh Kumar Rajwanshi Apr 29 '19 at 09:37
  • Thanks! I was not trying to argue about the "true" way to write regex, but rather wondering if there are technical reasons that prevent us from saving time and space by just writing \. instead of "\\." – dufei Apr 29 '19 at 09:38
  • If you don't want to escape `.` and use them as literal, use `str_replace` instead. In regex they had to choose some character to represent any character and that's a dot, so you know. – Pushpesh Kumar Rajwanshi Apr 29 '19 at 09:41
  • Yes. The technical reasons are that certain characters are reserved for special functionality. In regex, it's very useful to have a character that matches **any** other character. That is usually the `.` – duckmayr Apr 29 '19 at 09:41
  • Python also uses strings for regular expressions and needs to escape twice: `"\\."`. Even in JavaScript it is done this way if you create a new RegExp: `new RegExp("\\.")` – R. Schifini Apr 29 '19 at 11:54

1 Answer

The basic issue is that regular expressions are handled by functions in R; they aren't a built-in part of the language. Building them in would require a change in the way characters are parsed when reading R code. Since regular expressions aren't central to the language, this is seen as an unnecessary complication.

More specifically, for the R parser to handle regex(\.), you'd need a new reserved word (regex), and a whole new parsing mode to be defined, with its own complications. For example, both "" and ")" are legal regular expressions. (Ignore the quotes, just consider the characters within them.) Putting them in your suggested syntax would look like regex() and regex()), so the R parser would have to look ahead when it hit the first ) to know where the regular expression ended. But "))" is also legal, so how would it know where to stop?

Putting regular expressions into strings adds the extra layer of escapes, but at least it doesn't complicate the design of the parser.
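To make the two layers concrete, here is a minimal sketch using only base R functions (`nchar()` and `gsub()`), not stringr. The variable names are made up for illustration. It shows that the string literal "\\." contains just two characters, and that those two characters are what the regex engine actually receives:

```r
# The string literal "\\." is parsed by R into two characters:
# a backslash followed by a dot.
pattern <- "\\."
pattern_length <- nchar(pattern)

# The regex engine then sees \. and matches a literal dot.
comma_version <- gsub(pattern, ",", "a.b")

# An unescaped dot is a regex metacharacter and matches any character.
all_commas <- gsub(".", ",", "a.b")

# fixed = TRUE skips the regex layer entirely, so no escaping is needed.
fixed_version <- gsub(".", ",", "a.b", fixed = TRUE)

print(pattern_length)
print(comma_version)
print(all_commas)
print(fixed_version)
```

If only literal text needs to be matched, the `fixed = TRUE` route avoids both layers of escaping.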

EDITED TO ADD:

As of R 4.0.0, things are better for writing regular expressions because of the new syntax for string literals described in this NEWS entry:

There is a new syntax for specifying raw character constants similar to the one used in C++: r"(...)" with ... any character sequence not containing the sequence )". This makes it easier to write strings that contain backslashes or both single and double quotes. For more details see ?Quotes.

So if you want to enter \., you replace the ... above with exactly what you want, with no escapes necessary:

r"(\.)"

This is parsed the same as "\\.". It's not exactly what you wished for, but it's kind of close.
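As a quick check, the following sketch (assuming R 4.0.0 or later, since raw strings are a syntax error in older versions) compares a raw string with its escaped equivalent and uses it in base R's `gsub()`. The variable names are illustrative only. The last lines use the dash form described in ?Quotes, which lets the string itself contain the sequence )":

```r
# Requires R >= 4.0.0 (raw string syntax).
raw_pattern <- r"(\.)"
escaped_pattern <- "\\."

# Both literals produce the same two-character string.
same_string <- identical(raw_pattern, escaped_pattern)

# The raw string can be passed anywhere a pattern string is expected.
replaced <- gsub(raw_pattern, ",", "a.b")

# Dashes between the quote and the delimiter change the closing
# sequence to )--", so the text may contain )" itself.
tricky <- r"--(a)"b)--"
tricky_length <- nchar(tricky)

print(same_string)
print(replaced)
print(tricky_length)
```

The dash form is also how raw strings sidestep the "where does it end?" problem described above: the writer picks a closing sequence that does not occur in the text.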

user2554330