
I'm trying to learn regex so I can use it in R.

Currently I'm just testing a few text substitution operations, and I looked at some examples on the internet. Then I tried out the operations below:

Making a list of some random words to test out regex operations

mylist <- c("Calendar", "Vinegar", "Character", "Boiler", "Conductor", "Franchisor")

Trying to match the "or" in those words and replace them with "ee" - using the matching expression "^([a-zA-Z]*)or", and replacing the matched result with "\1ee", but it doesn't work:

sub("^([a-zA-Z]*)or","\1ee", mylist)
[1] "Calendar" "Vinegar" "Character" "Boiler" "\001ee" "\001ee"

Trying to match the "or" in those words and replace it with "ee" - using the same matching expression "^([a-zA-Z]*)or", but replacing the matched result with "\\1ee" (double backslash), which gives the expected result:

sub("^([a-zA-Z]*)or","\\1ee", mylist)
[1] "Calendar" "Vinegar" "Character" "Boiler" "Conductee" "Franchisee"

My question is: why do we have to use "\\1" to get backreferencing to work correctly? Isn't a backreference in regex normally written with a single backslash "\" rather than a double "\\"?

I gather from reading sample code and examples on the internet that in R, when you want to use the backslash "\" character, you have to specify it as "\\". Is that the right interpretation in this case?

But doesn't R already recognise "\n" and "\t" as special escaped characters? We can use them directly in a string without any issue, so why not "\1"?

Does that have anything to do with the fact that "^([a-zA-Z]*)or" and "\\1ee" are specified as 2 separate arguments of the function sub? How is the function sub specified in R?

Also, a call to:

sub("^([a-zA-Z]*)or","\1ee", mylist)

produces

[1] "Calendar" "Vinegar" "Character" "Boiler" "\001ee" "\001ee"

How come it produces that "\001ee"? Why did "\1" come out as "\001" if R was treating it as plain text? Does "\1" have any special meaning in R?
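
One way to probe this is to inspect what the string "\1" actually contains. The snippet below is only a minimal sketch, assuming a standard R session with no extra packages; the variable name `x` is just for illustration. It suggests that a single backslash followed by a digit is read as an octal escape, so the string holds one control character and no backslash at all.

```r
# Assumption: plain base R, nothing beyond the default session.
x <- "\1"                            # single backslash followed by a digit

identical(x, "\001")                 # TRUE: both spell the character with code 1
identical(x, rawToChar(as.raw(1)))   # TRUE: the same single control character
nchar(x)                             # 1: the string contains no backslash

# print() re-escapes non-printable characters when displaying a string,
# which would explain why the result above is shown as "\001ee".
print(x)
```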

[Edit] Thanks Wiktor for explaining the requirement for the literal "\". But can anyone please also explain the other questions in my post? That's why it is not an exact duplicate of the "how-to-escape-backslashes-in-r-string" topic.

J Henkinson
  • When you define a backreference the ``\`` must be a *literal* ``\``. A literal ``\`` is defined in a string literal with double backslashes. If you use a single one, `"\1"`, the value is a char with an octal value of 1, not a backreference at all. – Wiktor Stribiżew Mar 31 '17 at 06:54
  • Thanks for the comment, Wiktor, and thanks for the link too, but I just came from the exact discussion you are pointing to. It still doesn't answer my other questions, though, especially the part about why R returned "\001"; I find that quite confusing. – J Henkinson Mar 31 '17 at 06:58
  • So, there was no point asking. Read about [string literals](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html). – Wiktor Stribiżew Mar 31 '17 at 06:59
  • Sorry, I accidentally pressed Enter on the comment before I finished it. – J Henkinson Mar 31 '17 at 07:00
  • I understand, but the main point of your question is usage of backreferences in a regex. To use a backreference in a regex, you must use a literal backslash and a number denoting the capturing group ID. The `\001` part is not related to the problem, in fact. – Wiktor Stribiżew Mar 31 '17 at 07:07
  • Uhmmm, so you think my question was too vague in scope? Should I start a question that asks about that "\001" thing and tag it with "r" only rather than regex? – J Henkinson Mar 31 '17 at 07:10
  • I am sure there are other answers here on SO that already dwell upon that. See [this one](http://stackoverflow.com/questions/19333754/print-backslash-in-r-strings), or [this](http://stackoverflow.com/questions/14185287/escaping-in-string-or-paths-in-r). – Wiktor Stribiżew Mar 31 '17 at 08:57
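
The point made in the comments can be checked with a short sketch. It assumes only base R; the names `single`, `double`, and `result` are hypothetical and introduced here for illustration. It compares the two replacement strings character by character and then reruns the substitution from the question with the double-backslash form.

```r
# Assumption: base R only; single/double/result are illustrative names.
single <- "\1ee"    # parsed as an octal escape: character code 1, then "ee"
double <- "\\1ee"   # parsed as a literal backslash, "1", then "ee"

nchar(single)            # 3 characters
nchar(double)            # 4 characters
utf8ToInt(single)[1]     # 1: a control character, not a backslash
utf8ToInt(double)[1:2]   # 92 49: backslash followed by the digit "1"

mylist <- c("Calendar", "Vinegar", "Character", "Boiler",
            "Conductor", "Franchisor")

# Only the string that really contains a backslash reaches the regex
# engine as a backreference to capture group 1.
result <- sub("^([a-zA-Z]*)or", double, mylist)
print(result)
```

Only the last two words contain "or", so the other four entries should come back unchanged, matching the output shown in the question.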
