Let's take the following string:

x <- " hello world"

I would like to extract the first word. To do so, I am using the following regex ^\\W*([a-zA-Z]+).* with a back-reference to the first group.

> gsub("^\\W*([a-zA-Z]+).*", "\\1", x)
[1] "hello"

It works as expected.

Now, let's add a digit and underscore to our string:

x <- " 0_hello world"

I replace \\W with [\\W_0-9] to match the new characters.

> gsub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x)
[1] " 0_hello world"

Now it doesn't work, and I do not understand why. The problem seems to arise when \\W is placed inside [], but I am not sure why. The regex does work in an online regex tester using PCRE, though.

What am I doing wrong?

Junitar
  • Try with `sub("^[ 0-9]*([a-zA-Z]+).*", "\\1", x)` – akrun Mar 03 '19 at 11:50
  • @akrun, the reason I used the `\\W` metacharacter instead of a simple space is that I am dealing with many strings starting with various characters, including punctuation marks, spaces, digits and underscores. – Junitar Mar 03 '19 at 11:57
  • Use `perl = TRUE`, i.e. `sub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x, perl = TRUE)`. Here I am using `sub` because we only need to match once. – akrun Mar 03 '19 at 11:59
  • @mt1022 `perl=TRUE` fixed the problem. Can you post it as an answer so I can accept it? Thanks. – Junitar Mar 03 '19 at 11:59
  • Adding `perl=TRUE` is not a full solution. You need to add `(?s)` before the regex since in TRE, `.` matches any char while in PCRE it matches any char but line break chars. – Wiktor Stribiżew Mar 03 '19 at 14:47
  • @Wiktor Stribiżew Thanks for the additional info. – Junitar Mar 03 '19 at 17:16

1 Answer

The quick solution is to switch to Perl-compatible regular expressions by adding the argument perl = TRUE.
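
As a minimal sketch of that fix, using the string from the question: `sub` is used instead of `gsub` because the pattern only needs to match once, and the `(?s)` prefix follows the suggestion in the comments so that `.` also matches line breaks under PCRE. The variable name `first_word` is only illustrative.

```r
x <- " 0_hello world"

# perl = TRUE switches to the PCRE engine, where \W is understood inside [...].
# (?s) makes . match line breaks too, which matters only for multi-line input.
first_word <- sub("(?s)^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x, perl = TRUE)
print(first_word)
# [1] "hello"
```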

By default, grep and related functions such as sub and gsub use Extended Regular Expressions (see ?regex), where character classes are written in the form [:xxx:]. However, I could not find a character class that matches \W exactly.
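
The output in the question suggests that the default engine does not treat \\W as a shorthand class when it appears inside [ ]. If perl = TRUE is not an option, one possible workaround is to avoid \\W altogether. The sketch below assumes the goal is simply to skip every leading character that is not a letter, so it uses a negated bracket expression with the POSIX class [:alpha:]. The third test string and the name `first_words` are made up for illustration.

```r
x <- c(" hello world", " 0_hello world", "...(42)_foo bar")

# [^[:alpha:]] is a negated bracket expression: any character that is not a letter.
# It covers spaces, digits, underscores and punctuation without needing \W.
first_words <- sub("^[^[:alpha:]]*([[:alpha:]]+).*", "\\1", x)
print(first_words)
# [1] "hello" "hello" "foo"
```

Note that [:alpha:] depends on the locale, so it may also accept accented letters that [a-zA-Z] would not.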

mt1022