R - regex: W metacharacter not working when within square brackets

Question

Let's take the following string:

x <- " hello world"

I would like to extract the first word. To do so, I am using the following regex ^\\W*([a-zA-Z]+).* with a back-reference to the first group.

> gsub("^\\W*([a-zA-Z]+).*", "\\1", x)
[1] "hello"

It works as expected.

Now, let's add a digit and underscore to our string:

x <- " 0_hello world"

I replace \\W with [\\W_0-9] to match the new characters.

> gsub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x)
[1] " 0_hello world"

Now it no longer works, and I do not understand why. The problem seems to arise when \\W is placed inside [], but I am not sure why. The same regex does work in an online regex tester that uses PCRE.

What am I doing wrong?

@akrun, the reason why I used the W metacharacter instead of a simple space is because I am dealing with many strings starting with various characters including punctuation marks, space, digits and underscore. — Junitar, Mar 03 '19 at 11:57
Use `perl = TRUE`, i.e. `sub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x, perl = TRUE)`. Here I am using `sub` because we only need to match once. — akrun, Mar 03 '19 at 11:59
@mt1022 `perl=TRUE` fixed the problem. Can you post it as an answer so I can accept it? Thanks. — Junitar, Mar 03 '19 at 11:59
Adding `perl=TRUE` is not a full solution. You need to add `(?s)` before the regex since in TRE, `.` matches any char while in PCRE it matches any char but line break chars. — Wiktor Stribiżew, Mar 03 '19 at 14:47

score 0 · Accepted Answer · answered Mar 03 '19 at 12:09

The quick solution is to switch to Perl-compatible regular expressions by passing the additional argument perl = TRUE.
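
As a minimal sketch of that fix, using the string from the question: the second half uses a hypothetical multi-line input (`x_multi`, not from the question) to illustrate the point raised in the comments that `.` does not match line breaks under PCRE unless `(?s)` is added.

```r
x <- " 0_hello world"

# With perl = TRUE, \W is understood inside [...], so the leading " 0_" is skipped.
first_word <- sub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x, perl = TRUE)
print(first_word)

# Hypothetical multi-line input, only to show the effect of (?s).
x_multi <- " 0_hello\nworld"

# Without (?s), "." stops at the line break, so "\nworld" is left in place.
without_dotall <- sub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x_multi, perl = TRUE)

# With (?s), "." also matches the line break and the rest of the string is consumed.
with_dotall <- sub("(?s)^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x_multi, perl = TRUE)

print(without_dotall)
print(with_dotall)
```

If the inputs never contain line breaks, the plain `perl = TRUE` call should be enough.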

By default, grep and the related functions (sub, gsub, regexpr) use extended regular expressions (see ?regex), where character classes are written in the form [:xxx:]. However, I could not find a character class that matches \W exactly.
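
One possible workaround that stays with the default engine, offered as a sketch rather than a definitive answer: as far as I can tell, ?regex describes \W as the negation of \w, i.e. `[^[:alnum:]_]`. Adding underscore and digits to that set leaves "anything that is not a letter", which can be written as a single negated bracket expression. The variable names below are illustrative only.

```r
x <- " 0_hello world"

# Skip every leading non-letter, then capture the first run of letters.
# [^[:alpha:]] stands in for what [\W_0-9] was meant to match.
posix_result <- sub("^[^[:alpha:]]*([[:alpha:]]+).*", "\\1", x)

# ASCII-only variant, closer to the [a-zA-Z] group used in the question.
ascii_result <- sub("^[^a-zA-Z]*([a-zA-Z]+).*", "\\1", x)

print(posix_result)
print(ascii_result)
```

Note that `[[:alpha:]]` is locale-dependent and may also match accented letters, whereas `[a-zA-Z]` is limited to ASCII letters, so the two variants can differ on non-English input.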
