
I have a string like this:

vect <- c("Thin lines are not great, I am in !!! AND You shouldn't be late OR you loose")

I want to replace "in" with "%in%", "AND" with "&", and "OR" with "|".

I know this can be done using gsub like below:

gsub("\\bin\\b","%in%", vect)

But that needs a separate call for each replacement, so I chose to use gsubfn instead.
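For comparison, the one-call-per-replacement approach can be folded into a single helper in base R. This is only a sketch: `replace_words` and `repl` are hypothetical names introduced here, and it assumes the search words contain no regex metacharacters.

```r
vect <- c("Thin lines are not great, I am in !!! AND You shouldn't be late OR you loose")

# Named vector: names are the whole words to find, values are the replacements.
repl <- c("in" = "%in%", "AND" = "&", "OR" = "|")

# Apply one gsub() per entry, feeding each result into the next call.
replace_words <- function(x, repl) {
  Reduce(
    function(acc, word) gsub(paste0("\\b", word, "\\b"), repl[[word]], acc),
    names(repl),
    init = x
  )
}

result <- replace_words(vect, repl)
print(result)
```

Because the replacements run one after another, text inserted by an earlier step could in principle be matched by a later pattern; that does not happen with these three words.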

So I tried:

gsubfn("\\bin\\b|\\bAND\\b|\\bOR\\b", list("in"="%in%", "AND"= "&", "OR"="|"), vect)

But it returns the string unchanged; for some reason \\b is not working here. However, \\b works fine with gsub, and I am able to make all three replacements by piping gsub calls together.

My question is: why is \\b not working inside gsubfn? What am I missing in my regex?

Please help.

Output should be:

"Thin lines are not great, I am %in% !!! & You shouldn't be late | you loose"

This works:

gsubfn("\\w+", list("in"="%in%", "AND"= "&", "OR"="|"), vect)
PKumar

2 Answers


By default, the Tcl regex engine is used; see the gsubfn docs:

If the R installation has tcltk capability then the tcl engine is used unless FUN is a proto object or perl=TRUE in which case the "R" engine is used (regardless of the setting of this argument).

So, in the Tcl engine, word boundaries are written as \y rather than \b:

> gsubfn("\\y(in|AND|OR)\\y", list("in"="%in%", "AND"= "&", "OR"="|"), vect)
[1] "Thin lines are not great, I am %in% !!! & You shouldn't be late | you loose"

Another way is to use \m as a leading word boundary and \M as a trailing word boundary:

> gsubfn("\\m(in|AND|OR)\\M", list("in"="%in%", "AND"= "&", "OR"="|"), vect)
[1] "Thin lines are not great, I am %in% !!! & You shouldn't be late | you loose"

Alternatively, you may pass perl=TRUE and use \b:

> gsubfn("\\b(in|AND|OR)\\b", list("in"="%in%", "AND"= "&", "OR"="|"), vect, perl=TRUE)
[1] "Thin lines are not great, I am %in% !!! & You shouldn't be late | you loose"
Wiktor Stribiżew
  • Any idea why `gsub` works without the perl=T condition? Thanks for the help, your answers are always great. – PKumar Dec 16 '17 at 14:09
  • @PKumar `gsub` uses the TRE regex engine, not Tcl, by default, and the version of TRE for R implements both `\b` (a plain word boundary) and the pair `\<` (leading word boundary) and `\>` (trailing word boundary). – Wiktor Stribiżew Dec 16 '17 at 14:11
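The two boundary styles mentioned in the comment above can be compared directly with base `gsub`. A minimal sketch, assuming the default engine (no `perl = TRUE`); `with_b` and `with_angle` are names made up for this example:

```r
vect <- "Thin lines are not great, I am in !!! AND You shouldn't be late OR you loose"

# \\b on both sides: matches at either edge of a word.
with_b <- gsub("\\bin\\b", "%in%", vect)

# \\< and \\> mark the start and the end of a word explicitly.
with_angle <- gsub("\\<in\\>", "%in%", vect)

print(with_b)
print(identical(with_b, with_angle))
```

Both calls leave the "in" inside "Thin" and "lines" alone and replace only the standalone word.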

Add perl = T; that should do it.

gsubfn("\\bin\\b|\\bAND\\b|\\bOR\\b", list("in"="%in%", "AND"= "&", "OR"="|"), vect, perl =T)

Output

[1] "Thin lines are not great, I am %in% !!! & You shouldn't be late | you loose"

From the gsub documentation:

The POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).

And from the gsubfn documentation:

... Other gsub arguments.

This doesn't explain why gsub works fine without the perl argument while gsubfn needs perl=T.

SamFlynn