Regex expression failing in R

Question

I have come up with the following regular expression to recognise a gap of multiple spaces if it is preceded by a gap and a pattern before it, and I have confirmed that it works on regexr.com.

Pattern is:

(?=\s{2,}.\s{2,})\s{2,}

But when I use it in R within gsub() it seems to fail, even with the backslashes escaped:

exampleText = "1  Building  Apartment  City"
gsub("\\(?=\\s{2,}.\\s{2,}\\)\\s{2,}",",",exampleText)

Hoping to get the following output:

"1 Building,Apartment,City"

The regular expression is meant to match only if there is a double space or greater on either side of a string.

Getting the error "invalid regular expression"

If I copy/paste that code, I don't get that error -- i just get an empty result. Does it depend on the value of `exampleText`? Please make sure your example is [reproducible ](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used for testing. — MrFlick, Aug 05 '21 at 05:35
At a guess you want to remove the escapes from the lookahead and add the argument `perl = TRUE` - `grep("(?=\\s{2,}.\\s{2,})\\s{2,}","replace text",exampleText, perl = TRUE)`. — Ritchie Sacramento, Aug 05 '21 at 05:58
Thanks for the feedback, I've updated the question with the values. Hope that helps further explain — PDogg95, Aug 05 '21 at 23:31
`(?=...)` is *lookahead*, which should go after some other pattern; do you mean `(?<=...)` for lookbehind (as well as adding `perl=TRUE`, as has been suggested)? — r2evans, Aug 13 '21 at 12:55

score 0 · Answer 1 · answered Aug 13 '21 at 13:00

R's implementation of PCRE does not allow variable-length repetition quantifiers (e.g., {2,}) inside lookbehind assertions (lookahead is not restricted this way); from ?regex:

     Patterns '(?=...)' and '(?!...)' are zero-width positive and
     negative lookahead _assertions_: they match if an attempt to match
     the '...' forward from the current position would succeed (or
     not), but use up no characters in the string being processed.
     Patterns '(?<=...)' and '(?<!...)' are the lookbehind equivalents:
     they do not allow repetition quantifiers nor '\C' in '...'.

(the last line). For instance, we'll see:

gsub("(?<=\\s{2,})Quux", ",", exampleText, perl=TRUE)
# Warning in gsub("(?<=\\s{2,})Quux", ",", exampleText, perl = TRUE) :
#   PCRE pattern compilation error
#   'lookbehind assertion is not fixed length'
#   at '(?<=\s{2,})Quux'

but no such error if we change to "(?<=\\s{2})". As such, your lookbehind expressions need to be fixed-width.
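
One way to check which patterns the PCRE engine accepts is to try compiling each one and catch the failure. This is a minimal sketch: `compiles` is a hypothetical helper written for illustration, not a base R function, and the test patterns are made up for the check.

```r
# Hypothetical helper: TRUE if the pattern compiles under perl = TRUE,
# FALSE if R signals a warning or error while compiling it.
compiles <- function(pattern) {
  tryCatch({
    grepl(pattern, "a", perl = TRUE)
    TRUE
  },
  warning = function(w) FALSE,
  error   = function(e) FALSE)
}

compiles("(?=\\s{2,})x")   # variable-length lookahead
# [1] TRUE
compiles("(?<=\\s{2,})x")  # variable-length (unbounded) lookbehind
# [1] FALSE
compiles("(?<=\\s{2})x")   # fixed-width lookbehind
# [1] TRUE
```

Only the unbounded lookbehind is rejected, which matches the last line of the ?regex excerpt.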

Two suggestions, both of which insert the commas in the right places:

txt <- gsub("(?<=\\s{2})(\\S*)(?=\\s{2})", "\\1,", exampleText, perl=TRUE)
txt <- gsub("(?<=\\s\\s)(\\S*)(?=\\s\\s)", "\\1,", exampleText, perl=TRUE)
txt
# [1] "1  Building,  Apartment,  City"

You can fix the multi-spaces with a couple more patterns, if needed:

gsub("\\s+", " ", gsub(",  ", ",", txt))
# [1] "1 Building,Apartment,City"
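
If lookarounds are not a requirement, the same output can probably be reached by splitting on the gaps and re-joining the pieces. This is a sketch under the assumption (taken from the desired output in the question) that the first gap should become a single space and every later gap a comma; `parts` and `result` are illustrative names.

```r
exampleText <- "1  Building  Apartment  City"

# Split on runs of two or more whitespace characters
parts <- strsplit(exampleText, "\\s{2,}")[[1]]

# Join the first field to the rest with a space, and the rest with commas
result <- paste(parts[1], paste(parts[-1], collapse = ","))
result
# [1] "1 Building,Apartment,City"
```

This avoids `perl = TRUE` entirely, since the pattern uses no lookaround.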

Since it looks as if you are creating comma-delimited text, though, most CSV readers can optionally discard the surrounding whitespace:

txt
# [1] "1  Building,  Apartment,  City"
str(read.csv(text = txt, header = FALSE, strip.white = TRUE))
# 'data.frame': 1 obs. of  3 variables:
#  $ V1: chr "1  Building"
#  $ V2: chr "Apartment"
#  $ V3: chr "City"
