
I'm trying a regex lookahead in R with the following command:

 sub(x = street.addresses, pattern = "\\s((?i)Street|(?i)St\\.?)(?=\\sNE)", replacement = " St")

My goal is to replace Street with St where it's followed by a space and the directional NE (as in, "Northeast"). It seems like the lookahead couldn't be more straightforward but I keep hitting an error:

Error in sub(x = streets, pattern = "\\s((?i)Street|(?i)St\\.?)(?=\\sNE)",: 
invalid regular expression '\s((?i)Street|(?i)St\.?)(?=\sNE)', reason 
'Invalid regexp' 

Versions of this without the lookahead work fine in R, but as soon as I add a lookahead of any sort to my search/replace, I hit the error. Likewise, other regex R functions like grep seem to have the same problem.

I've copied and pasted that regex into engines like https://regex101.com/ and it works fine there, so I'm confused. Am I missing something basic about regex in R?

EDIT:

Here's a copy direct from my console:

> street.addresses <- c("23 Charles Street NE","23 Charles St. NE")
> new.vec <- sub(x = street.addresses, pattern = "\\s((?i)Street|(?i)St\\.?)(?=\\sNE)", replacement = " St")
Error in sub(x = street.addresses, pattern = "\\s((?i)Street|(?i)St\\.?)(?=\\sNE)",  : 
  invalid regular expression '\s((?i)Street|(?i)St\.?)(?=\sNE)', reason 'Invalid regexp'
Brandon
  • You should include a sample input vector and the appropriate output [to make your example reproducible](https://stackoverflow.com/questions/5963269/ddg#5963610) – alistaire Jan 07 '18 at 03:16

2 Answers


You need to use sub in Perl mode if you want to use a lookahead:

street <- "123 Hudson Street NE, New York, NY"
sub(x = street, pattern = "\\s((?i)Street|(?i)St\\.?)(?=\\sNE)",
    replacement = " St", perl=TRUE)

[1] "123 Hudson St NE, New York, NY"


By the way, if you put the parameters to sub in their default positions, then you can omit the names, leaving us with a more terse call:

sub("\\s((?i)Street|(?i)St\\.?)(?=\\sNE)", " St", street, perl=TRUE)
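As a sketch of how this behaves on a whole vector, the example below reuses the two addresses from the question and adds a third, made-up "SW" address as a negative case. It also swaps the repeated `(?i)` flags for `ignore.case = TRUE`, which is an alternative and not what the question used; note that this makes the `NE` in the lookahead case-insensitive as well.

```r
# Two addresses from the question plus one hypothetical "SW" address
# that should be left untouched.
street.addresses <- c("23 Charles Street NE",
                      "23 Charles St. NE",
                      "5 Main Street SW")

# perl = TRUE enables the lookahead; ignore.case = TRUE stands in for (?i).
result <- sub("\\s(Street|St\\.?)(?=\\sNE)", " St", street.addresses,
              perl = TRUE, ignore.case = TRUE)

print(result)
# [1] "23 Charles St NE" "23 Charles St NE" "5 Main Street SW"
```

Because the lookahead only checks for the trailing " NE" without consuming it, the replacement string does not need to repeat it.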
Tim Biegeleisen

Actually, you do not need a positive lookahead if the below is what you want:

street = c("2389 E. Myronga Street NE")
sub(x = street, pattern = "\\s((?i)Street|(?i)St\\.?)\\sNE", replacement = " St NE")

Output:

[1] "2389 E. Myronga St NE"

However, you can use lookarounds (and other Perl-compatible regex (PCRE) functionality) if you pass perl = TRUE as an additional argument:

sub(x = street, pattern = "\\s((?i)Street|(?i)St\\.?)(?=\\sNE)", replacement = " St", perl=TRUE)

The reason for this difference is that R uses two types of regular expressions: extended regular expressions (the default), which do not support lookarounds, and Perl-like regular expressions, enabled with perl = TRUE (R doc, see also regular-expressions.info/rlanguage).
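The difference between the two engines can be checked directly. The following is a minimal sketch that runs the question's pattern through both; the variable names and the `"error"` sentinel returned by the handler are illustrative choices, not part of the original posts.

```r
pattern <- "\\s((?i)Street|(?i)St\\.?)(?=\\sNE)"
street  <- "23 Charles Street NE"

# Default engine: the lookahead is rejected, so catch the error
# and return a sentinel string instead of stopping.
default_result <- tryCatch(
  sub(pattern, " St", street),
  error = function(e) "error"
)

# PCRE engine: the same pattern compiles and the lookahead is honoured.
perl_result <- sub(pattern, " St", street, perl = TRUE)

print(default_result)
print(perl_result)
```

The same `perl = TRUE` argument is accepted by `grep`, `grepl`, `gsub`, `regexpr` and `gregexpr`, so the fix carries over to the other functions mentioned in the question.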

wp78de
  • Hmmm, it works but I don't know *how*. How does the function know I don't want to replace \sNE with my replacement string as well as the rest of the matching expression? – Brandon Jan 07 '18 at 03:33
  • It's indeed confusing. I have added more context to explain this. – wp78de Jan 07 '18 at 03:44
  • @Brandon the core of this expression is `\\s(Street|St)\\sNE`. `sub` replaces the entire match, not just the capturing group, so without a lookahead the `\\sNE` is consumed as well and has to be put back by the replacement string. `(?i)` just means "case insensitive", so "street/Street" will both be matched. – Mako212 Jan 07 '18 at 03:51
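
To make the point about what gets replaced concrete: `sub` substitutes the entire matched text, not only the capturing group. The sketch below uses the default engine and, for brevity, leaves out the `(?i)` flags; the second capturing group and the `\\2` backreference are one possible way to keep the directional without a lookahead, offered here as an illustration.

```r
street <- "2389 E. Myronga Street NE"

# The match covers " Street NE", so all of it is replaced by " St".
consumed <- sub("\\s(Street|St\\.?)\\sNE", " St", street)

# Capture the trailing " NE" and put it back with a backreference.
restored <- sub("\\s(Street|St\\.?)(\\sNE)", " St\\2", street)

print(consumed)
# [1] "2389 E. Myronga St"
print(restored)
# [1] "2389 E. Myronga St NE"
```

A lookahead with `perl = TRUE` gives the same result as the backreference version, because the " NE" is then never part of the match in the first place.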