I have a string in which I'm trying to replace the first matching pattern with its corresponding replacement. For example, in the code below: if bb is found first, replace it with foo and don't replace anything else; if cc is found first, replace it with bar and don't replace anything else.

This behaves almost as desired, except the replacement argument is not interpreted as a regex, but as a whole string. (But the pattern argument is seen as a regex, as required).

stri_replace_first_regex(
  c(" bb cc bb cc "," cc bb cc bb ", " aa bb cc "), 
  pattern = " bb | cc ", 
  replacement = " foo | bar ")

Outputs: " foo | bar cc bb cc " " foo | bar bb cc bb " " aa foo | bar cc "

whereas I want it to output " foo cc bb cc " " bar bb cc bb " " aa foo cc "

Any idea how to solve this?

Thanks.

More context :

My inputs can have almost any formatting: they are postal addresses entered by customers, in which I need to replace the street type with a standardized form (for instance, turn street into st, road into rd and avenue into av). Any of those words can appear again (e.g. 20 bis road of sesame street), so I treat only the first appearance as valid; subsequent appearances of a word from the pattern list must not be replaced.

François M.
  • A replacement pattern cannot contain a regex pattern. Is it a literal `bb` and `cc` or just pattern placeholders? I guess they are just placeholders here. – Wiktor Stribiżew May 30 '16 at 14:34
  • I'm not sure I understand your question: I have postal addresses, and I want to replace the word indicating the type of street with something standardized: `1 road of whatever road` to `1 rd of whatever road`, `1 street of whatever street` to `1 st of whatever street`, and `1 street of whatever road` to `1 st of whatever road`. So my two regexes would be `pattern = " street | road "` and `replacement = " st | rd "`. I hope this answers your question. – François M. May 30 '16 at 14:42
  • `v <- Vectorize(sub); v(c('bb', 'cc'), c('foo', 'bar'), c(" bb cc bb cc "," cc bb cc bb "))` – rawr May 30 '16 at 14:57

2 Answers

You can use qdap library's mgsub for these replacements:

> library(qdap)
> input <- c("1 road of whatever road", "1 street of whatever street")
> pattern <- c("^(.*?)\\bstreet\\b", "^(.*?)\\broad\\b")
> replacement <- c("\\1st", "\\1rd")
> mgsub(pattern, replacement, input, fixed=FALSE, perl=TRUE)
[1] "1 rd of whatever road"   "1 st of whatever street"

Each pattern consists of ^ (start of string), then (.*?), a capturing group that lazily matches any characters other than a newline, as few as possible, up to the first occurrence of the whole word street or road. The \b word boundaries ensure that only whole words match.

Each replacement string contains a backreference (\\1) that restores the text captured by the group, followed by the standardized word that replaces the match.
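
If only the street type that appears first in the string should be replaced, whichever word it is, a base-R sketch (no qdap or Java required) is to locate the first match of a single alternation and look its replacement up in a named vector. The helper name `replace_first_word` and the `street_types` lookup below are hypothetical, and the sketch assumes the lookup names are plain lowercase words with no regex metacharacters.

```r
# Hypothetical lookup: names are the words to find, values are their replacements.
street_types <- c(street = "st", road = "rd", avenue = "av")

replace_first_word <- function(x, lookup) {
  # Build one alternation of all words, bounded by word boundaries.
  pattern <- paste0("\\b(", paste(names(lookup), collapse = "|"), ")\\b")
  # regexpr() reports only the first match in each string (-1 if none).
  m <- regexpr(pattern, x, perl = TRUE)
  if (any(m != -1, na.rm = TRUE)) {
    # Swap each matched word for its entry in the lookup.
    regmatches(x, m) <- unname(lookup[regmatches(x, m)])
  }
  x
}

addresses <- c("1 street of whatever road", "20 bis road of sesame street")
replace_first_word(addresses, street_types)
# [1] "1 st of whatever road"      "20 bis rd of sesame street"
```

Because the match is located once per string, later occurrences of any word in the lookup are left untouched.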

Wiktor Stribiżew
  • Installing the package right now to try this. However, if I trust your output, this replaces everything and not just the first word : `"1 rd of whatever RD" "1 st of whatever ST"` while I need `"1 rd of whatever ROAD" "1 st of whatever STREET"` (upper case is for emphasis only) – François M. May 30 '16 at 14:57
  • @fmalaussena there is probably an `msub` function in that package, use that instead – rawr May 30 '16 at 14:58
  • I updated the answer to show how to replace the first whole word only. NOTE that in case your strings can contain newlines, add `(?s)` at the beginning of the patterns: `pattern = c("(?s)^(.*?)\\bstreet\\b","(?s)^(.*?)\\broad\\b")` – Wiktor Stribiżew May 30 '16 at 15:06
  • I guess this would work, but unfortunately, there seems to be a problem with Java and/or rJava, so I can't get qdap to install... Thanks anyway. – François M. May 30 '16 at 17:05
  • Check http://stackoverflow.com/questions/3311940/r-rjava-package-install-failing, also https://github.com/trinker/qdap/issues/186. – Wiktor Stribiżew May 30 '16 at 17:10

Read ?stringi::stri_replace_first_regex; pattern and replacement are vectorized, so if you pass them a vector of strings, each pattern will be replaced with the respective replacement:

stringi::stri_replace_first_regex(
    c(" bb cc bb cc "," cc bb cc bb "), 
    pattern = c("bb", "cc"), 
    replacement = c("foo", "bar"))
# [1] " foo cc bb cc " " bar bb cc bb "
alistaire
  • Try `stri_replace_first_regex( c("cc bb cc "," cc bb cc bb "), pattern = c("bb", "cc"), replacement = c("foo", "bar"))` and you will see that it doesn't behave as desired, as the output is `" cc foo cc " " bar bb cc bb " ` and not `" foo bb cc " " bar bb cc bb"` – François M. May 30 '16 at 15:11
  • Well yeah, it iterates over the patterns, because if it tries to match both strings at once they may overlap. You could integrate the start of line `^` into your regex if you know where it is, or this is going to be a pain. More context may illuminate more options, though. – alistaire May 30 '16 at 15:20
  • BTW: consider using the `vectorize_all` argument of `stringi::stri_replace_all_regex` – gagolews Jun 01 '16 at 09:03