How can I replace a word unless it immediately follows another word? For example, in the vector vec below, how can I replace 'the' in the third element with 'of the'?

The rule for the example below: replace 'the' unless it comes immediately after 'of'.

vec <- c("time of the day", "word of the day", "time the day")

# This also replaces the 'the' when following 'of'
gsub("the", "of the", vec)
# "time of of the day" "word of of the day" "time of the day" 

The expected outcome is c("time of the day", "word of the day", "time of the day")

user2957945
  • See https://stackoverflow.com/questions/29639562/r-skipfail-for-multiple-patterns and https://stackoverflow.com/questions/47287204/ignore-part-of-a-string-when-splitting-using-regular-expression-in-r – The fourth bird Jun 18 '21 at 11:35
  • `sub('(?<!of) the', ' of the', vec, perl = TRUE)` ? – Ronak Shah Jun 18 '21 at 11:37

1 Answer

If your strings always only contain a single space between words, you may use

# Base R (PCRE engine, enabled with perl=TRUE)
gsub("\\b(?<!\\bof\\s)the\\b", "of the", vec, perl=TRUE)
# stringr (ICU engine)
library(stringr)
str_replace_all(vec, "\\b(?<!\\bof\\s)the\\b", "of the")

See the regex demo. The whole word 'the' is replaced with 'of the' only if it is NOT immediately preceded by the whole word 'of' followed by exactly one whitespace character.

However, there are a lot of scenarios when users type more than one space between words.
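
As a minimal sketch of that failure mode, the snippet below runs a single-space lookbehind of this kind on an input with two spaces between 'of' and 'the'. The names vec2 and res_single and the sample strings are made up for illustration; they are not part of the question.

```r
# Hypothetical input: the first element has two spaces between "of" and "the"
vec2 <- c("time of  the day", "time the day")

# The lookbehind inspects "of" plus exactly one whitespace character,
# so the "the" that follows the double space is not protected
res_single <- gsub("\\b(?<!\\bof\\s)the\\b", "of the", vec2, perl = TRUE)
print(res_single)
# "time of  of the day" "time of the day"
```

The first element ends up with a doubled 'of', which is the case a fixed single-space lookbehind cannot handle.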

Hence, a more generic solution is

> gsub("\\bof\\s+the\\b(*SKIP)(?!)|\\bthe\\b", "of the", vec, perl=TRUE)
[1] "time of the day" "word of the day" "time of the day"

See the regex demo and the R demo online.

Details:

  • \bof\s+the\b - matches of, one or more whitespace characters, and the as whole words
  • (*SKIP)(?!) - discards the text matched so far and forces a failure, so the regex engine goes on to search for the next match from the failure position
  • | - or
  • \bthe\b - matches the whole word in any other context.
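
The behaviour described in the list above can be checked with a small sketch. It assumes the left branch uses \\s+ between of and the, so that any run of whitespace is covered; vec3 and res_skip are hypothetical names, and the sample strings are invented for the test.

```r
# Hypothetical inputs: several spaces after "of", and "the" embedded in "other"
vec3 <- c("time of   the day", "time the day", "other times")

# Left branch: consume "of", a whitespace run and "the", then discard it via (*SKIP)(?!)
# Right branch: any remaining whole-word "the" is replaced
res_skip <- gsub("\\bof\\s+the\\b(*SKIP)(?!)|\\bthe\\b", "of the", vec3, perl = TRUE)
print(res_skip)
# "time of   the day" "time of the day" "other times"
```

The first element is left alone despite the extra spaces, and 'other' is untouched because the word boundaries prevent a match inside a longer word.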

If the number of whitespace characters between of and the is bounded, say 1 to 100, you can use a stringr-based solution like

library(stringr)
vec <- c("time of the day", "word of the day", "time the day")
str_replace_all(vec, "\\b(?<!\\bof\\s{1,100})the\\b", "of the")
## => [1] "time of the day" "word of the day" "time of the day"

See this online R demo. The ICU regex flavor used by the stringr regex functions allows limiting (bounded) quantifiers inside lookbehind patterns.

See this regex demo (it uses the Java 8 option, since that flavor also supports constrained-width lookbehind patterns). Details:

  • \b - a word boundary
  • (?<!\bof\s{1,100}) - a negative lookbehind that fails the match if there is a whole word of followed by 1 to 100 whitespace chars immediately before the current location
  • the - the literal string the
  • \b - a word boundary.
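
A short sketch of the bounded lookbehind on an input with several spaces, assuming the stringr package is installed. The names vec4 and res_icu and the sample strings are hypothetical and only serve the illustration.

```r
library(stringr)

# Hypothetical input: four spaces between "of" and "the" in the first element
vec4 <- c("time of    the day", "time the day")

# The lookbehind accepts 1 to 100 whitespace characters after "of",
# so the first element is protected and only the second is changed
res_icu <- str_replace_all(vec4, "\\b(?<!\\bof\\s{1,100})the\\b", "of the")
print(res_icu)
# "time of    the day" "time of the day"
```

The upper bound of 100 is arbitrary; it only needs to exceed the longest whitespace run expected in the data.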
Wiktor Stribiżew