If the words in your strings are always separated by exactly one space, you may use
gsub("(?<!\\bof\\s)\\bthe\\b", "of the", vec, perl=TRUE)
# or, with stringr:
library(stringr)
str_replace_all(vec, "(?<!\\bof\\s)\\bthe\\b", "of the")
See the regex demo. The `the` whole word is replaced with `of the` only if it is NOT immediately preceded by the whole word `of` followed by a single whitespace character.
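As a quick check of the single-space behavior, here is a minimal sketch on a made-up vector (`x` and its contents are illustrative, not taken from the question). The pattern carries a leading `\\b` so that `the` is only matched at the start of a word, and the last element has two spaces after `of`.

```r
# Illustrative input: the last element has TWO spaces between "of" and "the"
x <- c("time of the day", "time the day", "time of  the day")

# Negative lookbehind that looks back over "of" plus exactly ONE whitespace char
res <- gsub("(?<!\\bof\\s)\\bthe\\b", "of the", x, perl = TRUE)

res[1]  # "time of the day"      (already has "of", left alone)
res[2]  # "time of the day"      ("of" inserted)
res[3]  # "time of  of the day"  (second space defeats the lookbehind)
```

The third result shows the limitation: the lookbehind only inspects the three characters right before `the`, so any extra whitespace makes it miss the preceding `of`.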
However, users often type more than one space between words.
Hence, a more generic solution is
> gsub("\\bof\\s+the\\b(*SKIP)(?!)|\\bthe\\b", "of the", vec, perl=TRUE)
[1] "time of the day" "word of the day" "time of the day"
See the regex demo and the R demo online.
Details:
`\bof\s+the\b` - matches `of`, one or more whitespace characters, and `the`, all as whole words
`(*SKIP)(?!)` - discards that match; the regex engine then goes on to search for the next match from the failure position
`|` - or
`\bthe\b` - matches `the` as a whole word in any other context.
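To see the skip-and-fail idea on inputs with irregular spacing, here is a small sketch. It assumes `\\s+` between `of` and `the` so that runs of whitespace are covered, and the vector `y` is made up for illustration.

```r
# Illustrative input with irregular spacing between "of" and "the"
y <- c("time of  the day", "time the day", "the end of   the day")

# First alternative: consume "of <spaces> the", then force a failure with (?!);
# (*SKIP) makes the engine resume searching AFTER the consumed text.
# Second alternative: any remaining whole-word "the" gets "of" prepended.
res_skip <- gsub("\\bof\\s+the\\b(*SKIP)(?!)|\\bthe\\b", "of the", y, perl = TRUE)

res_skip[1]  # "time of  the day"         (left alone)
res_skip[2]  # "time of the day"          ("of" inserted)
res_skip[3]  # "of the end of   the day"  (only the first "the" is changed)
```

Because the already-correct occurrences are consumed and discarded rather than inspected with a lookbehind, the amount of whitespace between the two words does not matter.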
If the number of whitespace characters between `of` and `the` is bounded, say 1 to 100, you can use a stringr-based solution like
library(stringr)
vec <- c("time of the day", "word of the day", "time the day")
str_replace_all(vec, "\\b(?<!\\bof\\s{1,100})the\\b", "of the")
## => [1] "time of the day" "word of the day" "time of the day"
See this online R demo. The ICU regex flavor used by the stringr regex functions allows limiting quantifiers in lookbehind patterns.
See this regex demo (the Java 8 option is used online, as it also supports constrained-width lookbehind patterns). Details:
`\b` - a word boundary
`(?<!\bof\s{1,100})` - a negative lookbehind that fails the match if there is a whole word `of` followed by one to 100 whitespace chars immediately before the current location
`the` - the literal string `the`
`\b` - a word boundary.
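As a rough check that the bounded lookbehind copes with several spaces, here is a short sketch. It assumes the stringr package is installed, and the vector `z` is invented for illustration.

```r
library(stringr)

# Illustrative input: the first element has three spaces after "of"
z <- c("time of   the day", "time of the day", "time the day")

# The ICU engine accepts the bounded quantifier {1,100} inside the lookbehind
res_icu <- str_replace_all(z, "\\b(?<!\\bof\\s{1,100})the\\b", "of the")

res_icu[1]  # "time of   the day"  (left alone despite the extra spaces)
res_icu[2]  # "time of the day"    (left alone)
res_icu[3]  # "time of the day"    ("of" inserted)
```

The upper bound of 100 is arbitrary. It only needs to be at least as large as the longest run of whitespace you expect between the two words.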