The problem is that regular expressions don't know what 'words' are; they only look for patterns. Consider the following keywords and string.
keywords = c("bad","good","sad","mad")
string = "Some good people live in the badlands which is maddeningly close to the sad harbor."
Here "bad" matches inside "badlands" because the pattern "bad" occurs in its first three characters. The same happens with "mad" and "maddeningly". The numbers below are the starting positions of each match.
sapply(keywords, regexpr, string, ignore.case=FALSE)
#> bad good sad mad
#> 30 6 73 48
So we need to modify the pattern so that it detects what we really want. The hard part is knowing what we really want. If we want a distinct word, then we can add boundaries around the keywords. As Andre noted in the comments, \b in a regex is a word boundary.
sapply(paste("\\b",keywords,"\\b",sep=""), regexpr, string, ignore.case=FALSE)
#> \\bbad\\b \\bgood\\b \\bsad\\b \\bmad\\b
#> -1 6 73 -1
Note that I used the paste function to stick an escaped \b (written "\\b" inside an R string) before and after each keyword. This returns the no-match code (-1) for 'bad' and 'mad' but finds the whole-word versions of 'good' and 'sad'.
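If you need this in more than one place, the boundary trick can be wrapped in a small function. The following is only a sketch: the name find_whole_words is made up for illustration, and it assumes the keywords contain no regex metacharacters (they are pasted into the pattern unescaped).

```r
# Hypothetical helper (not part of base R): position of the first whole-word
# match of each keyword in `text`, or -1 where there is no such match.
find_whole_words <- function(keywords, text, ignore.case = FALSE) {
  # Wrap every keyword in word boundaries; "\\b" in R source is \b in the regex.
  patterns <- paste0("\\b", keywords, "\\b")
  positions <- vapply(
    patterns,
    function(p) as.integer(regexpr(p, text, ignore.case = ignore.case)),
    integer(1)
  )
  # Label the results with the plain keywords rather than the full patterns.
  names(positions) <- keywords
  positions
}

keywords <- c("bad", "good", "sad", "mad")
string <- "Some good people live in the badlands which is maddeningly close to the sad harbor."

positions <- find_whole_words(keywords, string)
keywords[positions > 0]   # the keywords that occur as whole words
```

Using vapply instead of sapply is a design choice, not a requirement: it guarantees an integer vector even when keywords is empty.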
If you wanted to handle hyphenated words, you'd need to modify the boundary-matching portion of the expression, because \b treats a hyphen as a non-word character and so sees a word boundary on either side of it.
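As one possible reading of that last point, suppose a keyword should not count when it is part of a hyphenated compound such as "bad-tempered". A minimal sketch, assuming that is the behaviour you want, replaces \b with lookarounds that also reject a neighbouring hyphen. Lookarounds need perl=TRUE; the example string and the helper first_match are invented for illustration.

```r
keywords <- c("bad", "good", "sad")
hyphen_string <- "A bad-tempered but good-natured man felt sad."

# Hypothetical helper: start position of the first match of each pattern,
# or -1 for no match. perl = TRUE is needed for the lookaround syntax.
first_match <- function(patterns, text) {
  vapply(
    patterns,
    function(p) as.integer(regexpr(p, text, perl = TRUE)),
    integer(1),
    USE.NAMES = FALSE
  )
}

# Plain word boundaries: the hyphen counts as a boundary, so "bad" and
# "good" are still found inside the compounds.
plain_hits <- first_match(paste0("\\b", keywords, "\\b"), hyphen_string)

# Stricter boundaries: the keyword may not be preceded or followed by a
# word character or a hyphen, so only the free-standing "sad" is found.
strict_hits <- first_match(
  paste0("(?<![\\w-])", keywords, "(?![\\w-])"),
  hyphen_string
)

names(plain_hits) <- names(strict_hits) <- keywords
plain_hits
strict_hits
```

If you instead want to search for keywords that themselves contain hyphens, the same lookaround wrapper should work without changes, since the hyphen inside the keyword is matched literally.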