0

I am trying to match a large set of words to a column of strings. These words have to have an exact match.

I can do it for a single word at a time, but I am having trouble with multiple words.

x = c("red", "redish", "green", "greenish")
grepl("red|green", ignore.case=TRUE, x)

I would like this to return TRUE for "red" and "green", but not for "redish" or "greenish".

sanjeev

2 Answers

5

Regex lets you use \\b for word boundaries:

grepl("\\bred\\b|\\bgreen\\b", x, ignore.case = TRUE)
# [1]  TRUE FALSE  TRUE FALSE

This will work well if you want to match the words within longer strings:

grepl("\\bred\\b|\\bgreen\\b",
      c("I want to match red", "But not Fred",
        "Green yes please", "ignore wintergreen"),
      ignore.case=TRUE)
# [1]  TRUE FALSE  TRUE FALSE

However, if you're doing whole-string matching, regex is overkill; equality matching will be much faster:

tolower(x) %in% c("red", "green")
# [1]  TRUE FALSE  TRUE FALSE

If we start with patterns = c("red|green") we can get to either of the needed cases above:

## use this with `%in%`
individual_words = unlist(strsplit(patterns, split = "\\|")) 

## or paste on the word boundaries for regex
word_boundary_patterns = paste0("\\b", individual_words, "\\b", collapse = "|")
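Putting the pieces together, here is a minimal end-to-end sketch using the question's `x`. The name `hits` is only illustrative, and the sketch assumes the words contain no regex metacharacters (such as `.` or `(`), which would need escaping before being pasted into a pattern.

```r
x = c("red", "redish", "green", "greenish")
patterns = c("red|green")

# Split the single "a|b" string into a vector of individual words
individual_words = unlist(strsplit(patterns, split = "\\|"))

# Wrap every word in \\b ... \\b and join them back into one regex
word_boundary_patterns = paste0("\\b", individual_words, "\\b", collapse = "|")

# `hits` is a hypothetical name for the result
hits = grepl(word_boundary_patterns, x, ignore.case = TRUE)
hits
# [1]  TRUE FALSE  TRUE FALSE
```

For whole-string matching, `tolower(x) %in% individual_words` gives the same result on this input without building a regex at all.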
Gregor Thomas
  • Thanks Gregor, I am sorry I should have mentioned, that the list of words I have is very large. Could we add \\b...\\b or ^...$ within the code. I am having trouble adding those to the list of words separated by |, like "red|green". Thanks. – sanjeev May 20 '19 at 18:06
  • Thank you!! That helps. – sanjeev May 20 '19 at 18:11
  • There are some special cases that I have to deal with that has "-" and I need them to be read as a single word, that is, all words within a quote, like 'spring-concert' etc. In the example below I need a TRUE only for the last term -- x = c('5-year-old', '17-year-old', '7-year-old', 'year-old') grepl("\\byear-old\\b", ignore.case=TRUE, x) . Thanks for your help – sanjeev May 29 '19 at 04:16
  • Use the pattern `(?<!-)\\byear-old\\b` for that, where `(?<!-)` means "not preceded by a dash". You can [see here](https://stackoverflow.com/a/1324756/903061) how word boundaries are defined. Special cases like this one will have to be handled specially. – Gregor Thomas May 29 '19 at 13:15
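A note on running the lookbehind pattern from the comment above: R's default regex engine does not support lookaround, so `(?<!-)` needs `perl = TRUE`. A minimal sketch with the commenter's example vector (the name `year_old_hits` is only illustrative):

```r
x = c("5-year-old", "17-year-old", "7-year-old", "year-old")

# perl = TRUE switches to PCRE, which understands the (?<!-) lookbehind
year_old_hits = grepl("(?<!-)\\byear-old\\b", x, ignore.case = TRUE, perl = TRUE)
year_old_hits
# [1] FALSE FALSE FALSE  TRUE
```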
1

You can also use ^ and $ to anchor the pattern to the beginning and end of the string, respectively:

grepl("^red$|^green$", x, ignore.case = TRUE)
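For a long word list, one possible variation (not part of the original answer) is to put the anchors around a single group rather than repeating them for every word. The names `words` and `anchored_pattern` are hypothetical, and the words are assumed to contain no regex metacharacters.

```r
x = c("red", "redish", "green", "greenish")
words = c("red", "green")  # hypothetical word list

# Builds "^(red|green)$": one pair of anchors around all alternatives
anchored_pattern = paste0("^(", paste(words, collapse = "|"), ")$")

grepl(anchored_pattern, x, ignore.case = TRUE)
# [1]  TRUE FALSE  TRUE FALSE
```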
Felix T.