
What's the most elegant way to extract keywords from a sentence string?

I have a list of keywords from a CSV, and I want to find exact matches for the keywords that are present in the string.

```r
sapply(keywords, regexpr, String, ignore.case=FALSE)
```

I used the code above, but it also returns partial matches.

Tino
RAAAAM
Please read [how to make a great R example](https://stackoverflow.com/q/5963269/3250126) – loki Dec 20 '17 at 14:21

Since you did not provide any example, you make it very hard to help you. But I think if you want an exact match you need to make your `pattern` more restrictive, e.g. by using the `\\b` boundary before and after the word. – Andre Elrico Dec 20 '17 at 14:22

1 Answer


The problem is that regular expressions don't know what 'words' are; they look for patterns. Consider the following keywords and string.

```r
keywords = c("bad","good","sad","mad")
string = "Some good people live in the badlands which is maddeningly close to the sad harbor."
```

Here "bad" matches "badlands" because the pattern "bad" is found in the first three characters of that word. The same happens with "mad" and "maddeningly".

```r
sapply(keywords, regexpr, string, ignore.case=FALSE)
#> bad good  sad  mad 
#>  30    6   73   48 
```

So we need to modify the pattern so that it detects what we really want. The hard part is knowing what that is. If we want each keyword as a distinct word, we can add boundaries around the keywords. As Andre noted in the comments, `\b` in a regex is a word boundary.

```r
sapply(paste("\\b",keywords,"\\b",sep=""), regexpr, string, ignore.case=FALSE)
#> \\bbad\\b \\bgood\\b  \\bsad\\b  \\bmad\\b 
#>        -1          6         73         -1
```

Note that I used the `paste` function to put an escaped `\b` before and after each keyword. This returns the no-match code (-1) for 'bad' and 'mad' but finds the whole-word occurrences of 'good' and 'sad'.
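If what you want back is the keywords themselves rather than their positions, one possible sketch is to test each bounded pattern with `grepl` and subset the keyword vector. The names `patterns` and `found` are only illustrative, and this assumes the keywords contain no regex metacharacters.

```r
keywords <- c("bad", "good", "sad", "mad")
string <- "Some good people live in the badlands which is maddeningly close to the sad harbor."

# Wrap each keyword in word boundaries, as above
patterns <- paste0("\\b", keywords, "\\b")

# TRUE/FALSE per keyword: does it occur as a whole word in the string?
is_present <- vapply(patterns, grepl, logical(1), x = string)

# Keep only the keywords that matched as whole words
found <- keywords[is_present]
found
#> [1] "good" "sad"
```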

If you need to handle hyphenated words, you'd have to modify the boundary-matching portion of the expression, because `\b` treats a hyphen as a word boundary (so "bad" would still match inside "bad-tempered").
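As a rough sketch of one way to do that, you could swap `\b` for Perl-style lookarounds that also count a hyphen as part of a word. This is only one possible definition of a "word", and the example sentence and the names `loose` and `strict` are made up for illustration.

```r
keywords <- c("bad", "good")
string <- "A bad-tempered but good dog."

# With \b, the hyphen counts as a boundary, so "bad" still matches
loose <- grepl("\\bbad\\b", string)

# Lookarounds: the keyword must not be preceded or followed by
# a word character or a hyphen (requires perl = TRUE)
patterns <- paste0("(?<![\\w-])", keywords, "(?![\\w-])")
strict <- vapply(patterns, grepl, logical(1), x = string, perl = TRUE)

loose           # TRUE: "bad" is found inside "bad-tempered"
unname(strict)  # FALSE for "bad", TRUE for "good"
```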

Mark