
I was reading about regex and came across word boundaries. I found a question about the difference between \b and \B, but running the code from that question does not give the output I expect:

grep("\\bcat\\b", "The cat scattered his food all over the room.", value= TRUE)
# I expect "cat" but it returns the whole string.

grep("\\B-\\B", "Please enter the nine-digit id as it appears on your color - coded pass-key.", value= TRUE)
# I expect "-" but it returns the whole string.

I use the code as described in the question but with two backslashes as suggested here. Using one backslash does not work either. What am I doing wrong?

2 Answers


grep returns the whole string because it only checks whether a match is present in the string. If you want to extract cat, you need another function, such as str_extract from the stringr package:

library(stringr)
str_extract("The cat scattered his food all over the room.", "\\bcat\\b")
[1] "cat"
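
To make the point about grep concrete, here is a minimal base R sketch. The second test string and the variable names are made up for illustration; the idea is only to show that grep reports which elements match, never the matched substring:

```r
# Two illustrative strings: only the first contains "cat" as a whole word
x <- c("The cat scattered his food.", "No felines here.")

idx <- grep("\\bcat\\b", x)                # position of the matching element
val <- grep("\\bcat\\b", x, value = TRUE)  # the matching element, in full
hit <- grepl("\\bcat\\b", x)               # one TRUE/FALSE per element

idx  # 1
val  # "The cat scattered his food."
hit  # TRUE FALSE
```

So value = TRUE switches the result from indices to elements, but it still returns whole elements.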

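The difference between \b and \B is that \b marks a word boundary whereas \B is its negation. That is, \\bcat\\b matches only if cat stands as a whole word, with a non-word character (white space, punctuation) or the start or end of the string on each side, whereas \\Bcat\\B matches only if cat is inside a word, with a word character on each side. For example: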

str_extract_all("The forgot his education and scattered his food all over the room.", "\\Bcat\\B") 
[[1]]
[1] "cat" "cat"

These two matches are from education and scattered.
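
A small sketch of how the two patterns behave on a few hand-picked inputs may help. The test strings are assumptions chosen for illustration, and perl = TRUE is used here only as a choice to pin down PCRE semantics for \b and \B:

```r
# Illustrative inputs: whole word, word next to punctuation, word inside a longer word
x <- c("cat", "cat, dog", "(cat)", "concatenate", "cats")

whole_word <- grepl("\\bcat\\b", x, perl = TRUE)
inside_word <- grepl("\\Bcat\\B", x, perl = TRUE)

whole_word   # TRUE TRUE TRUE FALSE FALSE
inside_word  # FALSE FALSE FALSE TRUE FALSE
```

Note that "cats" matches neither pattern: there is a boundary before the c but none after the t.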

Chris Ruehlemann

You can use regexpr and regmatches to get the match. grep only reports which elements contain a match (their indices, or the elements themselves with value = TRUE). You can also use sub.

x <- "The cat scattered his food all over the room."
regmatches(x, regexpr("\\bcat\\b", x))
#[1] "cat"
sub(".*(\\bcat\\b).*", "\\1", x)
#[1] "cat"

x <- "Please enter the nine-digit id as it appears on your color - coded pass-key."
regmatches(x, regexpr("\\B-\\B", x))
#[1] "-"
sub(".*(\\B-\\B).*", "\\1", x)
#[1] "-"
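
One difference between the two approaches shows up when the pattern does not match at all. The sketch below uses a made-up string y without a standalone cat; regmatches then returns an empty character vector, while sub returns the input unchanged:

```r
# Illustrative string with no standalone "cat"
y <- "The dog scattered his food."

no_match_regmatches <- regmatches(y, regexpr("\\bcat\\b", y))
no_match_sub <- sub(".*(\\bcat\\b).*", "\\1", y)

length(no_match_regmatches)  # 0
identical(no_match_sub, y)   # TRUE
```

If you rely on sub, you may want to check with grepl first so that an unmatched string is not mistaken for a result.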

For more than 1 match use gregexpr:

x <- "1abc2"
regmatches(x, gregexpr("[0-9]", x))
#[[1]]
#[1] "1" "2"
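
The same call also works on a character vector, in which case you get one list element per input string. A short sketch with illustrative inputs:

```r
# Three illustrative strings with two, zero, and one digit
x <- c("1abc2", "abc", "a7")

m <- regmatches(x, gregexpr("[0-9]", x))

lengths(m)  # 2 0 1
unlist(m)   # "1" "2" "7"
```

Strings without a match contribute an empty element, so unlist drops them silently.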
GKi
  • I like that your answer uses base R. I had another example where the pattern occurs multiple times in the string. In `stringr` we can do `str_extract_all("1abc2", "[0-9]")` for multiple occurrences (it returns 1 *and* 2). But `regmatches("1abc2", regexpr("[0-9]", "1abc2"))` returns only 1. Is there a way to do this with your approach? –  Apr 22 '20 at 07:51
  • Yes: Use `gregexpr` instead of `regexpr`. I added it in the answer. – GKi Apr 22 '20 at 07:54