
For the regex (456)\b and the input 123456 xyz, it works as expected and the output is 456. Case 1.

For almost the same regex, (456)#\b, and the input 123456# xyz, I expected the output to be 456#, because \b should still match the end of the line after matching #.

But the regex engine failed to find a match. Case 2.

Strangely, it works for the regex (456)#\B. Notice the non-word boundary \B in this regex. Case 3. What does \B match here?

I went through this answer to understand \b and \B, and it seems my understanding is right.

So why is it strange? What am I missing here? Why does \B work while \b doesn't in case 2 and case 3?

Arun Gowda

3 Answers


A word character is a character from a-z, A-Z, 0-9, including the _ (underscore) character.

The # is not a word character, and neither is the space that follows it, so there is no word boundary after the #.
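As a minimal sketch of the point above, assuming Python's `re` module (the question does not say which engine was used, though `\w`, `\b` and `\B` behave the same way for these ASCII inputs in most engines), the three cases from the question can be reproduced like this. The names `is_word`, `case1`, `case2` and `case3` are only illustrative.

```python
import re

text = "123456# xyz"

# \w matches word characters: letters, digits and the underscore.
is_word = {ch: bool(re.fullmatch(r"\w", ch)) for ch in "6# x"}

# Case 1: '6' is followed by a space, so a boundary follows 456.
case1 = re.search(r"(456)\b", "123456 xyz")
# Case 2: '#' and the space are both non-word characters, so \b fails.
case2 = re.search(r"(456)#\b", text)
# Case 3: the same position is a non-boundary, so \B succeeds.
case3 = re.search(r"(456)#\B", text)

print(is_word)          # which of the sample characters count as word characters
print(case1.group(0))   # 456
print(case2)            # None
print(case3.group(0))   # 456#
```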

AleksW
  • But `\b` is also supposed to match a word boundary, which in this case is the space. – Arun Gowda May 21 '19 at 12:36
  • A word boundary is where a 'word' ends and a 'non-word' begins; a space is just one of many non-word characters. So where you have 456#, the word boundary is between 6 and #, as 6 is a word character and # is not. – AleksW May 21 '19 at 12:37

A word boundary asserts a position that can be described by the regex (^\w|\w$|\W\w|\w\W): the boundary lies between a word character and a non-word character, or between a word character and the start or end of the string. A word character here is anything in [a-zA-Z0-9_].

So in your case, for the regex (456)#\b, trying to match the string 123456# xyz will fail, since # and the space after it are BOTH non-word characters (a boundary needs one word character and one non-word character), and so the position does not satisfy the above regex.

Amusingly, if you add a word character after the # in the string, say 123456#b xyz, it will match, as shown here
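A rough way to check this, assuming Python's `re` module: the boundary can be written as a zero-width lookaround expression and compared against `\b` on both strings. `WORD_BOUNDARY` and `positions` are hypothetical names used for illustration only; the lookaround form is one common way to spell the same condition, not necessarily how any engine implements it.

```python
import re

# A word character on exactly one side of the position (start/end count as non-word).
WORD_BOUNDARY = r"(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))"

def positions(pattern, text):
    """Return every index at which the zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]

for sample in ("123456# xyz", "123456#b xyz"):
    # The lookaround spelling and \b agree on every position.
    assert positions(r"\b", sample) == positions(WORD_BOUNDARY, sample)
    match = re.search(r"(456)#\b", sample)
    print(sample, "->", match.group(0) if match else None)
```

With the extra `b`, the position after `#` has a non-word character on the left and a word character on the right, so `(456)#\b` matches `456#` in the second string but not in the first.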

Kamehameha
  • Can you define `\B` like this: `(^\w|\w$|\W\w|\w\W)`? – Arun Gowda May 21 '19 at 16:17
  • Nope. That is `\b`. `\B` is the direct negation of `\b`, so any position that is not `\b` is `\B`. You can refer to the explanation box on the right side of regex101.com for any regex to understand exactly what these anchor tokens consist of. – Kamehameha May 22 '19 at 07:31

A word boundary \b is defined as the point between a word character and a non-word character. Assuming the standard C locale, # and space are both non-word characters, so there is no word boundary between them.
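To see where the boundaries actually fall in the question's input, here is a small sketch, assuming Python 3.7+ and its `re` module (where `finditer` reports empty matches at every position). The variable names are illustrative only.

```python
import re

text = "123456# xyz"

# Every index where \b (boundary) or \B (non-boundary) matches.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
non_boundaries = [m.start() for m in re.finditer(r"\B", text)]

# The position right after '#', where the question expected \b to match.
after_hash = text.index("#") + 1

print(boundaries)       # [0, 6, 8, 11]
print(non_boundaries)   # [1, 2, 3, 4, 5, 7, 9, 10]
print(after_hash in boundaries, after_hash in non_boundaries)  # False True
```

Every position is in exactly one of the two lists, and the position between `#` and the space lands in the `\B` list, which is why case 3 matches and case 2 does not.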

JGNI