2

I am reading about boundary matchers in the Oracle documentation. I understand most of it, but I am not able to grasp the \b boundary matcher. Here is the example from the documentation.

To check if a pattern begins and ends on a word boundary (as opposed to a substring within a longer string), just use \b on either side; for example, \bdog\b

Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.

Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.

To match the expression on a non-word boundary, use \B instead:

Enter your regex: \bdog\B
Enter input string to search: The dog plays in the yard.
No match found.

Enter your regex: \bdog\B
Enter input string to search: The doggie plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.

In short, I am not able to understand how \b works. Can someone help me by describing its usage and explaining this example?

Thanks

benz
  • `\b` denotes a word boundary. Saying `\bfoo\b` would match `foo` or `bar foo baz` but not `foobar` or `barfoo`. – devnull Feb 11 '14 at 07:30

5 Answers

3

\b is what you can call an "anchor": it will match a position in the input text.

More specifically, \b will match every position in the input text where:

  • there is no preceding character and the following character is a word character (any letter or digit, or an underscore);
  • there is no following character and the preceding character is a word character;
  • the preceding character is a word character and the following character is not; or
  • the following character is a word character and the preceding character is not.

For instance, when the regex dog\b is applied to the text "my dog eats", the \b matches the position immediately after the g of dog (which is a word character) and before the following space (which is not).

Note that like all anchors, the fact that it matches a position means that it does not consume any input text.
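To make the "position, not text" idea concrete, here is a minimal Java sketch that lists every index where \b matches in "my dog eats". The class and method names (BoundaryPositions, boundaries) are made up for illustration; only java.util.regex is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryPositions {
    // Collects the index of every position where \b matches in the input.
    public static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            // Each match is zero-length, so start() and end() are the same
            // index; recording start() is enough.
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Indexes: m=0 y=1 ' '=2 d=3 o=4 g=5 ' '=6 e=7 a=8 t=9 s=10
        System.out.println(boundaries("my dog eats"));
    }
}
```

For this input the list should be [0, 2, 3, 6, 7, 11]: the start and end of each of the three words, with index 6 being the position after the g of dog.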

Other anchors include ^, $, and lookarounds.

fge
  • In your point 4, does it mean that if there is a non-word character such as a space, and the following character is a word character, the position qualifies as a word boundary? – benz Feb 11 '14 at 07:47
  • Yes, it does mean that. For instance, `\b` in `\bdog` will match the position between the space and `d` in "my dog eats". – fge Feb 11 '14 at 07:49
  • Or, in fact, the position between $ and d in $dog. – Vogel612 Feb 11 '14 at 07:52
  • @fge Thank you so very much. Stack Overflow always rocks. – benz Feb 11 '14 at 08:04
2

The docs don't seem to explain what exactly a word boundary is. Let me try:

\b matches a position between characters (so it doesn't match any text itself, it just asserts that a certain condition is met at the current position in the string). That condition is defined as:

There either is a character of the character set defined by \w (alphanumerics and underscore) before the current position or after the current position, but not both.

The inverse is true for \B - it matches iff \b doesn't match at the current position.
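As a rough check of this rule, the sketch below compares "a \w character on exactly one side" against what the regex engine reports for \b and \B at every position. It is an illustration only: the names (BoundaryRule, agrees, and so on) are hypothetical, and the inputs are kept to ASCII because non-ASCII handling depends on flags such as Pattern.UNICODE_CHARACTER_CLASS.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryRule {
    // True if index i is inside the string and that character matches \w.
    static boolean isWordCharAt(String s, int i) {
        return i >= 0 && i < s.length()
                && Pattern.matches("\\w", String.valueOf(s.charAt(i)));
    }

    // Marks every position (0..length) where the zero-length regex matches.
    static boolean[] positionsMatching(String regex, String s) {
        boolean[] hit = new boolean[s.length() + 1];
        Matcher m = Pattern.compile(regex).matcher(s);
        while (m.find()) {
            hit[m.start()] = true;
        }
        return hit;
    }

    // Checks, for every position, that \b agrees with the XOR rule and that
    // \B matches exactly where \b does not.
    public static boolean agrees(String s) {
        boolean[] b = positionsMatching("\\b", s);
        boolean[] notB = positionsMatching("\\B", s);
        for (int pos = 0; pos <= s.length(); pos++) {
            boolean rule = isWordCharAt(s, pos - 1) ^ isWordCharAt(s, pos);
            if (b[pos] != rule || notB[pos] == b[pos]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agrees("The dog, 42!"));
    }
}
```

The XOR (`^`) is the "but not both" part of the rule; a missing neighbour at either end of the string simply counts as "not a \w character".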

Tim Pietzcker
  • Not quite. In Java, `\w` will only match ASCII letters ;) – fge Feb 11 '14 at 07:43
  • @fge: Not at all. It also matches ASCII digits and the underscore. "Alphanumerics" are defined as ASCII letters and digits. – Tim Pietzcker Feb 11 '14 at 07:44
  • I mean to say that it won't match non ASCII alphanumerics. – fge Feb 11 '14 at 07:45
  • @fge: [Alphanumerics are always ASCII](http://en.wikipedia.org/wiki/Alphanumeric). – Tim Pietzcker Feb 11 '14 at 07:47
  • The point remains anyway: the statement "There either is a character of the character set defined by `\w`" is incorrect ;) – fge Feb 11 '14 at 07:50
  • Also, that statement is wrong for German, French, Spanish and all Scandinavian as well as Eastern European languages. American alphanumerics are always ASCII; that's why it's the American Standard Code for Information Interchange. – Vogel612 Feb 11 '14 at 08:00
  • Even in other languages, alphanumerics [do *not* include non-ASCII letters](http://fr.wikipedia.org/wiki/Caract%C3%A8re_alphanum%C3%A9rique). Please provide evidence to the contrary. And `\w` in Java matches ASCII alnums, unless you specify `Pattern.UNICODE_CHARACTER_CLASS` [in which case the definition of `\b` also changes](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions). – Tim Pietzcker Feb 11 '14 at 08:08
0

Simply speaking, \b matches the position between a \w character and a \W character (that is, one that is not \w), and thus marks the start or end of a word. The start and end of the string count as \W here.

The most common \W characters you may find are:

  • Whitespace
  • Comma
  • Full stop
  • Special characters (§, $, %, [...])
  • Not the underscore (it is a \w character)
  • Anything not ASCII (umlauts, Cyrillic, Arabic, [...])
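A quick way to check a list like this is to ask the engine directly. The following is a small sketch, not a reference: the class and method names are invented, and it also shows the Pattern.UNICODE_CHARACTER_CLASS flag mentioned in the comments on another answer, which changes how non-ASCII letters are classified.

```java
import java.util.regex.Pattern;

public class NonWordChars {
    // Default behaviour: \W is anything outside [a-zA-Z_0-9].
    public static boolean isNonWord(String ch) {
        return Pattern.matches("\\W", ch);
    }

    // Same test with Unicode-aware predefined character classes enabled.
    public static boolean isNonWordUnicode(String ch) {
        return Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(ch)
                .matches();
    }

    public static void main(String[] args) {
        // "\u00e4" is the a-umlaut, written as an escape to avoid
        // source-encoding problems.
        String[] samples = {" ", ",", ".", "$", "_", "a", "7", "\u00e4"};
        for (String ch : samples) {
            System.out.println("'" + ch + "' default=" + isNonWord(ch)
                    + " unicode=" + isNonWordUnicode(ch));
        }
    }
}
```

With the default flags the underscore, letters and digits come out as word characters, while the umlaut counts as \W; with UNICODE_CHARACTER_CLASS the umlaut is treated as a word character instead.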

\B is just the inverse of \b.

--> It matches every position that \b does not match (e.g. between two \w characters or between two \W characters).

You can experiment with Java regular expressions here

Vogel612
  • No, this is not quite correct. `\b` matches a _position_ in the input text, it will never consume a character. – fge Feb 11 '14 at 07:35
  • This is very wrong. `[^\b]` matches a character that is not a backspace control code, your list of characters is very incomplete, and `\b` is zero-length (and not restricted to the ends of words). – Tim Pietzcker Feb 11 '14 at 07:36
  • @TimPietzcker is this better? – Vogel612 Feb 11 '14 at 07:43
  • Better use `\W` instead of `^\w` to avoid confusion with start-of-string anchors. And `\b` also matches at the start/end of a string, if there's a `\w` character next to it... – Tim Pietzcker Feb 11 '14 at 07:51
0

\b matches the empty string at the beginning or end of a word.

The metacharacter \b is an anchor, like the caret and the dollar sign.

It matches at a position that is called a "word boundary". This match is zero-length.

\B is the opposite of \b.

\B matches the empty string anywhere that is not the beginning or end of a word.

Nambi
0

For \b, if there is a word character on one side of the position, there must be a non-word character (or no character at all) on the other side.

For \B, if there is a word character on one side, there must also be a word character on the other side. If there is a non-word character on one side, there must also be a non-word character on the other side.

The word characters are A-Z, a-z, 0-9 and _; in the C locale, all other characters are non-word characters.
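Applied to the four cases from the question, these rules can be tried out with a short Java sketch. The helper name firstMatch and the class name DogExamples are invented for this example; the regexes and input strings are the ones quoted from the documentation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DogExamples {
    // Returns "start-end" for the first match, or "no match" if there is none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + "-" + m.end() : "no match";
    }

    public static void main(String[] args) {
        String dog = "The dog plays in the yard.";
        String doggie = "The doggie plays in the yard.";

        // \b after "dog": needs a non-word character (or the end) after the g.
        System.out.println(firstMatch("\\bdog\\b", dog));
        System.out.println(firstMatch("\\bdog\\b", doggie));

        // \B after "dog": needs another word character after the g.
        System.out.println(firstMatch("\\bdog\\B", dog));
        System.out.println(firstMatch("\\bdog\\B", doggie));
    }
}
```

In "The dog plays", the g is followed by a space, so the trailing \b succeeds and the trailing \B fails. In "The doggie plays", the g is followed by another g, so it is the other way around, which is the behaviour shown in the documentation's output.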

Sswater Shi