Suppose I have a sentence and want to display the word, or all words, that follow a particular matched word. For example, I would like to display the word fox after brown in The quick brown fox jumps over the lazy dog. I know I can use a positive lookbehind, e.g. (?<=brown\s)(\w+); however, I don't quite understand the use of \b in (?<=\bbrown\s)(\w+). I am using http://gskinner.com/RegExr/ as my tester.

PeanutsMonkey

7 Answers

\b is a zero-width assertion. That means it does not match a character; it matches a position with one kind of thing on its left side and another kind of thing on its right side.

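The word boundary \b matches at a change from \w (a word character) to \W (a non-word character), or from \W to \w. The start and end of the string also count as a boundary when the adjacent character is a word character.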

Which characters \w includes depends on your regex flavor. At minimum it covers all ASCII letters, all ASCII digits and the underscore. If your regex engine supports Unicode, \w may also include every character that has the Unicode property letter or number.

\W matches every character that is NOT in \w.
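To make "zero width" concrete, here is a minimal sketch in Python's re module (an assumption on my part, since the question only mentions RegExr). It lists the positions where \b matches in the sample sentence and checks a few characters against \w; the variable names are purely illustrative.

```python
import re

text = "The quick brown fox"

# Every \b match is an empty string: it marks a position, not a character.
boundary_matches = list(re.finditer(r"\b", text))
boundaries = [m.start() for m in boundary_matches]
print(boundaries)  # [0, 3, 4, 9, 10, 15, 16, 19]

# Letters, digits and the underscore are word characters; space and hyphen are not.
is_word_char = {ch: bool(re.fullmatch(r"\w", ch)) for ch in ["a", "7", "_", " ", "-"]}
print(is_word_char)
```

Each listed position sits between a word character and a non-word character, or at the start or end of the string next to a word character.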

\bbrown\s

will match here

The quick brown fox
         ^^

but not here

The quick bbbbrown fox

because there is no word boundary between the extra b and brown, i.e. no change from a non-word character to a word character: both characters are included in \w.

When your regex reaches a \b, it looks at the next character, which here is the b of brown. Now the \b knows what is on its right side: a word character, the b. For the \b to succeed it also has to look back: there must be a non-word character (or the start of the string) before the b. If there is a space (which is not in \w), the \b before the b succeeds. But if there is another b, it fails, and then \bbrown does not match "bbrown".

The regex brown would match both strings, "quick brown" and "bbrown", whereas the regex \bbrown matches only "quick brown" and NOT "bbrown".
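The comparison above can be checked with a short sketch, again assuming Python's re module; the list names are only for illustration.

```python
import re

samples = ["quick brown", "bbrown"]

# Without \b, "brown" is found anywhere, even inside a longer word.
plain = [bool(re.search(r"brown", s)) for s in samples]

# With \b, "brown" must start at a word boundary.
anchored = [bool(re.search(r"\bbrown", s)) for s in samples]

print(plain)     # [True, True]
print(anchored)  # [True, False]
```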

For more details, see www.regular-expressions.info.

stema
  • Thanks. Each time I think I have understood I get more confused. Sorry for being a newbie here but a few questions. What do you mean by `it matches a position with one thing on the left side and another thing on the right side`? I also didn't quite follow your comment in the second paragraph. – PeanutsMonkey Sep 30 '11 at 06:30
  • @PeanutsMonkey I added two more paragraphs and a useful link, hope it becomes clearer now. – stema Sep 30 '11 at 06:48
  • Thanks, the second-to-last paragraph helped clarify what you meant. To reiterate, so that I am clear: if I have a short sentence such as `quick brown fox and bbrown cat` and use \b, it would match brown and display fox but not cat. Did I understand that correctly? – PeanutsMonkey Oct 01 '11 at 23:54

The \b token is kind of special. It doesn't actually match a character. Instead, it matches any position that lies at the boundary of a word (where "word" in this case is anything that matches \w+). So the pattern (?<=brown\s)(\w+) would find a match in "bbbbrown fox" (capturing "fox"), but (?<=\bbrown\s)(\w+) wouldn't, since the position between "bb" and "brown" is in the middle of a word, not at its boundary.
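A small sketch of the two lookbehind patterns from the question, assuming Python's re module (which accepts these lookbehinds because they have a fixed width); the names without_b and with_b are illustrative only.

```python
import re

without_b = re.compile(r"(?<=brown\s)(\w+)")
with_b = re.compile(r"(?<=\bbrown\s)(\w+)")

results = {}
for text in ["The quick brown fox", "bbbbrown fox"]:
    # findall returns the captured word after "brown ", if any.
    results[text] = (without_b.findall(text), with_b.findall(text))
    print(text, "->", results[text])
```

For the first string both patterns capture "fox"; for "bbbbrown fox" only the pattern without \b does.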

Lily Ballard
  • Thanks. I don't quite follow what you mean by `since the position between "bb" and "brown" is in the middle of a word, not at its boundary`? – PeanutsMonkey Sep 30 '11 at 02:03
  • @PeanutsMonkey: The boundaries of "bbbrown" are before the first "b" or after the "n". In the middle of the word, between the "bb" and the "brown", you would not consider that a boundary and thus `\b` won't match there. – Lily Ballard Sep 30 '11 at 16:58

\b is a "word boundary": the position between the start or end of a word and the adjacent "non-word" characters.

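Its main use is to simplify the selection of a whole word, so \bbrown\s will match brown (followed by whitespace) at the start of a line, after a space, or after punctuation:

^brown  brown  -brown

It will not match in 99brown or _brown, because digits and the underscore are word characters, so there is no boundary before the b.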

It's more or less equivalent to "\W*", except when "capturing" strings, as "\b" matches the start of the word rather than the non-word character preceding or following the word.

James Anderson
  • What do you mean by `non-word`? – PeanutsMonkey Sep 30 '11 at 02:06
  • In a regular expression, a "word character" ("\w") is defined as any character in [a-zA-Z0-9_] or something similar; "\W" is anything not in that set. "\b" can be thought of as the position just before the first or just after the last character in a sequence of "\w" characters. – James Anderson Sep 30 '11 at 02:13
  • So if I just wanted to limit the boundary to just `brown` what would the expression be? Also how is (?<=(brown)\s)(\w+) different to (?<=\bbrown\s)(\w+)? – PeanutsMonkey Sep 30 '11 at 04:56
  • Equivalent to `\W*`? I wouldn't say that. `\b` prevents some matches and permits others without consuming any characters. `\W*` doesn't *prevent* anything. It will consume any non-word characters it happens to find, but it will match *anywhere*. – Alan Moore Sep 30 '11 at 10:02
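The difference raised in the last comment can be seen in a brief sketch, assuming Python's re module; the helper name show is hypothetical.

```python
import re

def show(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    m = re.search(pattern, text)
    return m.group() if m else None

# \W* consumes non-word characters it finds, but it never prevents a match.
loose_spaced = show(r"\W*brown", "quick brown")  # ' brown' (the space is consumed)
loose_glued = show(r"\W*brown", "bbrown")        # 'brown'  (still matches mid-word)

# \b consumes nothing, and it rejects the mid-word position.
strict_spaced = show(r"\bbrown", "quick brown")  # 'brown'
strict_glued = show(r"\bbrown", "bbrown")        # None

print(repr(loose_spaced), repr(loose_glued), repr(strict_spaced), repr(strict_glued))
```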

\b is a zero width match of a word boundary.

(Either the start or the end of a word, where "word" is defined as \w+)

Note: "zero width" means that if the \b is within a regex that matches, it does not add any characters to the text captured by that match. For example, the regex \bfoo\b, when matched, will capture just "foo": although the \b contributed to the way that foo was matched (i.e. as a whole word), it didn't contribute any characters.
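A minimal sketch of that note, assuming Python's re module; the sample text is made up for illustration.

```python
import re

text = "a foo, some food, and foo_bar"

# Only the standalone "foo" qualifies: "food" and "foo_bar" continue with
# word characters, so there is no boundary right after "foo" in them.
matches = [(m.group(), m.span()) for m in re.finditer(r"\bfoo\b", text)]
print(matches)  # [('foo', (2, 5))]
```

The match spans exactly three characters, so the two \b assertions added nothing to the matched text.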

Bohemian

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. It's equivalent to this:

(?<=\w)(?!\w)|(?=\w)(?<!\w)

...or it's supposed to be. See this question for everything you ever wanted to know about word boundaries. ;)
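As a rough check of that equivalence, the sketch below compares the positions found by both patterns. It assumes Python's re module and a few ASCII samples; other engines or inputs may behave differently, which is the caveat above. The helper positions is hypothetical.

```python
import re

builtin = re.compile(r"\b")
spelled_out = re.compile(r"(?<=\w)(?!\w)|(?=\w)(?<!\w)")

def positions(pattern, text):
    """Return every index at which the zero-width pattern matches."""
    return [m.start() for m in pattern.finditer(text)]

samples = ["The quick brown fox", "bbbbrown fox", "99brown _brown", ""]
agree = [positions(builtin, s) == positions(spelled_out, s) for s in samples]
print(agree)  # [True, True, True, True]
```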

Alan Moore

\b guarantees that brown starts on a word boundary, effectively excluding strings like

blackandbrown

ennuikiller
  • Thanks. When you say word boundary, is it limited to alphabets or does it also include numbers e.g. 123, etc? I take it that if there were spaces in the words black and brown, the word boundary would not work? – PeanutsMonkey Sep 30 '11 at 02:01

You don't need a lookbehind; you can simply use the following and read the word from the second capture group:

(\bbrown\s)(\w+)
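
A short sketch of this approach, assuming Python's re module. Without the lookbehind, "brown " becomes part of the overall match, so the wanted word has to be read from group 2.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog"
m = re.search(r"(\bbrown\s)(\w+)", sentence)

whole = m.group(0)   # 'brown fox': the full match includes "brown "
prefix = m.group(1)  # 'brown '
word = m.group(2)    # 'fox'
print(repr(whole), repr(prefix), repr(word))
```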
xthexder