What is a word boundary when using regex in Python?

Question

What is a word boundary in a Python regex? Can someone please explain this on these examples:

>>> import re
>>> x = '456one two three123'
>>> y=re.search(r"\btwo\b",x)
>>> y
<_sre.SRE_Match object at 0x2aaaaab47d30>

>>> y=re.search(r"two",x)
>>> y
<_sre.SRE_Match object at 0x2aaaaab47d30>

>>> ip="192.168.254.1234"
>>> if re.search(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",ip):
...    print(ip)
...

>>> ip="192.168.254.1234"
>>> if re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",ip):
...    print(ip)
192.168.254.1234

The documentation has the answer: http://docs.python.org/library/re.html#regular-expression-syntax – David Heffernan, Apr 13 '12 at 08:58
Instead of us explaining how four examples work, why don't you ask about what you don't understand? For example, what output were you expecting, and what came out instead? – Rik Poggi, Apr 13 '12 at 08:58
I want to know why \b is required. If I do not give examples, everyone comments that I have not tried anything; if I give examples, someone asks "why don't you ask about what you don't understand?" :) A varied set of people looks at these posts :) – Rajeev, Apr 13 '12 at 09:08
If I put `regex \b` into Google, I get http://www.regular-expressions.info/wordboundaries.html as the first result. – Karl Knechtel, Apr 13 '12 at 09:16

score 14 · Accepted Answer · answered Apr 13 '12 at 09:13

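A "word boundary" is exactly what it sounds like: the boundary of a word, i.e. either its beginning or its end. More precisely, `\b` matches between a word character (`\w`: a letter, digit, or underscore) and a non-word character, or between a word character and the start or end of the string.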

It does not match any actual character in the input; it only succeeds if the current match position is at the beginning or end of a word.

This is important because, unlike matching whitespace, it can also match at the beginning or end of the entire input.
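
Here is a small sketch of both points, using only the standard `re` module. The sample string and variable names are made up for illustration.

```python
import re

text = "two"

# \b is zero-width: it matches a position, not a character,
# so the match is an empty string at index 0.
boundary_span = re.search(r"\b", text).span()

# Requiring whitespace on both sides fails when the word
# touches the start and end of the input...
with_spaces = re.search(r"\stwo\s", text)

# ...while \b succeeds, because the start and end of the input
# count as boundaries when the adjacent character is a word character.
with_boundaries = re.search(r"\btwo\b", text)

print(boundary_span)            # (0, 0)
print(with_spaces)              # None
print(with_boundaries.group())  # two
```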

So `\bfoo` will be found (with `re.search`) in 'foobar' and 'foo bar' and 'bar foo', but not in 'barfoo'.

`foo\b` will be found in 'foo bar' and 'bar foo' and 'barfoo', but not in 'foobar'.
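
These cases can be checked with `re.search` (not `re.match`, which only tries the start of the string). This is a minimal sketch: the helper `found` is a hypothetical name, and the last two strings are extra samples showing which characters count as part of a word.

```python
import re

def found(pattern, text):
    """Return True if pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

samples = ["foobar", "foo bar", "bar foo", "barfoo"]

# \bfoo: "foo" must start at a word boundary.
starts = {s: found(r"\bfoo", s) for s in samples}
# foo\b: "foo" must end at a word boundary.
ends = {s: found(r"foo\b", s) for s in samples}

for s in samples:
    print(s, starts[s], ends[s])

# The underscore is a word character, the hyphen is not.
hyphen = found(r"foo\b", "foo-bar")
underscore = found(r"foo\b", "foo_bar")
print(hyphen, underscore)  # True False
```

The same rule explains why `\bone\b` would not be found in '456one two three123': digits are word characters too, so there is no boundary between '6' and 'o'.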

Karl Knechtel

Please note that in these examples the result of the match will always only contain 'foo' from e.g. 'foo bar' and so on. Just to make this clear. – HWende Apr 13 '12 at 09:21
Yes. Also, "match" is actually imprecise, as you'd have to use `re.search` to get a positive result for the strings not starting with `foo`. – Karl Knechtel Apr 13 '12 at 09:26
What characters are considered for word boundaries? Would `foo\b` match `foo-bar`, `foo_bar`, `foo=bar`, or `foo.bar`? – Stevoisiak Mar 01 '23 at 19:59
@Stevoisiak I'm not sure that I knew that confidently in 2012, although I certainly could have researched and tested it. That said, your comment drew my attention to the fact that this question is a duplicate. The canonical, which I have now used to close this question as a duplicate, includes answers that explain the matter very well. – Karl Knechtel Mar 01 '23 at 20:10

score -1 · Answer 2 · answered Apr 13 '12 at 09:12

Try this:

import re
ip="192.168.254.1234"
res = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",ip)
print(res)

Notice how I correctly escaped the dots. A match is found (192.168.254.123, without the trailing 4) because the regex doesn't care what comes after the last 1-3 digits.

Now:

ip="192.168.254.1234"
# Raw string: in a normal string literal "\b" is a backspace character, not a word boundary
res = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",ip)
print(res)

This returns an empty list, since the last 1-3 digits do not end at a word boundary: whichever of them the regex stops at, the next character is another digit.
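
To see the difference side by side, here is a short sketch with the same pattern, written as a raw string. The second address is an extra sample added only for contrast.

```python
import re

pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"

# "1234": after "1", "12" or "123" the next character is still a digit,
# so there is no word boundary where \d{1,3} could stop.
too_long = re.findall(pattern, "192.168.254.1234")

# "123" is followed by the end of the string, which counts as a boundary.
ok = re.findall(pattern, "192.168.254.123")

print(too_long)  # []
print(ok)        # ['192.168.254.123']
```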

HWende

Matching the dot was an editing mistake, please don't mind. I have corrected it now. – Rajeev Apr 13 '12 at 09:17
This answer doesn't address the revised question by OP, suggest you delete it. – smci Jun 09 '20 at 09:47
