Python Regex Word Boundaries not working as expected

Question

Why isn't the word boundary working?

From reading this site, I understand that a word boundary works like this:

There are three different positions that qualify as word boundaries:

Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.

The string a below appears to fit at least one of the positions listed above.

import re
a = 'Builders Club The Ohio State'
re.sub('\bThe\b', '', a, flags=re.IGNORECASE)

Output (the 'The' is not removed):

'Builders Club The Ohio State'

Why isn't the word boundary working?

When I put spaces before and after 'The' in the pattern, the regex appears to work.

a = 'Builders Club The Ohio State'
re.sub(' The ', ' ', a, flags=re.IGNORECASE)

output:

'Builders Club Ohio State'

score 25 · Accepted Answer · 2014-07-15T18:01:54.473


You need to use a raw string for your regex pattern (a raw string does not process escape sequences):

>>> import re
>>> a = 'Builders Club The Ohio State'
>>> re.sub(r'\bThe\b', '', a, flags=re.IGNORECASE)
'Builders Club  Ohio State'
>>>

Otherwise, \b will be interpreted as a backspace character:

>>> print('x\by')
y
>>> print(r'x\by')
x\by
>>>
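The difference can also be checked without relying on how a terminal renders the backspace. The following is a minimal sketch (the variable names are only illustrative) that compares the two literals and runs the substitution with a non-raw, a raw, and a double-backslash pattern:

```python
import re

# In a normal string literal, '\b' is one character: backspace (0x08).
print(repr('\b'), len('\b'))

# A raw string keeps the backslash, so the regex engine receives
# the two characters backslash + 'b', which it reads as a word boundary.
print(repr(r'\b'), len(r'\b'))

a = 'Builders Club The Ohio State'

# Non-raw pattern: searches for backspace + 'The' + backspace, which is not in a.
unchanged = re.sub('\bThe\b', '', a, flags=re.IGNORECASE)

# Raw pattern and double-backslash pattern are the same string, so both match.
raw_result = re.sub(r'\bThe\b', '', a, flags=re.IGNORECASE)
escaped_result = re.sub('\\bThe\\b', '', a, flags=re.IGNORECASE)

print(repr(unchanged))
print(repr(raw_result))
print(raw_result == escaped_result)
```

The raw string is usually the more readable of the two working forms, since the pattern looks the same in the source as it does to the regex engine.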

edited Jul 15 '14 at 18:01

answered Jul 15 '14 at 17:56


To elaborate: The backslash is an escape character in normal strings and thus `\b` becomes just [a backspace character](https://docs.python.org/2.0/ref/strings.html). So either you need to use `\\b` or a raw string literal. – Joey Jul 15 '14 at 17:58
Generally whenever using regex it's a good idea to use a raw string. – RevanProdigalKnight Jul 15 '14 at 17:58
Ah, I see. If I use the r prefix, will it mess up other characters, like ^ and $? – user3314418 Jul 15 '14 at 17:58

@user3314418 No, it only affects the number of backslashes you need to use (Hint: you don't need as many with a raw string) – RevanProdigalKnight Jul 15 '14 at 17:59
I got the point. I tried the regex `r'\b\[details\]\b'` to remove **[details]** from my text, but the word boundary didn't work. It worked without `\b`, since I don't have any text that contains **[details]** as a substring. Even though I have a solution for my data, it doesn't feel general. Any suggestions about what is going on in my code? – akalanka Aug 15 '19 at 17:29
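On the `[details]` case in the last comment: `\b` only matches where a word character sits on exactly one side. Since `[` and `]` are not word characters, `\b\[` can match only when a word character comes immediately before the bracket, which is the opposite of what was intended. One possible workaround, shown here as a sketch with made-up sample text, is to use lookarounds that require no word character next to the brackets:

```python
import re

# Hypothetical sample text: one standalone [details], one glued to letters.
text = 'intro [details] outro x[details]y'

# With \b, only the occurrence touching word characters matches,
# because a boundary needs a word character on one side of '[' and ']'.
with_boundaries = re.findall(r'\b\[details\]\b', text)

# Lookarounds: match only when no word character touches the brackets.
standalone = re.compile(r'(?<!\w)\[details\](?!\w)')
cleaned = standalone.sub('', text)

print(with_boundaries)
print(repr(cleaned))
```

Whether the glued occurrence should also be removed depends on the data; if it should, the plain pattern `\[details\]` without any boundary is enough.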

score 2 · Answer 2 · answered Jul 15 '14 at 17:57


Try this one

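import re

# Raw string, so \b reaches the regex engine as a word boundary.
# In Python 3 the u prefix is unnecessary and ur'' is a syntax error.
p = re.compile(r'\bThe\b', re.IGNORECASE)
test_str = "Builders Club The Ohio State"
subst = ""

result = re.sub(p, subst, test_str)
print(result)

output:

Builders Club  Ohio State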

Here is DEMO
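Removing only the word leaves both surrounding spaces in place, which is why the accepted answer's output contains two spaces between 'Club' and 'Ohio'. If a single space is wanted, one option is to let the pattern consume the whitespace that follows the word. This is a sketch under that assumption, not the only way to do it:

```python
import re

a = 'Builders Club The Ohio State'

# \s* also consumes the whitespace after the word, so no double space remains.
pattern = re.compile(r'\bThe\b\s*', re.IGNORECASE)

result = pattern.sub('', a)
print(repr(result))

# The boundaries still protect longer words such as 'Theater'.
print(repr(pattern.sub('', 'The Theater Club')))
```

If the word can appear at the very end of the string, a trailing space is left behind; calling `.strip()` on the result handles that case.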


Braj
