
I want to extract a whole word from a sentence. Thanks to this answer,

import re

def findWholeWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

I can get whole words in cases like:

findWholeWord('thomas')('this is Thomas again')   # -> <match object>
findWholeWord('thomas')('this is,Thomas again')   # -> <match object>
findWholeWord('thomas')('this is,Thomas, again')  # -> <match object>
findWholeWord('thomas')('this is.Thomas, again')  # -> <match object>
findWholeWord('thomas')('this is ?Thomas again')  # -> <match object>

where punctuation next to the word doesn't get in the way.

However, if a digit is directly next to the word, the word is not found.

How should I modify the expression to match cases where there's a number next to the word? Like:

findWholeWord('thomas')('this is 9Thomas, again')
findWholeWord('thomas')('this is9Thomas again')
findWholeWord('thomas')('this is Thomas36 again')
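For anyone wondering why these cases fail: \b only matches between a word character (\w) and a non-word character, and digits count as word characters. A minimal sketch to confirm that behaviour (the sample list and variable names are just for illustration):

```python
import re

# \b matches only between a word character (\w) and a non-word character.
# Digits are word characters, so there is no boundary between '9' and 'T'.
pattern = re.compile(r'\bthomas\b', flags=re.IGNORECASE)

samples = [
    'this is Thomas again',
    'this is 9Thomas, again',
    'this is Thomas36 again',
]
results = {s: bool(pattern.search(s)) for s in samples}
for s, found in results.items():
    print(found, '>', s)

# '9' is matched by \w, which is why \b fails right next to it.
digit_is_word_char = re.fullmatch(r'\w', '9') is not None
print(digit_is_word_char)
```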
LRD

2 Answers


You can use the regex (?:\d|\b){0}(?:\d|\b) to match the target word with either a word boundary or a digit on each side of it.

import re

def findWholeWord(w):
    return re.compile(r'(?:\d|\b){0}(?:\d|\b)'.format(w), flags=re.IGNORECASE).search

for s in [
    'this is thomas',
    'this is Thomas again',
    'this is,Thomas again',
    'this is,Thomas, again',
    'this is.Thomas, again',
    'this is ?Thomas again',
    'this is 9Thomas, again',
    'this is9Thomas again',
    'this is Thomas36 again',
    'this is 1Thomas2 again',
    'this is -Thomas- again',
    'athomas is no match',
    'thomason no match']:
    print("match >" if findWholeWord('thomas')(s) else "*no match* >", s)

Output:

match > this is thomas
match > this is Thomas again
match > this is,Thomas again
match > this is,Thomas, again
match > this is.Thomas, again
match > this is ?Thomas again
match > this is 9Thomas, again
match > this is9Thomas again
match > this is Thomas36 again
match > this is 1Thomas2 again
match > this is -Thomas- again
*no match* > athomas is no match
*no match* > thomason no match

To reuse the same target word against multiple inputs or in a loop, assign the result of findWholeWord() to a variable and call that, so the pattern is compiled only once.

matcher = findWholeWord('thomas')
print(matcher('this is Thomas again'))
print(matcher('this is,Thomas again'))
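One thing to keep in mind: with (?:\d|\b) the adjacent digit becomes part of the match (e.g. '9Thomas'). If only the word itself is wanted, a possible variation is to use lookarounds that reject a neighbouring letter or underscore instead of consuming a digit. This is only a sketch; find_word_only is a hypothetical name, and the re.escape() call is an extra precaution that is not in the code above.

```python
import re

def find_word_only(w):
    # Hypothetical helper, not part of the original answer.
    # [^\W\d] matches a word character that is not a digit (a letter or '_'),
    # so the lookarounds fail only when the word touches a letter or '_'.
    # re.escape() keeps characters such as '.' or '+' in w literal.
    pattern = r'(?<![^\W\d]){0}(?![^\W\d])'.format(re.escape(w))
    return re.compile(pattern, flags=re.IGNORECASE).search

matcher = find_word_only('thomas')
m = matcher('this is 9Thomas36 again')
print(m.group(0) if m else None)
```

Because the lookarounds are zero-width, m.group(0) contains the word without the surrounding digits.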
CodeMonkey
  • This would work, but it would also match "this is Thomas36b again". A minor change avoids that: re.compile(r'(?:\b\d+|\b){0}(?:\d+\b|\b)'.format(w), flags=re.I).search – omuthu Sep 19 '22 at 17:00
  • @omuthu Good point, the original poster needs to review various edge cases and refine the criteria for what is a match and what is not. – CodeMonkey Sep 19 '22 at 17:07
  • Thanks @CodeMonkey! exactly what I was looking for. In my problem the case @omuthu indicates doesn't cause (a priori) any problem, but it's a very good point to have in mind too! – LRD Sep 19 '22 at 17:23
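To make the edge case from the comments concrete, here is a small side-by-side sketch of the answer's pattern and the stricter one suggested by omuthu. The names loose and strict and the third sample string are only for illustration.

```python
import re

word = 'thomas'
# Pattern from the answer: a single digit or a word boundary on each side.
loose = re.compile(r'(?:\d|\b){0}(?:\d|\b)'.format(word), flags=re.I)
# Pattern from the comment: an adjacent run of digits must itself
# start (on the left) or end (on the right) at a word boundary.
strict = re.compile(r'(?:\b\d+|\b){0}(?:\d+\b|\b)'.format(word), flags=re.I)

samples = [
    'this is Thomas36 again',
    'this is Thomas36b again',
    'this is a9Thomas again',
]
for s in samples:
    print(bool(loose.search(s)), bool(strict.search(s)), '>', s)
```

Which of the two is right depends on whether something like "Thomas36b" should count as containing the word.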
1

You may use this code:

import re

def findWholeWord(w):
    return re.compile(r'(?:\d+{0}|{0}\d+|\b{0}\b)'.format(w), flags=re.I).search


print ( findWholeWord('thomas')('this is 9Thomas, again') )
print ( findWholeWord('thomas')('this is9Thomas again') )
print ( findWholeWord('thomas')('this is Thomas36 again') )
print ( findWholeWord('thomas')('this is Thomas again') )
print ( findWholeWord('thomas')('this is,Thomas again') )
print ( findWholeWord('thomas')('this is,Thomas, again') )
print ( findWholeWord('thomas')('this is.Thomas, again') )
print ( findWholeWord('thomas')('this is ?Thomas again') )
print ( findWholeWord('thomas')('this is aThomas again') )

Output:

<re.Match object; span=(8, 15), match='9Thomas'>
<re.Match object; span=(7, 14), match='9Thomas'>
<re.Match object; span=(8, 16), match='Thomas36'>
<re.Match object; span=(8, 14), match='Thomas'>
<re.Match object; span=(8, 14), match='Thomas'>
<re.Match object; span=(8, 14), match='Thomas'>
<re.Match object; span=(8, 14), match='Thomas'>
<re.Match object; span=(9, 15), match='Thomas'>
None

(?:\d+{0}|{0}\d+|\b{0}\b) matches the given word when it is preceded by one or more digits, followed by one or more digits, or standing alone as a complete word.
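Both this pattern and the (?:\d|\b){0}(?:\d|\b) pattern from the other answer find the same inputs from the question, but the text they return can differ when digits are involved. A short sketch of that difference (variable names are made up for the example):

```python
import re

word = 'thomas'
# Pattern from this answer: digits before, digits after, or a whole word.
alternation = re.compile(r'(?:\d+{0}|{0}\d+|\b{0}\b)'.format(word), flags=re.I)
# Pattern from the other answer: one digit or a boundary on each side.
boundary_or_digit = re.compile(r'(?:\d|\b){0}(?:\d|\b)'.format(word), flags=re.I)

matched = {}
for s in ['this is 1Thomas2 again', 'this is 12Thomas again']:
    a = alternation.search(s)
    b = boundary_or_digit.search(s)
    matched[s] = (a.group(0), b.group(0))
    print(a.group(0), '|', b.group(0))
```

So if the matched text itself matters, and not just whether a match exists, it is worth checking which digits each pattern includes.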

anubhava
  • Thanks @anubhava. This solution seems to work fine too. I don't know which is the main difference with the accepted answer, but seems both do the same (at least they both do what I need). – LRD Sep 19 '22 at 17:33
  • Effectively it is the same approach. CodeMonkey optimized this regex further – anubhava Sep 19 '22 at 17:34