3

I used the following function to find the exact match for words in a string.

import re

def exact_Match(str1, word):
    # \b is a word boundary; the pattern is built as \bword\b
    result = re.findall('\\b'+word+'\\b', str1, flags=re.IGNORECASE)
    return len(result) > 0

exact_Match(str1, word)

But I get a match for both "award" and "award-winning" when only "award-winning" should match in the following string.

str1 = "award-winning blueberries"
word1 = "award"
word2 = "award-winning"

How can I make re.findall match whole words that contain hyphens and other punctuation?

user1251007
lost9123193

2 Answers

7

Make your own word-boundary:

import re

def exact_Match(phrase, word):
    # Treat whitespace or either end of the string as the boundary.
    b = r'(\s|^|$)'
    res = re.match(b + word + b, phrase, flags=re.IGNORECASE)
    return bool(res)

Copied from here and pasted into my interpreter:

>>> str1 = "award-winning blueberries"
>>> word1 = "award"
>>> word2 = "award-winning"
>>> exact_Match(str1, word1)
False
>>> exact_Match(str1, word2)
True

Actually, the cast to bool is unnecessary: a match object is truthy and None is falsy, so the return value can be tested directly. The function is better off without it:

def exact_Match(phrase, word):
    b = r'(\s|^|$)' 
    return re.match(b + word + b, phrase, flags=re.IGNORECASE)
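As a small usage sketch (the calls and printed strings are only illustrative), the returned value can go straight into a condition, since re.match returns either a match object or None:

```python
import re

def exact_Match(phrase, word):
    b = r'(\s|^|$)'
    return re.match(b + word + b, phrase, flags=re.IGNORECASE)

str1 = "award-winning blueberries"

# A match object is truthy, so no bool() cast is needed.
if exact_Match(str1, "award-winning"):
    print("found")

# None is falsy, so a failed match takes the other branch.
if not exact_Match(str1, "award"):
    print("not found")
```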

Note: exact_Match is unconventional casing for a Python function; just call it exact_match.
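If the word may appear anywhere in the phrase, one possible variant (an assumption on my part, not what was tested above) swaps re.match for re.search, escapes the word with re.escape, and uses the lookarounds (?<!\S) and (?!\S) so that the surrounding whitespace is not consumed. The name exact_match_anywhere is made up for this sketch:

```python
import re

def exact_match_anywhere(phrase, word):
    # Hypothetical helper, not from the original answer.
    # (?<!\S): not preceded by a non-space character (also true at the start).
    # (?!\S):  not followed by a non-space character (also true at the end).
    # re.escape keeps characters such as "." or "+" in word literal.
    pattern = r'(?<!\S)' + re.escape(word) + r'(?!\S)'
    return re.search(pattern, phrase, flags=re.IGNORECASE) is not None

str1 = "award-winning blueberries"
print(exact_match_anywhere(str1, "award"))          # False
print(exact_match_anywhere(str1, "award-winning"))  # True
print(exact_match_anywhere(str1, "blueberries"))    # True
```

Note that this treats only whitespace as a delimiter, so a word followed directly by a comma or period would not count as a match.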

Elazar
  • thanks for the comment. However, it doesn't seem to work. I put the code in and it's returning None for all cases. – lost9123193 May 27 '13 at 03:58
  • @lost9123193 you probably did not copy the code, or made some changes. It works for me, and I have copied it from here. – Elazar May 27 '13 at 10:11
2

The problem with your initial method is that \b is not the zero-width assertion you are looking for: it treats the hyphen as a word boundary. (As an aside, I would write r'\b' rather than '\\b'. The two are equivalent, but doubled backslashes can become a real hassle in regular expressions - see this link)

From Regular Expression HOWTO

\b

Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.

Because - is a non-alphanumeric character, your findall regular expression will find award in award-winning but not in awards.
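A quick check of that behavior, as a minimal sketch (the sample strings are only illustrative):

```python
import re

pattern = r'\baward\b'

# "-" is not a word character, so there is a boundary between "d" and "-".
print(re.findall(pattern, "award-winning blueberries"))  # ['award']

# "s" is a word character, so there is no boundary right after "award".
print(re.findall(pattern, "awards for blueberries"))     # []
```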

Depending on the phrase you are searching, I would also consider using re.findall rather than the re.match that Elazar suggests. In your example re.match works, but re.match only tries to match at the beginning of the string, so if the word you are looking for appears anywhere later in the string, re.match will not succeed.
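To illustrate the difference, here is a small sketch that reuses the boundary idea from Elazar's answer with a word that sits at the end of the string. Making the group non-capturing with (?:...) is my own change, so that re.findall returns the whole match rather than the group contents:

```python
import re

b = r'(?:\s|^|$)'  # same boundary as in the other answer, but non-capturing
phrase = "award-winning blueberries"
pattern = b + "blueberries" + b

# re.match only tries position 0, so a word later in the string is missed.
print(re.match(pattern, phrase))          # None

# re.search and re.findall scan the whole string.
print(bool(re.search(pattern, phrase)))   # True
print(re.findall(pattern, phrase))        # [' blueberries']
```

The leading space in the findall result is the whitespace consumed by the boundary group.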

Gronk