4

I am trying to search for exact words in a file. I read the file by lines and loop through the lines to find the exact words. As the in keyword is not suitable for finding exact words, I am using a regex pattern.

import re

def findWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

The problem with this function is that it doesn't recognize square brackets such as [xyz].

For example

findWord('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]') 

returns None whereas

findWord('data_var_cod')('Cod_Byte1 = DATA_VAR_COD') 

returns <_sre.SRE_Match object at 0x0000000015622288>

Can anybody please help me to tweak the regex pattern?

ucMedia
BitsNPieces

4 Answers

2

This happens because the regex engine treats square brackets as a character class, which is special regex syntax. To get around this, you need to escape the special characters in your keyword. You can use the re.escape function:

def findWord(w):
    return re.compile(r'\b({0})\b'.format(re.escape(w)), flags=re.IGNORECASE).search

Also, as a more Pythonic way to get all matches, you can use re.findall(), which returns a list of matches, or re.finditer(), which returns an iterator of match objects.

However, this is still not a complete solution: \b matches only between a word character and a non-word character, so it fails when the keyword itself starts or ends with a non-word character such as [ or ].

>>> ss = 'hello string [processing] in python.'
>>> re.compile(r'\b({0})\b'.format(re.escape('[processing]')), flags=re.IGNORECASE).search(ss)
>>>
>>> re.compile(r'({})'.format(re.escape('[processing]')), flags=re.IGNORECASE).search(ss).group(0)
'[processing]'

So I suggest removing the word boundaries if your keywords contain non-word characters.

As a more general approach, you can use the following regex, which combines a non-capturing group with a positive lookahead. It matches keywords that are preceded by a space or the start of the string, and followed by a space, a dot, or the end of the string. The keyword itself is in group 1:

r'(?: |^)({})(?=[. ]|$)'
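A minimal sketch of how that pattern might be wired into the asker's helper. The function name find_spaced_word is hypothetical, the keyword is assumed to be passed through re.escape, and the pattern is written with no trailing space inside the quotes. The keyword is read from group(1) because the leading space, when present, is part of the overall match.

```python
import re

def find_spaced_word(w):
    # Hypothetical helper: the keyword must be preceded by a space or the
    # start of the string, and followed by a space, a dot, or the end.
    pattern = r'(?: |^)({0})(?=[. ]|$)'.format(re.escape(w))
    return re.compile(pattern, flags=re.IGNORECASE).search

m = find_spaced_word('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]')
print(m.group(1))  # DATA_VAR_COD[0]

# No space before the keyword, so it is not treated as a separate word.
print(find_spaced_word('data_var_cod[0]')('Cod_Byte1=DATA_VAR_COD[0]'))  # None
```

Note that this only recognizes a literal space, a dot, and the string edges as delimiters; tabs or other punctuation would need to be added to the groups.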
Wai Ha Lee
Mazdak
  • Hello it still returns None for: findWord('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]') – BitsNPieces Jul 21 '15 at 07:31
  • @BitsNPieces Hi ;) did you removed the word boundaries? – Mazdak Jul 21 '15 at 07:33
  • Yes It works after removing the boundaries! Thanks a lot :) – BitsNPieces Jul 21 '15 at 07:42
  • Hi there is a slight problem after removing the boundaries. Now it matches sequence of char and not exact words. For example it returns true for findWord('data_var_cod[0]')('Cod_Byte1=DATA_VAR_COD[0]') where as the intended behavior should return None as DATA_VAR_COD[0] is not a separate word! – BitsNPieces Jul 21 '15 at 08:19
  • @BitsNPieces Yep, put space in your regex! check the edit! – Mazdak Jul 21 '15 at 08:29
  • Hi the look around is raising an error raise error, v # invalid expression error: look-behind requires fixed-width pattern – BitsNPieces Jul 21 '15 at 08:40
  • @BitsNPieces Put the look behind in a none capture group and get the first group (`group(1)`) as the result. – Mazdak Jul 21 '15 at 10:27
1

That's because [ and ] have a special meaning in a regex. You should escape the string you're looking for:

re.escape(regex)

Will escape the regex for you. Change your code to:

return re.compile(r'\b({0})\b'.format(re.escape(w)), flags=re.IGNORECASE).search
                                      ↑↑↑↑↑↑↑↑↑

You can see what re.escape does for your string, for example:

>>> w = '[xyz]'
>>> print(re.escape(w))
\[xyz\]
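To make the difference concrete, here is a small sketch comparing the unescaped and escaped patterns on the asker's keyword. The sample strings are made up for illustration. It also shows that with the trailing \b kept, the escaped pattern still fails on a keyword ending in ], because there is no word boundary between ] and the end of the string.

```python
import re

keyword = 'data_var_cod[0]'

# Unescaped: [0] is a character class that matches the single character '0'.
unescaped = re.compile(r'\b({0})\b'.format(keyword), flags=re.IGNORECASE)
print(unescaped.search('x = DATA_VAR_COD0') is not None)  # True
print(unescaped.search('x = DATA_VAR_COD[0]'))            # None

# Escaped, boundaries kept: the brackets are literal, but the trailing \b
# cannot match after ']' at the end of the string.
bounded = re.compile(r'\b({0})\b'.format(re.escape(keyword)), flags=re.IGNORECASE)
print(bounded.search('x = DATA_VAR_COD[0]'))              # None

# Escaped, no boundaries: plain literal search.
literal = re.compile(r'({0})'.format(re.escape(keyword)), flags=re.IGNORECASE)
print(literal.search('x = DATA_VAR_COD[0]').group(1))     # DATA_VAR_COD[0]
```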
Maroun
0

You need a "smart" way of building the regex:

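import re

def findWord(w):
    # Escape the keyword so characters such as [ and ] are matched literally.
    p = re.escape(w)
    # Word characters on both ends: boundaries on both sides.
    if re.match(r'\w', w) and re.search(r'\w$', w):
        return re.compile(r'\b{0}\b'.format(p), flags=re.IGNORECASE).search
    # Non-word characters on both ends: no boundaries.
    if not re.match(r'\w', w) and not re.search(r'\w$', w):
        return re.compile(r'{0}'.format(p), flags=re.IGNORECASE).search
    # Word character at the end only: trailing boundary.
    if not re.match(r'\w', w) and re.search(r'\w$', w):
        return re.compile(r'{0}\b'.format(p), flags=re.IGNORECASE).search
    # Word character at the start only: leading boundary.
    if re.match(r'\w', w) and not re.search(r'\w$', w):
        return re.compile(r'\b{0}'.format(p), flags=re.IGNORECASE).search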

The problem is that some of your keywords will have word characters at the start only, others at the end only, most will have word characters on both ends, and some will have non-word characters on both ends. To check the word boundary correctly, you need to know whether a word character is present at the start and at the end of the keyword.

Thus, with re.match(r'\w', w) we can check whether the keyword starts with a word character and, if it does, add \b to the start of the pattern. Likewise, with re.search(r'\w$', w) we can check whether the keyword ends with a word character and add \b to the end.
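The rule described above can also be written compactly by building the prefix and suffix separately. This is only a sketch: the name find_word_adaptive is hypothetical, and it assumes the keyword should be matched literally, so it is passed through re.escape.

```python
import re

def find_word_adaptive(w):
    # Add \b only on a side where the keyword has a word character.
    prefix = r'\b' if re.match(r'\w', w) else ''
    suffix = r'\b' if re.search(r'\w$', w) else ''
    pattern = prefix + re.escape(w) + suffix
    return re.compile(pattern, flags=re.IGNORECASE).search

print(find_word_adaptive('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]') is not None)  # True
print(find_word_adaptive('data_var_cod')('Cod_Byte1 = DATA_VAR_CODE'))                   # None
print(find_word_adaptive('[processing]')('hello string [processing] in python.') is not None)  # True
```

One caveat: \b also matches between = and D, so this version finds the keyword in 'Cod_Byte1=DATA_VAR_COD[0]' as well. If the keyword must be delimited by whitespace, a whitespace-based check is needed instead of \b.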

In case you have multiple keywords to check a string against you can check this post of mine.

Community
Wiktor Stribiżew
0

You can use a \ before [ or ].

For instance, to find 'abc[12]' in 'xyzabc[12]def', one can use

match_pattern = r'abc\[12\]'
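As a quick check, the hand-escaped pattern can be run with re.search. The sketch below assumes the pattern is written as a raw string so that the backslashes reach the regex engine unchanged, and it compares the result with what re.escape produces for the same text.

```python
import re

# Raw string keeps the backslashes intact for the regex engine.
match_pattern = r'abc\[12\]'

m = re.search(match_pattern, 'xyzabc[12]def')
print(m.group(0))  # abc[12]

# For this keyword, escaping by hand gives the same pattern as re.escape.
print(match_pattern == re.escape('abc[12]'))  # True
```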
Wai Ha Lee