
I'm searching for keywords in a PDF file, so I'm trying to match /AA or /Acroform like this:

import re
l = "/Acroform "
s = "/Acroform is what I'm looking for"
if re.search (r"\b"+l.rstrip()+r"\b",s):
    print "yes"

Why don't I get "yes"? I want the "/" to be part of the keyword I'm looking for, if it exists. Can anyone help me with this?

Martijn Pieters
user3569815

1 Answer


\b only matches between a \w (word) and a \W (non-word) character, in either order, or next to a \w character at the edge of the string (start or end).
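To illustrate the rule, here is a small sketch (Python 3 syntax; the sample strings are made up for this illustration) that checks where \b can and cannot match around a leading slash:

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None

# "/" then "A" is a non-word/word pair, so \b matches right before "A".
print(matches(r"\bAcroform\b", "/Acroform is here"))    # True

# Start of string followed by "/" has no word character on either side,
# so \b cannot match in front of the slash.
print(matches(r"\b/Acroform\b", "/Acroform is here"))   # False

# With a word character before the slash, "x" then "/" is a boundary.
print(matches(r"\b/Acroform\b", "x/Acroform is here"))  # True
```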

Your string starts with a forward slash (/), a non-word character (\W), so \b will never match between the start of the string and the /. Don't use \b here; use an explicit negative look-behind for a word character instead:

re.search(r'(?<!\w){}\b'.format(re.escape(l)), s)

The (?<!...) syntax defines a negative look-behind; like \b it matches a position in the string. Here it'll only match if the preceding character (if there is any) is not a word character.

I used string formatting instead of concatenation here, and used re.escape() to make sure that any regular expression metacharacters in the string you are searching for are properly escaped.

Demo:

>>> import re
>>> l = "/Acroform "
>>> s = "/Acroform is what I'm looking for"
>>> if re.search(r'(?<!\w){}\b'.format(re.escape(l)), s):
...     print 'Found'
... 
Found
Martijn Pieters
  • it gives me "SyntaxError: invalid syntax" – user3569815 May 10 '14 at 16:28
  • @user3569815: are you certain you copied the text correctly? It works fine for me. – Martijn Pieters May 10 '14 at 16:30
  • Yes, it worked. But what if I have a group of keywords, some of which contain "/"? I tried to apply the code above to my own code, but it doesn't give me the right answer. – user3569815 May 10 '14 at 16:42
  • @user3569815: I have no idea what your group contains, but it'll normally work with and without slashes at the start. You could have a similar problem with the `\b` at the end; use `(?!\w)` instead to add a negative look-ahead. – Martijn Pieters May 10 '14 at 16:53
  • @Martijn Pieters: when I apply it to text like the example above it works fine, but with the PDF file it doesn't work. Could I send you my code and the file I'm testing it on, if possible? – user3569815 May 10 '14 at 20:13
  • PDF text is not always linear in the file; also see the proposed dupe target. – Martijn Pieters May 10 '14 at 23:19