Using Python to search for keywords in a PDF

Question

I'm searching for keywords in a PDF file, so I'm trying to match /AA or /Acroform like this:

import re
l = "/Acroform "
s = "/Acroform is what I'm looking for"
if re.search (r"\b"+l.rstrip()+r"\b",s):
    print("yes")

Why don't I get "yes"? I want the "/" to be part of the keyword I'm looking for, if it exists. Can anyone help me with this?

Martijn Pieters · Accepted Answer · 2014-05-10T16:33:36.363

\b only matches in between a \w (word) and a \W (non-word) character, or vice versa, or when a \w character is at the edge of a string (start or end).
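
The rule above can be checked with a short sketch. The sample strings and variable names are made up for illustration and are not part of the original question.

```python
import re

# \b needs a word character on exactly one side of the position.
# At the start of the string, right before "/", there is no word
# character on either side, so \b cannot match there.
at_start = re.search(r"\b/Acroform\b", "/Acroform is here")
print(at_start)  # None

# Preceded by a word character, the same pattern matches, because
# \b now sits between "x" (\w) and "/" (\W).
after_word = re.search(r"\b/Acroform\b", "x/Acroform is here")
print(after_word is not None)  # True

# Without the leading slash, \b matches at the start of the string,
# since "A" is a word character at the edge.
no_slash = re.search(r"\bAcroform\b", "Acroform is here")
print(no_slash is not None)  # True
```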

Your string starts with a forward slash (/), which is a non-word character (\W), so \b will never match between the start of the string and the /. Don't use \b here; use an explicit negative look-behind for a word character instead:

re.search(r'(?<!\w){}\b'.format(re.escape(l)), s)

The (?<!...) syntax defines a negative look-behind; like \b it matches a position in the string. Here it'll only match if the preceding character (if there is any) is not a word character.

I used string formatting instead of concatenation here, and used re.escape() to make sure that any regular expression meta characters in the string you are searching for are properly escaped.

Demo:

>>> import re
>>> l = "/Acroform "
>>> s = "/Acroform is what I'm looking for"
>>> if re.search(r'(?<!\w){}\b'.format(re.escape(l)), s):
...     print('Found')
... 
Found
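
For a group of keywords, the same idea could be wrapped in a small helper. This is only a sketch: `keyword_pattern` is a hypothetical name, the keyword list and sample text are invented, and it assumes stray whitespace around a keyword (such as the trailing space in "/Acroform ") is not meant to be matched, so it is stripped first. It also swaps the trailing \b for a negative look-ahead, `(?!\w)`, so that keywords ending in a non-word character behave the same way.

```python
import re

def keyword_pattern(keyword):
    """Compile a pattern matching `keyword` as a whole token (hypothetical helper)."""
    # Assumption: surrounding whitespace is not part of the keyword.
    escaped = re.escape(keyword.strip())
    # The lookarounds only require that no word character touches the
    # keyword on either side; unlike \b they also work when the keyword
    # itself starts or ends with a non-word character such as "/".
    return re.compile(r"(?<!\w)" + escaped + r"(?!\w)")

keywords = ["/AA", "/Acroform ", "/JS"]
text = "/Acroform and /AA are present, /AAx and /JavaScript are not matches"

found = [kw.strip() for kw in keywords if keyword_pattern(kw).search(text)]
print(found)  # ['/AA', '/Acroform']
```

Compiling one pattern per keyword keeps the sketch easy to read; for a long keyword list, joining the escaped keywords with `|` into a single pattern would be another option.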

edited May 10 '14 at 16:33

answered May 10 '14 at 16:23


it gives me "SyntaxError: invalid syntax" – user3569815 May 10 '14 at 16:28
@user3569815: are you certain you copied the text correctly? It works fine for me. – Martijn Pieters May 10 '14 at 16:30
Yes, it worked. But what if I have a group of keywords, some of which contain "/"? I tried to apply the code above to my own code, but it doesn't give me the right answer. – user3569815 May 10 '14 at 16:42
@user3569815: I have no idea what your group contains, but it'll normally work with and without slashes at the start. You could have a similar problem with the `\b` at the end; use `(?!\w)` instead to add a negative look-ahead. – Martijn Pieters May 10 '14 at 16:53
@Martijn Pieters: when I apply it to a text like the example above, it works fine, but with the PDF file it doesn't work. Could I send you my code and the file I'm testing it on, if possible? – user3569815 May 10 '14 at 20:13
PDF text is not always linear in the file; also see the proposed dupe target. – Martijn Pieters May 10 '14 at 23:19
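
One possible way to approach the PDF case: keys such as /AA or /AcroForm are PDF dictionary names rather than page text, so a sketch could scan the raw bytes of the file instead of extracted text. This assumes the relevant dictionaries are stored uncompressed, which is not guaranteed, since they may sit inside compressed object streams. The helper name `find_pdf_names` and the sample bytes are hypothetical. PDF names are case-sensitive and the key is spelled /AcroForm in the PDF format, so the sketch uses re.IGNORECASE to also accept the /Acroform spelling from the question.

```python
import re

def find_pdf_names(data, names):
    """Return the entries of `names` that occur in raw PDF bytes (hypothetical helper)."""
    hits = []
    for name in names:
        # Bytes pattern with the same word-character lookarounds as in
        # the answer; re.escape accepts bytes as well as str.
        pattern = re.compile(
            b"(?<!\\w)" + re.escape(name.encode("ascii")) + b"(?!\\w)",
            re.IGNORECASE,
        )
        if pattern.search(data):
            hits.append(name)
    return hits

# Stand-in for the real file contents, which would be read in binary
# mode, e.g. open(path, "rb").read() with a path of your own.
sample = b"1 0 obj\n<< /Type /Catalog /AcroForm 5 0 R >>\nendobj\n"

print(find_pdf_names(sample, ["/AA", "/Acroform"]))  # ['/Acroform']
```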
