I'm trying to search a PDF using PyPDF2 and return the page number on which the search term was found, using re.search. However, when the term contains a hyphen, it doesn't work. For example, a search for "abc-123" returns nothing. The code below works for a search of "123" or "abc" but will not find "abc-123". The code is from this thread.

import re
import PyPDF2

String = 'abc-123'

# Open the PDF file; the with block closes it once the loop has finished
with open('example.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    NumPages = pdfReader.getNumPages()

    # Extract the text of each page and search it
    for i in range(0, NumPages):
        PageObj = pdfReader.getPage(i)
        Text = PageObj.extractText()
        if re.search(String, Text):
            print("Pattern Found on Page: " + str(i))

Appreciate any help. Thanks in advance!

drbenno
  • What is in Text when you're expecting a match? If you test this in the Python REPL you'll see it works: `re.search('abc-123', 'cat+abc-123+dog')` – jarmod Jul 25 '21 at 15:17
  • Possibly the '-' in the PDF is a typographic hyphen, not ASCII code 45 (-). Try searching for "abc.123" – CodeMonkey Jul 25 '21 at 15:40
  • jarmod - the Text is the text from a page of the PDF document (it's a mostly text (ASCII) document). For example, the document has a line that is "ABC-02177 and ABC-01893", and when I search for "ABC-01893" it doesn't return a hit. JasonM1 - If I search for "abc-123" using any PDF viewer (Sumatra, Acrobat) it finds it, but it doesn't with the above code. – drbenno Jul 25 '21 at 16:41
  • Hey JasonM1, I think it may be the hyphen, but "abc.123" doesn't work either. How do I use re.DOTALL in the search line? – drbenno Jul 25 '21 at 20:53
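
The hyphen theory from the comments can be tested with a small sketch like the one below. It assumes the extracted text contains a Unicode dash look-alike instead of the ASCII hyphen-minus; the helper names `normalize_dashes` and `page_contains`, the list of dash characters, and the sample string are illustrative assumptions, not part of the original code. Printing `repr()` of the text shows which character is really there, and the last line shows where a flag such as re.DOTALL goes in the re.search call.

```python
import re

# Characters that often look like "-" but are not ASCII hyphen-minus (U+002D).
# This list is an illustrative assumption, not an exhaustive one.
DASH_VARIANTS = "\u2010\u2011\u2012\u2013\u2014\u2212"
SOFT_HYPHEN = "\u00ad"


def normalize_dashes(text):
    """Map dash look-alikes to '-' and drop soft hyphens."""
    table = {ord(ch): "-" for ch in DASH_VARIANTS}
    table[ord(SOFT_HYPHEN)] = None  # None deletes the character
    return text.translate(table)


def page_contains(term, page_text):
    """Search for the term literally, after normalizing dashes on both sides."""
    pattern = re.escape(normalize_dashes(term))
    return re.search(pattern, normalize_dashes(page_text)) is not None


# Made-up sample: U+2010 (hyphen) and U+2013 (en dash) instead of "-"
sample = "ABC\u201002177 and ABC\u201301893"

print(repr(sample))                         # reveals the non-ASCII dashes
print(page_contains("ABC-01893", sample))   # matches once dashes are normalized

# Flags are passed as the third argument; DOTALL lets "." match a newline too
print(re.search("ABC.01893", "ABC\n01893", re.DOTALL) is not None)
```

If `repr(Text)` shows a plain `-` after all, the cause is likely elsewhere, for example extra whitespace or line breaks inserted by the text extraction.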

1 Answer

re.search looks for a pattern in a single string and stops at the first match. It does search past newlines within that string, but if the document is returned as a collection of strings, each one has to be searched separately. Try findall instead and then take the first match.

...
matches = re.findall(String,Text)
if len(matches) > 0:
    print('Found a match ...')
else:
    print('No match found.')
...
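
A quick way to check how re.search and re.findall treat text that contains newlines is a sketch like the following. The sample text is made up for illustration, and the `ABC-\d+` pattern is an assumption about what the identifiers look like.

```python
import re

# Made-up page text spanning several lines
text = "first line\nABC-02177 and ABC-01893\nlast line"

# re.search scans the whole string and returns the first match (or None)
match = re.search("ABC-01893", text)
first_hit = match.group(0) if match else None

# re.findall returns every non-overlapping match as a list of strings
all_hits = re.findall(r"ABC-\d+", text)

print(first_hit)   # ABC-01893
print(all_hits)    # ['ABC-02177', 'ABC-01893']
```

Both calls find matches on the second line, so a missing hit is more likely caused by the characters in the extracted text than by the choice between the two functions.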
MichaelD