I have a very long PDF (more than 1000 pages) that I am trying to parse in order to search it for a regular expression. If I search for a simple, common word, my code works, but with the regex I actually need, it does not work. I am fairly sure the regex is right, since I tested it on regex101. My guess is that some parts of the PDF are laid out in a specific way (like tables, but without table lines) and the parser cannot read those parts. Is there any solution to this problem?
Here is the code:
import PyPDF2
import re

# regex = re.compile(r"\[(\s)prima(?!\S)")

# File is located on Desktop
pdfFileObj = open('fel.pdf', 'rb')  # Opening the file
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
count = pdfReader.numPages

for i in range(count):
    page = pdfReader.getPage(i)
    for match in re.findall(r"\[(\s)prima(?!\S)", page.extractText()):
        print(match)
My regex means: find all the occurrences of "[ prima ".
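
One thing that may be worth checking independently of the PDF layout: when a pattern contains a capturing group, `re.findall` returns only the text of that group, not the whole match. With `(\s)` in the pattern, each printed "match" would be a single whitespace character, which can look like no output at all. The sketch below uses made-up sample strings (they are assumptions, not text from the actual PDF) and also tries a looser `\s*` variant, on the assumption that the extractor may drop or duplicate spaces.

```python
import re

# Hypothetical stand-ins for text extracted from a PDF page.
sample = "intro [ prima parte\nlater [\nprima\nend [ primavera"
sample_odd_spacing = "a [prima b [  prima c"

with_group = re.compile(r"\[(\s)prima(?!\S)")   # pattern from the question
no_group = re.compile(r"\[\sprima(?!\S)")       # same pattern, no capturing group
tolerant = re.compile(r"\[\s*prima(?!\S)")      # allows zero or more whitespace

# With a capturing group, findall yields only the captured whitespace.
group_only = with_group.findall(sample)

# Without a group, findall yields the full matched text.
full_matches = no_group.findall(sample)

# If extraction changed the spacing, the strict pattern finds nothing,
# while the tolerant one still matches.
strict_on_odd = no_group.findall(sample_odd_spacing)
tolerant_on_odd = tolerant.findall(sample_odd_spacing)

print(repr(group_only))
print(repr(full_matches))
print(repr(strict_on_odd))
print(repr(tolerant_on_odd))
```

If the full matches show up once the group is removed, the parser is probably reading those parts fine. If not, printing `repr(page.extractText())` for one known page should show what the text actually looks like after extraction.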