How to properly search a regular expression through Python in a very long PDF that is also formatted?

Question

I have a very long PDF (more than 1000 pages) that I'm trying to parse so I can search it with a regular expression. If I search for a simple, common word, my code works, but with the regex I actually need, it finds nothing. I'm pretty sure my regex is right; I tested it on regex101. My guess is that some parts of the PDF are laid out in a specific way (something like tables, but without table lines) and the parser can't read those parts. Is there any solution to this problem?

Here is the code:

import PyPDF2
import re
#regex = re.compile(r"\[(\s)prima(?!\S)")
#File is located on Desktop
pdfFileObj=open('fel.pdf','rb')          #Opening the File
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    for match in re.findall(r"\[(\s)prima(?!\S)", page.extractText()):
        print(match)

My regex means: find all the occurrences of "[ prima ".

findall returns the values of the capturing groups. You can omit the group around `(\s)` like `\[\sprima(?!\S)` — The fourth bird, Mar 23 '20 at 14:24
Yes, by using `\s` you want to match a whitespace char. A capturing group is used if, for example, you want to get/process the value (which in this case is a whitespace char) — The fourth bird, Mar 23 '20 at 14:27
I think you could change it to `re.findall(r"\[\sprima(?!\S)"` https://regex101.com/r/Avc2l7/1/ — The fourth bird, Mar 23 '20 at 14:30
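To make the point about capturing groups concrete, here is a minimal sketch using only the standard `re` module. The sample string is made up for illustration and is not taken from the PDF in question.

```python
import re

# Hypothetical sample text standing in for one page of extracted text.
sample = "see [ prima and again [ prima here"

# With a capturing group, findall returns only what the group captured
# (here, the single whitespace character), not the whole match.
with_group = re.findall(r"\[(\s)prima(?!\S)", sample)

# Without the group, findall returns the full matched text.
without_group = re.findall(r"\[\sprima(?!\S)", sample)

print(with_group)     # [' ', ' ']
print(without_group)  # ['[ prima', '[ prima']
```

So the original loop was printing whitespace characters rather than the matched text, which can look like "no output" in a console.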
thank you, done! but it still doesn't find anything (I'm sure the combination exists in the PDF, many times) — Anna, Mar 23 '20 at 14:33
The pattern `\[\sprima(?!\S)` matches `[`, then a mandatory whitespace char followed by `prima`, and asserts a whitespace boundary to the right. Perhaps matching 0+ whitespace chars will help: `\[\s*prima(?!\S)` https://regex101.com/r/SG5WQY/1 If you print `page.extractText()`, is your expected text there? — The fourth bird, Mar 23 '20 at 14:39
it does! I'm noticing now though that the parser doesn't read all spaces properly and merges words! that's why it's not reading the regex — Anna, Mar 23 '20 at 14:44
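If the extractor drops spaces and merges words, both the mandatory `\s` and the `(?!\S)` boundary can stop the pattern from matching. One possible workaround, sketched below under the assumption that a looser match is acceptable, is to make the whitespace optional and drop the right-hand boundary. The two sample strings are hypothetical examples of extractor output, not text from the actual PDF.

```python
import re

# Hypothetical examples of how the same text might come out of an extractor.
clean = "[ prima volta"   # spaces preserved
merged = "[primavolta"    # spaces lost, words merged

strict = re.compile(r"\[\sprima(?!\S)")  # pattern from the comments above
lenient = re.compile(r"\[\s*prima")      # optional whitespace, no right-hand boundary

# Record whether each pattern finds the phrase in each version of the text.
results = {
    text: (bool(strict.search(text)), bool(lenient.search(text)))
    for text in (clean, merged)
}

for text, (strict_hit, lenient_hit) in results.items():
    print(repr(text), "strict:", strict_hit, "lenient:", lenient_hit)
```

The trade-off is that the lenient pattern also matches longer words that begin with `prima` (for example `[primavera`), so the hits may need a manual check.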
Maybe glance at https://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf — dstromberg, Mar 23 '20 at 15:37
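One direction the linked question points to is trying a different extractor. The sketch below assumes pdfminer.six is installed and uses its `extract_pages` and `LTTextContainer`; the helper names `find_in_pages` and `iter_page_texts` are made up for illustration. Whether pdfminer.six preserves the spacing in this particular PDF better than PyPDF2 is not guaranteed and would need to be checked on the real file.

```python
import re

# Lenient pattern from the comment thread; adjust as needed.
PATTERN = re.compile(r"\[\s*prima")


def find_in_pages(page_texts, pattern=PATTERN):
    """Yield (page_number, matched_text) for every match; pages count from 1."""
    for number, text in enumerate(page_texts, start=1):
        for match in pattern.finditer(text or ""):
            yield number, match.group(0)


def iter_page_texts(path):
    """Yield the text of each page, one page at a time (assumes pdfminer.six)."""
    # Imported lazily so find_in_pages can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for layout in extract_pages(path):
        yield "".join(
            element.get_text()
            for element in layout
            if isinstance(element, LTTextContainer)
        )
```

Usage would look like `for page, hit in find_in_pages(iter_page_texts("fel.pdf")): print(page, hit)`. Both helpers are generators, so a 1000-page file is processed page by page instead of being held in memory all at once.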

0 Answers