I have a very long PDF (more than 1000 pages) that I'm trying to parse in order to search it for a regular expression. If I search for a simple, common word, my code works, but with the regex I actually need, it finds nothing. I'm fairly sure the regex is right, since I tested it on regex101. My guess is that some parts of the PDF are laid out in a particular way (like tables, but without ruled lines) and the parser can't read those parts. Is there a solution to this problem?

Here is the code:

import re

import PyPDF2

# regex = re.compile(r"\[(\s)prima(?!\S)")
# The file is located on the Desktop
with open('fel.pdf', 'rb') as pdfFileObj:  # open the file in binary mode
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    count = pdfReader.numPages  # total number of pages
    for i in range(count):
        page = pdfReader.getPage(i)
        # search the extracted text of each page
        for match in re.findall(r"\[(\s)prima(?!\S)", page.extractText()):
            print(match)

My regex means: find all the occurrences of "[ prima ".

Anna
  • findall returns the values of the capturing groups. You can omit the group around `(\s)` like `\[\sprima(?!\S)` – The fourth bird Mar 23 '20 at 14:24
  • even if I need the space to be there? – Anna Mar 23 '20 at 14:25
  • Yes, `\s` on its own still matches a whitespace char. A capturing group is only needed if you want, for example, to get/process the matched value (which in this case is a whitespace char) – The fourth bird Mar 23 '20 at 14:27
  • I don't understand how I would have to change it :( – Anna Mar 23 '20 at 14:28
  • I think you could change it to `re.findall(r"\[\sprima(?!\S)"` https://regex101.com/r/Avc2l7/1/ – The fourth bird Mar 23 '20 at 14:30
  • thank you, done! but it still doesn't find anything (I'm sure the combination exists in the PDF, many times) – Anna Mar 23 '20 at 14:33
  • The pattern `\[\sprima(?!\S)` matches `[`, then a mandatory whitespace char followed by `prima`, and asserts a whitespace boundary to the right. Perhaps matching 0+ whitespace chars will help: `\[\s*prima(?!\S)` https://regex101.com/r/SG5WQY/1 If you print `page.extractText()`, is your expected text there? – The fourth bird Mar 23 '20 at 14:39
  • it does! I'm noticing now though that the parser doesn't read all spaces properly and merges words! that's why it's not reading the regex – Anna Mar 23 '20 at 14:44
  • Maybe glance at https://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf – dstromberg Mar 23 '20 at 15:37
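
To illustrate the two regex points from the comments (findall returning only the capturing group, and `\s*` tolerating missing or extra whitespace), here is a minimal sketch using only the standard `re` module. The sample strings `clean` and `merged` are made up for illustration and only stand in for whatever `page.extractText()` returns; they are not taken from the actual PDF.

```python
import re

# Hypothetical stand-ins for extracted page text (assumptions, not real PDF output)
clean = "see [ prima facie ] and [ prima"         # whitespace preserved
merged = "see [prima facie ] and [  prima\nnext"  # whitespace dropped or doubled

# With a capturing group, findall returns only the group (the whitespace char)
with_group = re.findall(r"\[(\s)prima(?!\S)", clean)

# Without the group, findall returns the whole match
without_group = re.findall(r"\[\sprima(?!\S)", clean)

# Exactly one whitespace char is required, so this finds nothing in `merged`
strict_on_merged = re.findall(r"\[\sprima(?!\S)", merged)

# \s* accepts zero or more whitespace chars after the bracket
tolerant = re.findall(r"\[\s*prima(?!\S)", merged)

print(with_group)
print(without_group)
print(strict_on_merged)
print(tolerant)
```

If the extractor merges words on the right side as well (for example `primafacie`), the `(?!\S)` lookahead will still reject the match, so this only helps with whitespace between `[` and `prima`.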
