Please be gentle, I'm new to Python. I've installed several modules, but I can't find one that fits my needs. Maybe you can point me to the right one.

I want to search a PDF that already has a searchable text layer for a certain pattern ([A-Z][A-Z][0-9][0-9][0-9][0-9]) and get back the location (X1, Y1, X2, Y2) of every match.

Is there a module that does this, or something similar that I could use?

Thanks!

Dawko
Step by step, I'm getting there... This seems to be the way to go: https://stackoverflow.com/questions/25248140/how-does-one-obtain-the-location-of-text-in-a-pdf-with-pdfminer but how do I implement the search for a certain pattern? – Dawko May 15 '20 at 14:13

1 Answer

Got it, more or less. That's something I can work with. Note that it prints the bounding box of the whole text box that contains a match, not of the match itself. If you have a better solution, please step forward! I appreciate any help with this.

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
import re

# Two uppercase letters followed by four digits, compiled once.
PATTERN = re.compile(r"[A-Z][A-Z][0-9][0-9][0-9][0-9]")

def parse_layout(layout):
    """Print every top-level text box on the page that contains a match."""
    for lt_obj in layout:
        if isinstance(lt_obj, LTTextBox):
            text = lt_obj.get_text()
            if PATTERN.search(text):
                print(lt_obj.__class__.__name__)
                print(lt_obj.bbox)  # (x0, y0, x1, y1) of the whole text box
                print(text)

rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

# The file must stay open while the pages are being read.
with open('M:/test.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()
        parse_layout(layout)
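
The code above reports the box of the whole text block. To narrow that down to the match itself, one option is to work per character: in pdfminer, iterating an LTTextBox yields its text lines, and iterating a line yields LTChar objects (which have get_text() and bbox) plus LTAnno objects (inserted spaces and newlines, which have no bbox). The sketch below is only the pure-Python part of that idea. The function name match_bboxes and the sample coordinates are made up for illustration, and it assumes you have already collected one line as a list of (text, bbox) pairs, with None as the bbox for LTAnno.

```python
import re

PATTERN = re.compile(r"[A-Z]{2}[0-9]{4}")  # same pattern as in the question

def match_bboxes(chars, pattern=PATTERN):
    """Return (matched_text, (x0, y0, x1, y1)) for each match in one line.

    chars: list of (text, bbox) pairs in reading order, where bbox is
    (x0, y0, x1, y1) or None for items without a position.
    """
    # Build the line text and a parallel list with one bbox per character,
    # so that regex offsets can be mapped back to boxes.
    text_parts = []
    box_per_char = []
    for piece, bbox in chars:
        text_parts.append(piece)
        box_per_char.extend([bbox] * len(piece))
    text = "".join(text_parts)

    results = []
    for m in pattern.finditer(text):
        boxes = [b for b in box_per_char[m.start():m.end()] if b is not None]
        if not boxes:
            continue
        # Union of the character boxes covered by the match.
        union = (
            min(b[0] for b in boxes),
            min(b[1] for b in boxes),
            max(b[2] for b in boxes),
            max(b[3] for b in boxes),
        )
        results.append((m.group(), union))
    return results

# Hypothetical line "ID AB1234" with invented coordinates.
line = [
    ("I", (10.0, 700.0, 14.0, 712.0)),
    ("D", (14.0, 700.0, 22.0, 712.0)),
    (" ", None),
    ("A", (26.0, 700.0, 34.0, 712.0)),
    ("B", (34.0, 700.0, 42.0, 712.0)),
    ("1", (42.0, 700.0, 48.0, 712.0)),
    ("2", (48.0, 700.0, 54.0, 712.0)),
    ("3", (54.0, 700.0, 60.0, 712.0)),
    ("4", (60.0, 700.0, 66.0, 712.0)),
]
found = match_bboxes(line)
print(found)
```

Working line by line means a code that is split across two lines would be missed, which seemed acceptable for short IDs like these.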
Dawko