Extract Numbers from a certain location in PDF files

Question

I'm trying to write a script to extract numbers from the "Total Deviation" graph in PDF files that look like this. I want to extract the information from the location of the graph, rather than parsing the whole file and filtering it, because pdfminer exports the numbers in various and unpredictable patterns (I used this script). Sometimes it extracts whole rows together and sometimes it extracts columns, so I want to find a way to extract the numbers from various files in a consistent manner. Any suggestions would be much appreciated!

score 0 · Answer 1 · answered Dec 19 '19 at 20:01

Try pdfreader. You can extract the text containing the "pdf markdown" and then parse it with regular expressions, for example:

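from pdfreader import SimplePDFViewer, PageDoesNotExist

your_pdf_file_name = "your_file.pdf"  # placeholder: path to your PDF

pdf_markdown = ""

with open(your_pdf_file_name, "rb") as fd:
    viewer = SimplePDFViewer(fd)
    try:
        # Render each page and collect its text content until no pages remain
        while True:
            viewer.render()
            pdf_markdown += viewer.canvas.text_content
            viewer.next()
    except PageDoesNotExist:
        pass

# my_total_deviation_parser is a placeholder for your own regex-based parser
data = my_total_deviation_parser(pdf_markdown)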
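
The parser itself is left open above, so here is a minimal sketch of what my_total_deviation_parser might look like. It assumes the collected text holds PDF text operators in which each value is drawn as a literal string followed by Tj, for example (-3) Tj, and that the values of interest are signed integers. Both are assumptions about your files, not something pdfreader guarantees: some PDFs use TJ arrays or other operators instead, and the sample string below is made up for illustration.

```python
import re

# Matches a literal string shown with the Tj operator, e.g. "(-12) Tj".
# Escaped characters such as "\(" inside the string are tolerated.
TJ_STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj")

# A signed integer such as "-3", "0" or "12".
SIGNED_INT = re.compile(r"[-+]?\d+")


def my_total_deviation_parser(pdf_markdown):
    """Return every integer found in strings drawn with Tj, in stream order."""
    numbers = []
    for shown in TJ_STRING.findall(pdf_markdown):
        # One Tj string may hold several whitespace-separated values
        for token in shown.split():
            if SIGNED_INT.fullmatch(token):
                numbers.append(int(token))
    return numbers


# Hypothetical sample of text content, made up for illustration
sample = (
    "BT /F1 8 Tf 1 0 0 1 100 700 Tm (-3) Tj ET\n"
    "BT 1 0 0 1 120 700 Tm (12 -1) Tj ET\n"
    "BT (Total Deviation) Tj ET"
)
print(my_total_deviation_parser(sample))  # [-3, 12, -1]
```

This returns the numbers in the order they appear in the content stream, which is not necessarily reading order. To restrict the result to the graph's location you would also need to read the positioning operators (such as Tm) that precede each string and filter on the coordinates.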
