I'm trying to write a script to extract the numbers from the "Total Deviation" graph in PDF files that look like this. I want to extract the information by its location in the graph, rather than parsing the whole file and filtering it, because pdfminer exports the numbers in varying and unpredictable patterns (I used this script). Sometimes it extracts whole rows together and sometimes it extracts columns, so I'm looking for a way to extract the numbers from different files consistently. Any suggestions would be much appreciated!

Pen Gerald

1 Answer

Try pdfreader. You can extract the text containing the "PDF markdown" and then parse it with regular expressions, for example:

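from pdfreader import SimplePDFViewer, PageDoesNotExist

# your_pdf_file_name is a placeholder for the path to your PDF
with open(your_pdf_file_name, "rb") as fd:
    viewer = SimplePDFViewer(fd)

    pdf_markdown = ""

    # Render page by page; next() raises PageDoesNotExist after the last page
    try:
        while True:
            viewer.render()
            pdf_markdown += viewer.canvas.text_content
            viewer.next()
    except PageDoesNotExist:
        pass

# my_total_deviation_parser is your own function (e.g. regex-based), not part of pdfreader
data = my_total_deviation_parser(pdf_markdown)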
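
As a rough illustration of what the placeholder my_total_deviation_parser might look like, here is a minimal sketch. It assumes the numbers are drawn with simple Tj (show text) operators and that the strings contain no escaped parentheses. Neither assumption is confirmed for your files, so inspect the actual pdf_markdown first. It also returns every integer-looking string on the page, so narrowing the result to the "Total Deviation" graph (for example by the Td/Tm coordinates that precede each string) is left out here. The sample string is made up for demonstration.

```python
import re

# Matches a literal string operand followed by the Tj operator, e.g. "(-3) Tj".
# Assumption: no escaped parentheses inside the string and no TJ arrays.
TJ_STRING = re.compile(r"\(([^()]*)\)\s*Tj")
SIGNED_INT = re.compile(r"-?\d+")


def my_total_deviation_parser(pdf_markdown):
    """Return all integer-valued strings shown with Tj, in stream order."""
    values = []
    for match in TJ_STRING.finditer(pdf_markdown):
        text = match.group(1).strip()
        # Keep only strings that are entirely a signed integer
        if SIGNED_INT.fullmatch(text):
            values.append(int(text))
    return values


# Hypothetical content-stream fragment, for demonstration only
sample = (
    "BT /F1 8 Tf 100 700 Td (-3) Tj ET "
    "BT 120 700 Td (Total Deviation) Tj ET "
    "BT 140 700 Td (12) Tj ET"
)
print(my_total_deviation_parser(sample))
```

Because the values come back in content-stream order, the result should at least be consistent for files produced by the same software, but the order need not match the visual row or column order of the graph.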
Maksym Polshcha