I'm trying to write a script to extract the numbers from the "Total Deviation" graph in PDF files that look like this. I want to extract the information based on the location of the graph rather than parsing the whole file and filtering it, because pdfminer exports the numbers in varying and unpredictable patterns (I used this script). Sometimes it extracts whole rows together and sometimes it extracts columns, so I'm looking for a way to extract the numbers from different files in a consistent manner. Any suggestions would be much appreciated!
1 Answer
Try pdfreader. You can extract the text content of each page (the "PDF markdown") and then parse it with regular expressions, for example:
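    from pdfreader import SimplePDFViewer, PageDoesNotExist

    # Collect the text content ("PDF markdown") of every page.
    pdf_markdown = ""
    with open(your_pdf_file_name, "rb") as fd:
        viewer = SimplePDFViewer(fd)
        try:
            while True:
                viewer.render()
                pdf_markdown += viewer.canvas.text_content
                viewer.next()
        except PageDoesNotExist:
            pass  # moved past the last page

    # my_total_deviation_parser is a function you write yourself (e.g. regex-based).
    data = my_total_deviation_parser(pdf_markdown)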
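
As a rough illustration of what `my_total_deviation_parser` could look like, here is a minimal sketch. It assumes the numbers show up in the text content as literal strings followed by the `Tj` operator, such as `(-12) Tj`; your files may use `TJ` arrays instead or split a number across several strings, so inspect the real output first. The `sample` string is made up for the demonstration, and the function returns every integer it finds, not only those belonging to the Total Deviation plot.

```python
import re

# Assumption: each number is drawn as a literal string followed by the
# Tj (show text) operator, e.g. "(-12) Tj". Escaped characters are tolerated.
TJ_STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj")
SIGNED_INT = re.compile(r"-?\d+")


def my_total_deviation_parser(pdf_markdown):
    """Return every Tj string that is a signed integer, in stream order."""
    values = []
    for match in TJ_STRING.finditer(pdf_markdown):
        text = match.group(1).strip()
        if SIGNED_INT.fullmatch(text):
            values.append(int(text))
    return values


# Hypothetical text content, only for demonstration.
sample = "BT\n/F1 8 Tf\n(Total Deviation) Tj\n(-3) Tj\n(12) Tj\n(<0.5%) Tj\nET\n"
print(my_total_deviation_parser(sample))  # prints [-3, 12]
```

To restrict the result to the Total Deviation plot, you would probably also need to track the positioning operators that precede each string and keep only the strings inside the plot's area; that part depends on the layout of your files.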

Maksym Polshcha