I'm working on a Python project that involves extracting text from many PDF documents. I've come across one document that neither of these projects can parse:
https://github.com/euske/pdfminer/
https://github.com/deanmalmgren/textract
Even the command-line tool pdftotext cannot extract the text from this document. It prints text at first, then starts printing garbage after about 2 minutes of extraction.
The document can be found here: https://www.aiaa.org/uploadedFiles/Events/Conferences/2013_Conferences/2013_-_GNC_Infotech/Promotional_Materials/GNC%202013%20Final%20Program.pdf
I'm interested in one of two solutions:
- How could I accomplish the goal of extracting the text from this document in Python?
- How could I detect documents like this in general, so I could avoid trying to parse them altogether?
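To make the second option concrete, here is a rough sketch of the kind of check I have in mind. It assumes pdftotext is on the PATH and Python 3.7+. The names `try_extract`, `looks_like_garbage`, and `readable_ratio` are made up for illustration, and the timeout and threshold are untuned guesses. The ratio is computed over the whole output and only counts ASCII as readable, so it would miss a document that is mostly fine with a short garbage tail, and it would wrongly flag non-English text.

```python
import string
import subprocess

# Hypothetical, untuned values chosen only for illustration.
TIMEOUT_SECONDS = 30
MIN_READABLE_RATIO = 0.85

# Characters treated as "readable": ASCII letters, digits, punctuation, whitespace.
_READABLE = set(
    string.ascii_letters + string.digits + string.punctuation + string.whitespace
)


def readable_ratio(text):
    """Return the fraction of characters in text that are plain readable ASCII."""
    if not text:
        return 0.0
    return sum(ch in _READABLE for ch in text) / len(text)


def looks_like_garbage(text, min_ratio=MIN_READABLE_RATIO):
    """Heuristic: treat output as garbage if too few characters are readable."""
    return readable_ratio(text) < min_ratio


def try_extract(pdf_path, timeout=TIMEOUT_SECONDS):
    """Run pdftotext on pdf_path and return its text.

    Returns None if pdftotext is missing, exits with an error, exceeds the
    timeout, or produces output that looks like garbage.
    """
    try:
        result = subprocess.run(
            ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
            capture_output=True,
            timeout=timeout,
            check=True,
        )
    except (
        subprocess.TimeoutExpired,
        subprocess.CalledProcessError,
        FileNotFoundError,
    ):
        return None
    text = result.stdout.decode("utf-8", errors="replace")
    return None if looks_like_garbage(text) else text
```

With something like this, a document for which `try_extract` returns None could simply be skipped, but I don't know whether a timeout plus a character-ratio check is a reliable way to spot such files.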
Either of these would solve my problem. Thanks in advance!