Python PDF text extraction - Unable to extract from a specific document with pdfminer/textract

Question

I am working on a Python project that involves extracting text from many PDF documents. Interestingly, I've come across a document that cannot be parsed by either of these projects:

https://github.com/euske/pdfminer/

https://github.com/deanmalmgren/textract

Indeed, even the command line tool pdftotext cannot extract the text from the document. It prints text at first, then proceeds to print garbage after about 2 minutes of extraction.

The document can be found here: https://www.aiaa.org/uploadedFiles/Events/Conferences/2013_Conferences/2013_-_GNC_Infotech/Promotional_Materials/GNC%202013%20Final%20Program.pdf

I'm interested in one of two solutions:

How could I accomplish the goal of extracting the text from this document in Python?
How could I detect documents like this in general, so I could avoid trying to parse them altogether?

Either of these solutions would be ideal, so thanks in advance!

score 0 · Answer 1 · answered Mar 27 '18 at 04:14

I use Jupyter with Python 3.6 under Windows 10. With that setup I have to use pdfminer.six rather than the original pdfminer.

I recently had to reinstall everything, and it still works for me.
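
A minimal sketch of how this could look, assuming pdfminer.six is installed (`pip install pdfminer.six`) in a release recent enough to ship `pdfminer.high_level.extract_text`. The helpers `printable_ratio` and `looks_like_garbage`, and the 0.9 threshold, are hypothetical names and values made up for illustration. They are one rough way to approach the second part of the question (spotting garbage output), not part of pdfminer.

```python
import string

# Characters that count as "normal" text for this heuristic (printable ASCII,
# including whitespace such as spaces, tabs and newlines).
_PRINTABLE = set(string.printable)


def printable_ratio(text):
    """Return the fraction of characters in `text` that are printable ASCII."""
    if not text:
        return 0.0
    return sum(ch in _PRINTABLE for ch in text) / len(text)


def looks_like_garbage(text, threshold=0.9):
    """Heuristic: treat empty or mostly non-printable output as garbage.

    The 0.9 threshold is an arbitrary assumption; tune it on your own corpus.
    """
    return printable_ratio(text) < threshold


def extract_pdf_text(path):
    """Extract text with pdfminer.six.

    The import is done lazily so the helpers above work without pdfminer.six.
    """
    from pdfminer.high_level import extract_text
    return extract_text(path)
```

Usage might be `text = extract_pdf_text("program.pdf")` (a hypothetical local copy of the document) followed by `looks_like_garbage(text)`. The check is ASCII-based, so it would wrongly flag legitimate non-English text, and it does nothing about the long extraction time mentioned in the question; running the extraction in a separate process with a timeout would be one way to handle that.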

pyano
