
I am using Python for a project that involves extracting text from many PDF documents. I've come across a document that neither of these projects can parse:

https://github.com/euske/pdfminer/

https://github.com/deanmalmgren/textract

Even the command-line tool pdftotext cannot extract the text from this document. It prints text at first, then starts printing garbage after about 2 minutes of extraction.

The document can be found here: https://www.aiaa.org/uploadedFiles/Events/Conferences/2013_Conferences/2013_-_GNC_Infotech/Promotional_Materials/GNC%202013%20Final%20Program.pdf

I'm interested in one of two solutions:

  1. How could I accomplish the goal of extracting the text from this document in Python?
  2. How could I detect documents like this in general, so I could avoid trying to parse them altogether?

Either of these solutions would be ideal, so thanks in advance!

Has QUIT--Anony-Mousse

1 Answer


I use Jupyter with Python 3.6 on Windows 10. With this setup I have to use pdfminer.six.

I recently had to reinstall everything, and pdfminer.six still works for me.
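For the second part of the question (detecting documents that only produce garbage), one possible heuristic is to measure how much of the extracted text consists of control, unassigned, private-use, or replacement characters. This is only a sketch: the function names and the 0.3 threshold are assumptions, and what the garbage looks like for the linked document has not been checked.

```python
import unicodedata

# Unicode categories that rarely appear in real text:
# Cc = control, Cn = unassigned, Co = private use, Cs = surrogate.
_SUSPICIOUS = ("Cc", "Cn", "Co", "Cs")


def garbage_ratio(text):
    """Return the fraction of characters that look like extraction debris."""
    if not text:
        return 1.0  # treat empty output as a failed extraction
    bad = 0
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary whitespace is technically Cc, but fine
        if ch == "\ufffd" or unicodedata.category(ch) in _SUSPICIOUS:
            bad += 1
    return bad / len(text)


def looks_like_garbage(text, threshold=0.3):
    """True if more than `threshold` of the characters are suspicious."""
    return garbage_ratio(text) > threshold
```

Checking only the first page or the first few thousand characters keeps this cheap, and the threshold would need tuning against documents that are known to extract cleanly.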

pyano