
I would like to extract text from PDF files using PDFminer and Jupyter Notebook.

Here is an example of a PDF file from which I would like to extract text. When I use the code posted here, the output contains only the footer from page one, and the rest of the document is missed.

However, if I first use the Nitro Pro tool's OCR functionality to manually make the PDF file searchable, I can then use the same Python code to extract all the text from the file.

I checked the PDFminer documentation to see if there is a parameter that I'm setting incorrectly, but I couldn't find anything on this issue. I would like to convert many files, so converting each one manually with the Nitro Pro tool is not feasible.

b00kgrrl
  • That is a scanned document, consisting solely of *images*. Until OCR has been performed on it, there is simply no text to be extracted. – jasonharper Feb 26 '20 at 22:49
  • That's what I figured. I guess I was hoping that PDFminer would have an OCR option. – b00kgrrl Feb 26 '20 at 23:06
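
Since PDFminer only reads an existing text layer, one way to handle many files without manual conversion might be to try PDFminer first and fall back to OCR when almost nothing comes back. The sketch below is only an illustration of that idea, not a tested solution for the file in the question. It assumes pdfminer.six, pdf2image (which needs Poppler installed) and pytesseract (which needs the Tesseract binary installed) are available. The helper names (`has_text_layer`, `pdfminer_text`, `ocr_text`, `extract_with_fallback`, `extract_folder`) and the 200-character threshold are hypothetical choices made for this example.

```python
from pathlib import Path
from typing import Callable, Dict


def has_text_layer(text: str, min_chars: int = 200) -> bool:
    """Heuristic (an assumption): the PDF counts as searchable if the
    extracted text has at least `min_chars` non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def pdfminer_text(pdf_path: str) -> str:
    """Read the existing text layer with pdfminer.six."""
    from pdfminer.high_level import extract_text  # imported lazily
    return extract_text(pdf_path)


def ocr_text(pdf_path: str, dpi: int = 300) -> str:
    """Render each page to an image and run Tesseract OCR on it."""
    from pdf2image import convert_from_path  # requires Poppler
    import pytesseract                       # requires the Tesseract binary
    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def extract_with_fallback(
    pdf_path: str,
    extract: Callable[[str], str] = pdfminer_text,
    ocr: Callable[[str], str] = ocr_text,
    min_chars: int = 200,
) -> str:
    """Return the text layer if it looks complete, otherwise OCR the file."""
    text = extract(pdf_path)
    if has_text_layer(text, min_chars):
        return text
    return ocr(pdf_path)


def extract_folder(folder: str, **kwargs) -> Dict[str, str]:
    """Apply extract_with_fallback to every *.pdf file in `folder`."""
    return {
        pdf.name: extract_with_fallback(str(pdf), **kwargs)
        for pdf in sorted(Path(folder).glob("*.pdf"))
    }
```

Calling `extract_folder("path/to/pdfs")` would return a dictionary that maps each file name to its text. The threshold is crude: a scanned file whose footer is real text, like the one described in the question, should fall below it, but a scan with a long text header might not, so the value may need tuning per collection.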
