PDFminer only works if PDF is manually made searchable

Asked Feb 26 '20 at 21:19

I would like to extract text from PDF files using PDFminer and Jupyter Notebook.

Here is an example of a PDF file from which I would like to extract text. When I use the code posted here, the output contains only the footer from page one; the rest of the document is missing.

However, if I first use the Nitro Pro tool's OCR functionality to manually make the PDF file searchable, I am able to subsequently use the above Python code to extract all the text from the file.

I checked the PDFminer documentation to see whether I am setting a parameter incorrectly, but I couldn't find anything on this issue. I need to convert many files, so running each one through Nitro Pro by hand is not feasible.

b00kgrrl

That is a scanned document, consisting solely of *images*. Until OCR has been performed on it, there is simply no text to be extracted. – jasonharper Feb 26 '20 at 22:49
That's what I figured. I guess I was hoping that PDFminer would have an OCR option. – b00kgrrl Feb 26 '20 at 23:06
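
Since PDFminer has no OCR step of its own, one possible workaround for a batch of files is to try PDFminer first and fall back to OCR only when almost no text comes out. The following is a minimal sketch, not a tested solution for these particular files. It assumes `pdfminer.six`, `pdf2image` (which needs Poppler), and `pytesseract` (which needs the Tesseract binary) are installed. The helper names and the 50-character threshold are hypothetical choices made for illustration.

```python
from pathlib import Path


def has_text_layer(text: str, min_chars: int = 50) -> bool:
    """Heuristic: the PDF counts as searchable if enough non-whitespace text was extracted.

    The default threshold of 50 characters is an arbitrary assumption.
    """
    return len("".join(text.split())) >= min_chars


def extract_with_pdfminer(pdf_path: str) -> str:
    # Imported lazily so the heuristic above works without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def extract_with_ocr(pdf_path: str, dpi: int = 300) -> str:
    # Assumption: pdf2image renders each page to an image, pytesseract reads it.
    from pdf2image import convert_from_path
    import pytesseract
    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def pdf_to_text(pdf_path: str, min_chars: int = 50) -> str:
    """Use the embedded text layer when there is one, otherwise fall back to OCR."""
    text = extract_with_pdfminer(pdf_path)
    if has_text_layer(text, min_chars):
        return text
    return extract_with_ocr(pdf_path)


def convert_folder(folder: str) -> None:
    """Write a .txt file next to every PDF in the (hypothetical) input folder."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        pdf.with_suffix(".txt").write_text(pdf_to_text(str(pdf)), encoding="utf-8")
```

Calling `convert_folder("pdfs")` would process every PDF in a folder named `pdfs`. A file like the one in the question, where only a short footer is real text, may slip past a low threshold, so `min_chars` would likely need tuning, or the check could be applied page by page.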
