I need to extract the text from every page of a PDF file. My strategy is simple: I use pdftotext when the text is directly available (in other words, when I can highlight and copy text from the page) and tesseract when the page is made of scanned images. To identify the pages that need OCR, I use pdfimages. This works well in most cases, but one case puzzles me.
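For reference, the per-page decision described above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names (`page_text`, `page_image_count`, `count_listed_images`, `choose_method`) and the `min_chars` threshold are hypothetical and not taken from my actual script; only the `pdftotext` and `pdfimages -list` command-line tools are real.

```python
import subprocess


def page_text(pdf_path: str, page: int) -> str:
    """Return the text layer of one page (1-based) using the pdftotext CLI."""
    result = subprocess.run(
        ["pdftotext", "-f", str(page), "-l", str(page), "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def count_listed_images(listing: str) -> int:
    """Count image rows in `pdfimages -list` output (skips the 2 header lines)."""
    rows = listing.splitlines()[2:]
    return sum(1 for row in rows if row.strip())


def page_image_count(pdf_path: str, page: int) -> int:
    """Return how many images pdfimages reports for one page (1-based)."""
    result = subprocess.run(
        ["pdfimages", "-list", "-f", str(page), "-l", str(page), pdf_path],
        capture_output=True, text=True, check=True,
    )
    return count_listed_images(result.stdout)


def choose_method(text: str, image_count: int, min_chars: int = 20) -> str:
    """Pick an extraction method for a page.

    Assumption: a page with at least `min_chars` characters of text is
    treated as having a usable text layer; the threshold is arbitrary.
    """
    if len(text.strip()) >= min_chars:
        return "pdftotext"
    if image_count > 0:
        return "ocr"
    return "unknown"
```

With a rule like this, a page that has only a header in its text layer plus one small image would still be routed to `pdftotext`, which is the kind of case I'm running into.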
I'm not able to highlight or copy text from page 2 of this document. I've tried pdftotext, but it extracts only the text of the page header. I've also tried pdfimages, but it extracts only the image of a single logo.
As Python is the language I'm most comfortable with, I've tried the solutions suggested in this SO question, but none of them helped.
The content that I need to extract is the following:
But I'm only able to extract the content of its header:
I've seen the same behavior on other pages of that document. It's very important for me to extract the image(s) or the text from each page automatically.
pdfimages output:
Lenovo-g at ~/foo ±(master) ✗ ❯ pdfimages -list -f 2 -l 2 doc.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   2     0 image      59    76  rgb     3   8  jpeg   no     40431  0    72    72 2067B  15%
PyMuPDF's Page.get_images() method returns the following for page 2 of the same document:
(Pdb) page.get_images()
[(25, 0, 59, 76, 8, 'DeviceRGB', '', 'Im0', 'DCTDecode')]
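To make that tuple easier to read, here is a small sketch that labels its fields. The field order (xref, smask, width, height, bpc, colorspace, alternate colorspace, name, filter) is my understanding of the PyMuPDF documentation for `get_images()` with default arguments, so treat it as an assumption; `ImageEntry` is a hypothetical helper, not part of PyMuPDF.

```python
from collections import namedtuple

# Hypothetical helper: names for the 9 fields of a get_images() entry
# (assumed order, based on the PyMuPDF documentation).
ImageEntry = namedtuple(
    "ImageEntry",
    "xref smask width height bpc colorspace alt_colorspace name filter",
)

# The entry returned for page 2, copied from the pdb session above.
entries = [(25, 0, 59, 76, 8, 'DeviceRGB', '', 'Im0', 'DCTDecode')]

images = [ImageEntry(*entry) for entry in entries]
for img in images:
    print(f"{img.name}: {img.width}x{img.height} px, {img.colorspace}, filter={img.filter}")
```

Read this way, the single entry is a 59x76 RGB JPEG (DCTDecode), which matches the one logo image that pdfimages lists for this page.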