I need to extract the text from every page of a PDF file. My strategy is simple: I use pdftotext when the text is available directly (in other words, when I can highlight and copy text from the page) and tesseract when the page is made of scanned images. To identify the pages that need OCR, I use pdfimages. This works well in most cases, but one case puzzles me.
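For reference, the per-page decision described above could be sketched roughly like this. It is only an illustration, not my exact code: it swaps the pdfimages check for a simple character-count test on the pdftotext output, and the names `page_text`, `needs_ocr` and the `min_chars` threshold are made up for the example.

```python
import subprocess


def page_text(pdf_path: str, page: int) -> str:
    """Return the text layer of one page (1-based) using the pdftotext CLI."""
    result = subprocess.run(
        # -f/-l select the first and last page; "-" sends the text to stdout.
        ["pdftotext", "-f", str(page), "-l", str(page), "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def needs_ocr(text: str, min_chars: int = 200) -> bool:
    """Hypothetical heuristic: a page with very little extractable text goes to OCR.

    min_chars is an assumed threshold and would need tuning per document.
    """
    visible = "".join(text.split())  # drop spaces, newlines and form feeds
    return len(visible) < min_chars
```

With a check like this, a page whose text layer holds only a short header would fall below the threshold and be routed to tesseract, provided the header is shorter than `min_chars`.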

I'm not able to highlight or copy text from page 2 of this document. I've tried pdftotext, but it extracts only the text of the page header. I've also tried pdfimages, but it extracts only the image of a single logo.

As Python is the language I'm most comfortable with, I've tried the solutions suggested in this SO question, but none of them helped.

The content that I need to extract is the following:

enter image description here

But I'm only able to extract the content of its header:

enter image description here

I've seen the same behavior on other pages of that document. It's very important for me to extract the image(s) or the text from each page in an automated fashion.

pdfimages output:

$ pdfimages -list -f 2 -l 2 doc.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   2     0 image      59    76  rgb     3   8  jpeg   no     40431  0    72    72 2067B  15%

The get_images method of PyMuPDF's Page object returns the following for page 2 of the document mentioned above:

(Pdb) page.get_images()
[(25, 0, 59, 76, 8, 'DeviceRGB', '', 'Im0', 'DCTDecode')]
Kfcaio
  • Page 2 (of the 7155 pages!) does not contain any text. It does contain 1 image (a logo) and 6 Form XObjects which draw a small amount of text (e.g. "(Disponibiliza\347\343o: ) Tj"), which appear to be the header and footer content (drawn in blue). The majority of the content on the page, however, is drawn as linework: sequences of lines and curves which are filled. The only way for a computer to recreate text from such a document is with OCR. – KenS Dec 20 '21 at 19:59
  • Any way to identify pages made mostly of linework before OCR? As pointed out before, I need to know, for each page, whether pdftotext will be enough or whether I'll need to perform OCR – Kfcaio Dec 20 '21 at 20:07
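
One possible way to flag linework-heavy pages before OCR, sketched here with PyMuPDF, is to compare how much text a page yields with how many vector drawing commands it contains. This is an untested idea against the document above, not a confirmed solution: `classify_page`, `survey` and both thresholds are hypothetical names and values chosen for illustration, and `get_drawings()` may be slow on pages with a lot of vector content.

```python
def classify_page(n_chars: int, n_path_items: int,
                  min_chars: int = 200, min_path_items: int = 1000) -> str:
    """Label a page from two counts; both thresholds are assumed values to tune."""
    if n_chars >= min_chars:
        return "text"            # pdftotext is probably enough
    if n_path_items >= min_path_items:
        return "linework"        # little text, many vector paths: render, then OCR
    return "image_or_blank"      # little text, few paths: scanned image or empty page


def survey(pdf_path: str):
    """Yield (1-based page number, label) for every page of the PDF."""
    import fitz  # PyMuPDF, imported lazily so classify_page works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Count visible characters in the page's text layer.
            n_chars = len("".join(page.get_text().split()))
            # Each path returned by get_drawings() lists its draw commands in "items".
            n_items = sum(len(path["items"]) for path in page.get_drawings())
            yield page.number + 1, classify_page(n_chars, n_items)
```

Pages labelled "linework" contain no embedded image for pdfimages to extract, so they would have to be rasterized first (for example with pdftoppm or `page.get_pixmap()`) and the resulting image passed to tesseract.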
