I am trying to extract the title of a PDF file. The file's metadata doesn't really help, so I am thinking of converting the first page of each PDF to an image and reading that image with Tesseract. I can assume that the largest text found on the image is the title.

I open the PDF with fitz (PyMuPDF) and save the first page as an image.

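import fitz  # PyMuPDF

doc = fitz.open(filename)  # filename: path to the PDF
page = doc.load_page(0)    # first page (loadPage in older PyMuPDF releases)
pix = page.get_pixmap()    # render the page (formerly getPixmap)
pix.save("output.png")     # formerly writePNG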

Then I read the image file using OpenCV, pass it to Tesseract, and draw bounding boxes around the characters it detects.

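import cv2
import pytesseract

filename = 'output.png'

img = cv2.imread(filename)
h, w, _ = img.shape

# One line per character: "char x1 y1 x2 y2 page", origin at the bottom-left
boxes = pytesseract.image_to_boxes(img) # also include any config options you use

for b in boxes.splitlines():
    b = b.split(' ')
    # Flip the y coordinates because OpenCV's origin is at the top-left
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)

cv2.imshow(filename, img)
cv2.waitKey(0)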

I am not really familiar with Tesseract OCR, so this is where I am stuck. How do I get the text with the largest bounding boxes?

My PDF files are mostly scientific papers and journal articles, so that should give you an idea of what they look like.

Thank you.

catris25
  • For arbitrary inputs, I guess, it's quite impossible to find a generic solution. Even scientific research papers have highly varying appearances. Instead of using `pytesseract.image_to_boxes`, you should use morphological operations (e.g. closing) to find candidate bounding boxes for the title, and then check the `x`, `y` coordinates, width and height to find the best candidate. Having that bounding box, you can simply use `pytesseract.image_to_string` on that subimage. But again, it's quite impossible to provide a solution without seeing some of your examples. – HansHirse Mar 25 '21 at 08:44

1 Answer

Normally Tesseract returns the OCR operation result as a nested structure as follows:

  • Block
    • Lines
      • Words
        • Chars (only in Tesseract 3; with Tesseract 4 you only get word boxes)

pytesseract.image_to_data returns, for each detected word, its block, paragraph, line and word indices along with its bounding box (left, top, width, height).

My suggestion is to go through the words of each line and find the line with the largest average word height, which most probably is the title of the paper.
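
A minimal sketch of that idea, assuming the dictionary layout that pytesseract.image_to_data produces with output_type=Output.DICT (keys such as text, conf, block_num, par_num, line_num and height). The function names largest_text_line and title_from_image are made up for this example, and the heuristic is untested on real papers: it returns a single line, so a title that wraps over several lines would come back only partially.

```python
from collections import defaultdict


def largest_text_line(data, min_conf=0.0):
    """Return the text of the line whose words have the largest mean height.

    `data` is assumed to be the dict returned by
    pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT).
    """
    # (block, paragraph, line) -> list of (word, height)
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        word = word.strip()
        # Rows describing blocks, paragraphs or lines carry empty text.
        if not word or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((word, data["height"][i]))

    if not lines:
        return ""

    # Pick the line with the tallest words on average.
    best = max(lines.values(), key=lambda ws: sum(h for _, h in ws) / len(ws))
    return " ".join(w for w, _ in best)


def title_from_image(path):
    """Hypothetical helper: OCR an image file and guess its title line."""
    import cv2          # imported lazily so largest_text_line has no dependencies
    import pytesseract

    img = cv2.imread(path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    return largest_text_line(data)
```

Calling title_from_image("output.png") on the rendered first page would then return the candidate title. Averaging over the words of a line, rather than taking the single tallest word, keeps one oversized glyph or a misread box from deciding the result.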

Please refer to this answer to see how to get the word boxes.

Baraa