I am trying to extract the title of a PDF file. The file's metadata doesn't help, so I am thinking of converting the first page of each PDF to an image and reading that image with Tesseract. I am assuming that the largest text found on the page is the title.
I read the PDF using fitz (PyMuPDF), load the first page, and save it as an image:
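import fitz  # PyMuPDF

doc = fitz.open(filename)  # filename: path to the PDF
page = doc.load_page(0)    # first page (older releases: loadPage)
pix = page.get_pixmap()    # render the page (older releases: getPixmap)
pix.save("output.png")     # older releases: writePNG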
Then I read the image file using OpenCV, pass it to Tesseract, and draw a bounding box around each detected character (image_to_boxes returns one box per character, not per word):
import cv2
import pytesseract

filename = 'output.png'
img = cv2.imread(filename)
h, w, _ = img.shape
boxes = pytesseract.image_to_boxes(img) # also include any config options you use
for b in boxes.splitlines():
    b = b.split(' ')
    # Tesseract's origin is bottom-left, OpenCV's is top-left, so flip y
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
cv2.imshow(filename, img)
cv2.waitKey(0)
I am not very familiar with Tesseract OCR, so this is where I am stuck: how do I get the text with the largest bounding boxes?
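To make the question concrete, here is a minimal sketch of the kind of selection I have in mind. It assumes the OCR result comes from pytesseract.image_to_data with output_type=pytesseract.Output.DICT rather than image_to_boxes, because image_to_data reports word-level boxes together with block, paragraph, and line numbers. The function name largest_text_line and the min_conf parameter are made up for illustration, and box height is only a rough stand-in for font size.

```python
from collections import defaultdict
from statistics import median


def largest_text_line(data, min_conf=0.0):
    """Return the text of the OCR line whose words have the largest median height.

    `data` is assumed to have the shape of
    pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT).
    """
    lines = defaultdict(list)  # (block, paragraph, line) -> [(height, word), ...]
    for i, word in enumerate(data["text"]):
        word = word.strip()
        # Skip empty entries and low-confidence words (non-word rows have conf -1).
        if not word or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["height"][i], word))

    if not lines:
        return ""
    # The median is less sensitive than the max to a single oversized box.
    best = max(lines.values(), key=lambda ws: median(h for h, _ in ws))
    return " ".join(w for _, w in best)
```

With the image loaded as above, this would be called as largest_text_line(pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)). It returns a single line only, so a title that wraps over two lines would be cut short, and short words without ascenders can make a line look smaller than it is.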
My PDF files are mostly scientific papers and journal articles, so you get an idea of what they look like.
Thank you.