I have a Tesseract 4.0 setup using an LSTM model for OCR. Incoming scanned PDFs are split into individual pages, upsampled to 300 dpi PNGs, deskewed, and OCRed; the pages are then reassembled into a PDF with a text layer, and each page PNG is kept for display in a web browser.
Occasionally we receive PDFs that have already been professionally transcribed with text layers, and running Tesseract over those would actually lose accuracy.
We also have a requirement to later classify certain portions of the PNG pages according to specific tags, for a machine learning application.
So my questions are:
1) Is there any way to determine whether a PDF already has a text layer, and to gauge the accuracy of that text?
2) Can a PDF that already contains a text layer be decomposed into per-page hOCR files, so that specific regions of the PNG pages could be highlighted with a bounding box and the text for each region retrieved from the corresponding hOCR file?
3) When Tesseract saves OCR output in hOCR format, does that contain enough information to retrieve an arbitrary chunk of text corresponding to an exact region of the PNG the hOCR file was created from?
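To make question 1 concrete, here is the kind of check I have in mind: in practice I would probably shell out to poppler's `pdftotext` and test whether the output is non-empty, but as a self-contained sketch, the heuristic below scans the raw PDF for content streams and looks for the text-showing operators (`(...) Tj` / `[...] TJ`), inflating FlateDecode-compressed streams where possible. The sample PDF bytes are illustrative, and this is a crude heuristic, not a production parser (it cannot judge the *accuracy* of the text layer, only its presence):

```python
import re
import zlib

def pdf_has_text_operators(pdf_bytes):
    """Crude text-layer check: look for PDF text-showing operators
    ('(...) Tj' or '[...] TJ') inside content streams, decompressing
    FlateDecode-compressed streams where possible."""
    text_op = re.compile(rb"\((?:[^()\\]|\\.)*\)\s*Tj|\]\s*TJ")
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.S):
        data = m.group(1)
        try:
            data = zlib.decompress(data)  # FlateDecode stream
        except zlib.error:
            pass  # stream was not zlib-compressed; scan it raw
        if text_op.search(data):
            return True
    return False

# Illustrative minimal PDF fragment with an uncompressed text-drawing stream.
SAMPLE_WITH_TEXT = (
    b"%PDF-1.4\n"
    b"4 0 obj << /Length 38 >>\nstream\n"
    b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n"
    b"endstream\nendobj\n"
)

print(pdf_has_text_operators(SAMPLE_WITH_TEXT))  # True
```

A page image that was scanned without transcription produces no `Tj`/`TJ` operators, so a purely scanned PDF would come back `False` here.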
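For question 2, one route I have been considering is `pdftotext -bbox` (poppler), which emits per-word bounding boxes as XHTML. A converter from that shape into hOCR-style `ocrx_word` spans might look like the sketch below; the sample input mimics the `-bbox` output format, the word content is invented, and the 300/72 scale factor assumes the page PNGs are rendered at 300 dpi while `pdftotext` reports coordinates in 72-dpi PDF points:

```python
from xml.etree import ElementTree as ET

# Sample fragment in the shape of `pdftotext -bbox` output (content invented).
SAMPLE = """<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<body><doc>
<page width="612.0" height="792.0">
<word xMin="72.0" yMin="74.5" xMax="115.3" yMax="86.2">Hello</word>
<word xMin="120.1" yMin="74.5" xMax="160.9" yMax="86.2">world</word>
</page>
</doc></body>
</html>
"""

SCALE = 300 / 72.0  # PDF points (72 dpi) -> 300 dpi PNG pixels

def bbox_xml_to_hocr(xml_text):
    """Convert pdftotext -bbox XHTML into per-page lists of
    hOCR-style ocrx_word spans with pixel coordinates."""
    ns = "{http://www.w3.org/1999/xhtml}"
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter(ns + "page"):
        words = []
        for i, w in enumerate(page.iter(ns + "word")):
            x0 = round(float(w.get("xMin")) * SCALE)
            y0 = round(float(w.get("yMin")) * SCALE)
            x1 = round(float(w.get("xMax")) * SCALE)
            y1 = round(float(w.get("yMax")) * SCALE)
            words.append(
                f'<span class="ocrx_word" id="word_{i}" '
                f'title="bbox {x0} {y0} {x1} {y1}">{w.text}</span>'
            )
        pages.append("\n".join(words))
    return pages
```

With the coordinates scaled to the PNG's pixel grid, the resulting spans can be used directly to draw bounding boxes over the page image in the browser.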
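On question 3, my understanding is that Tesseract's hOCR output does carry per-word bounding boxes (plus confidences via `x_wconf`), so a region lookup should be feasible. A minimal sketch of retrieving the words whose boxes intersect a given rectangle is below; the hOCR fragment mimics Tesseract's output shape but its content is invented, and a real implementation should use a proper HTML parser rather than this regex:

```python
import re

# Minimal fragment in the shape of Tesseract hOCR output (content invented).
HOCR = """
<div class='ocr_page' title='bbox 0 0 2550 3300'>
 <span class='ocrx_word' title='bbox 100 200 180 240; x_wconf 96'>Invoice</span>
 <span class='ocrx_word' title='bbox 190 200 260 240; x_wconf 93'>No.</span>
 <span class='ocrx_word' title='bbox 100 400 200 440; x_wconf 91'>Total</span>
</div>
"""

WORD_RE = re.compile(
    r"<span[^>]*ocrx_word[^>]*title='bbox (\d+) (\d+) (\d+) (\d+)[^']*'"
    r"[^>]*>(.*?)</span>",
    re.S,
)

def words_in_region(hocr, region):
    """Return the text of hOCR words whose bbox intersects
    region = (x0, y0, x1, y1) in page-image pixel coordinates."""
    rx0, ry0, rx1, ry1 = region
    hits = []
    for m in WORD_RE.finditer(hocr):
        x0, y0, x1, y1 = map(int, m.group(1, 2, 3, 4))
        # Standard axis-aligned rectangle intersection test.
        if x0 < rx1 and rx0 < x1 and y0 < ry1 and ry0 < y1:
            hits.append(m.group(5))
    return hits

print(words_in_region(HOCR, (0, 0, 300, 300)))  # ['Invoice', 'No.']
```

Since the hOCR coordinates refer to the exact image Tesseract was run on, a box drawn over the source PNG and the words returned here should line up pixel-for-pixel.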
Thanks in advance