I have a Tesseract 4.0 setup using an LSTM model for OCR. Incoming scanned PDFs are split into individual pages, upsampled to 300 dpi PNGs, deskewed, and OCRed; the pages are then reassembled into a PDF with a text layer, and each page PNG is kept for display in a web browser.
Occasionally we receive PDFs that have already been professionally transcribed with text layers, and running Tesseract over those would actually lose accuracy.
We also have a requirement to later classify certain portions of the PNG pages according to specific tags, for a machine learning application.
So my questions are:
1) Is there any way to determine whether a PDF already has a text layer, and to gauge the accuracy of that text?
2) Can a PDF that already contains a text layer be decomposed into per-page hOCR files, so that specific regions of the PNG pages could be highlighted with a bounding box and the text for each region retrieved from the corresponding hOCR file?
3) When Tesseract saves OCR output in hOCR format, does that contain enough information to retrieve an arbitrary chunk of text corresponding to an exact region of the PNG the hOCR file was created from?
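To make question 1 concrete, here is the kind of check I have in mind: in practice I would probably shell out to poppler's `pdftotext` and test whether the output is non-empty, but as a self-contained sketch, the heuristic below scans the raw PDF for content streams and looks for the text-showing operators (`(...) Tj` / `[...] TJ`), inflating FlateDecode-compressed streams where possible. The sample PDF bytes are illustrative, and this is a crude heuristic, not a production parser (it cannot judge the *accuracy* of the text layer, only its presence):

```python
import re
import zlib

def pdf_has_text_operators(pdf_bytes):
    """Crude text-layer check: look for PDF text-showing operators
    ('(...) Tj' or '[...] TJ') inside content streams, decompressing
    FlateDecode-compressed streams where possible."""
    text_op = re.compile(rb"\((?:[^()\\]|\\.)*\)\s*Tj|\]\s*TJ")
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.S):
        data = m.group(1)
        try:
            data = zlib.decompress(data)  # FlateDecode stream
        except zlib.error:
            pass  # stream was not zlib-compressed; scan it raw
        if text_op.search(data):
            return True
    return False

# Illustrative minimal PDF fragment with an uncompressed text-drawing stream.
SAMPLE_WITH_TEXT = (
    b"%PDF-1.4\n"
    b"4 0 obj << /Length 38 >>\nstream\n"
    b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n"
    b"endstream\nendobj\n"
)

print(pdf_has_text_operators(SAMPLE_WITH_TEXT))  # True
```

A page image that was scanned without transcription produces no `Tj`/`TJ` operators, so a purely scanned PDF would come back `False` here.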
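For question 2, one route I have been considering is `pdftotext -bbox` (poppler), which emits per-word bounding boxes as XHTML. A converter from that shape into hOCR-style `ocrx_word` spans might look like the sketch below; the sample input mimics the `-bbox` output format, the word content is invented, and the 300/72 scale factor assumes the page PNGs are rendered at 300 dpi while `pdftotext` reports coordinates in 72-dpi PDF points:

```python
from xml.etree import ElementTree as ET

# Sample fragment in the shape of `pdftotext -bbox` output (content invented).
SAMPLE = """<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<body><doc>
<page width="612.0" height="792.0">
<word xMin="72.0" yMin="74.5" xMax="115.3" yMax="86.2">Hello</word>
<word xMin="120.1" yMin="74.5" xMax="160.9" yMax="86.2">world</word>
</page>
</doc></body>
</html>
"""

SCALE = 300 / 72.0  # PDF points (72 dpi) -> 300 dpi PNG pixels

def bbox_xml_to_hocr(xml_text):
    """Convert pdftotext -bbox XHTML into per-page lists of
    hOCR-style ocrx_word spans with pixel coordinates."""
    ns = "{http://www.w3.org/1999/xhtml}"
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter(ns + "page"):
        words = []
        for i, w in enumerate(page.iter(ns + "word")):
            x0 = round(float(w.get("xMin")) * SCALE)
            y0 = round(float(w.get("yMin")) * SCALE)
            x1 = round(float(w.get("xMax")) * SCALE)
            y1 = round(float(w.get("yMax")) * SCALE)
            words.append(
                f'<span class="ocrx_word" id="word_{i}" '
                f'title="bbox {x0} {y0} {x1} {y1}">{w.text}</span>'
            )
        pages.append("\n".join(words))
    return pages
```

With the coordinates scaled to the PNG's pixel grid, the resulting spans can be used directly to draw bounding boxes over the page image in the browser.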
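On question 3, my understanding is that Tesseract's hOCR output does carry per-word bounding boxes (plus confidences via `x_wconf`), so a region lookup should be feasible. A minimal sketch of retrieving the words whose boxes intersect a given rectangle is below; the hOCR fragment mimics Tesseract's output shape but its content is invented, and a real implementation should use a proper HTML parser rather than this regex:

```python
import re

# Minimal fragment in the shape of Tesseract hOCR output (content invented).
HOCR = """
<div class='ocr_page' title='bbox 0 0 2550 3300'>
 <span class='ocrx_word' title='bbox 100 200 180 240; x_wconf 96'>Invoice</span>
 <span class='ocrx_word' title='bbox 190 200 260 240; x_wconf 93'>No.</span>
 <span class='ocrx_word' title='bbox 100 400 200 440; x_wconf 91'>Total</span>
</div>
"""

WORD_RE = re.compile(
    r"<span[^>]*ocrx_word[^>]*title='bbox (\d+) (\d+) (\d+) (\d+)[^']*'"
    r"[^>]*>(.*?)</span>",
    re.S,
)

def words_in_region(hocr, region):
    """Return the text of hOCR words whose bbox intersects
    region = (x0, y0, x1, y1) in page-image pixel coordinates."""
    rx0, ry0, rx1, ry1 = region
    hits = []
    for m in WORD_RE.finditer(hocr):
        x0, y0, x1, y1 = map(int, m.group(1, 2, 3, 4))
        # Standard axis-aligned rectangle intersection test.
        if x0 < rx1 and rx0 < x1 and y0 < ry1 and ry0 < y1:
            hits.append(m.group(5))
    return hits

print(words_in_region(HOCR, (0, 0, 300, 300)))  # ['Invoice', 'No.']
```

Since the hOCR coordinates refer to the exact image Tesseract was run on, a box drawn over the source PNG and the words returned here should line up pixel-for-pixel.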
Thanks in advance