Tesseract (tess4j) - Performing OCR over a byte array

Question

I am working on an application that reads e-mail attachments (mostly PDFs) and runs OCR on them. At the moment I have to save each file to the hard disk first, which I think should not be necessary. Is it possible to run OCR on a byte array without saving the file to disk first?

Thanks in advance, Igor

Does http://tess4j.sourceforge.net/docs/docs-2.0/net/sourceforge/tess4j/Tesseract.html help you? There are plenty of doOCR function prototypes, just choose one which fits you best. — Dmitrii Z., Jan 10 '18 at 12:34
I have seen this page before. Unfortunately not! Thanks for your help! — Rodrigo, Jan 10 '18 at 16:12
It looks like Tesseract uses the [PdfUtilities](http://tess4j.sourceforge.net/docs/docs-0.4/net/sourceforge/vietocr/PdfUtilities.html#convertPdf2Png%28java.io.File%29) class for PDF-to-image conversion, so you would need to do the conversion yourself. You can try [this](https://stackoverflow.com/questions/10862928/get-the-1st-page-of-a-pdf-as-image-from-the-byte-array-of-the-pdf) approach, which converts a PDF byte array to an image that you can then feed into tess4j — Dmitrii Z., Jan 10 '18 at 16:24
Ok. I will have a look and see. If I have good results I will post here in the future. Thank you! — Rodrigo, Jan 12 '18 at 13:58
@Rodrigo you could try saving it to a temp file. It is still on disk, but it would only ever be the one file. — Tinus Jackson, Jan 24 '18 at 12:29
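
The in-memory route suggested in the comments (bytes to image, image to OCR) could be sketched roughly as below. This is only an outline using the JDK: `InMemoryOcr`, `PageRenderer` and `OcrEngine` are hypothetical names introduced here for illustration, not part of tess4j. The assumption is that the OCR hook would be backed by tess4j's `doOCR(BufferedImage)` overload and that a PDF renderer hook would be backed by a library such as PDFBox, as in the linked answer; neither is wired in here. Only the plain-image case (PNG, JPEG) is actually implemented.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

final class InMemoryOcr {

    /** Hypothetical hook: turns attachment bytes into page images without touching the disk. */
    interface PageRenderer {
        List<BufferedImage> render(byte[] data) throws Exception;
    }

    /**
     * Hypothetical hook for the OCR engine. Assumption: with tess4j this would
     * delegate to Tesseract.doOCR(BufferedImage).
     */
    interface OcrEngine {
        String recognize(BufferedImage page) throws Exception;
    }

    /** Renderer for plain image attachments (PNG, JPEG, ...) using only the JDK. */
    static List<BufferedImage> decodeImage(byte[] data) throws IOException {
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(data));
        if (image == null) {
            // ImageIO.read returns null when no registered reader understands the bytes.
            throw new IOException("Unsupported image format");
        }
        List<BufferedImage> pages = new ArrayList<>();
        pages.add(image);
        return pages;
    }

    /** Renders the bytes to images in memory, then OCRs each page and joins the text. */
    static String ocr(byte[] data, PageRenderer renderer, OcrEngine engine) throws Exception {
        StringBuilder text = new StringBuilder();
        for (BufferedImage page : renderer.render(data)) {
            text.append(engine.recognize(page)).append('\n');
        }
        return text.toString();
    }
}
```

With an image attachment this might be called as `InMemoryOcr.ocr(bytes, InMemoryOcr::decodeImage, page -> tesseract.doOCR(page))`, where `tesseract` is an already configured tess4j instance. For PDFs, a different `PageRenderer` would have to rasterize each page from the byte array first.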


0 Answers