I am working on an application that reads email attachments (generally PDFs) and runs OCR on them. The problem is that I currently have to save the files to the hard disk first, which I think is unnecessary. Is it possible to run OCR on a byte array without saving the files to disk first?

Thanks in advance, Igor

Rodrigo
  • Does http://tess4j.sourceforge.net/docs/docs-2.0/net/sourceforge/tess4j/Tesseract.html help you? There are several doOCR overloads; choose the one that fits your input best. – Dmitrii Z. Jan 10 '18 at 12:34
  • I have seen this page before. Unfortunately not! Thanks for your help! – Rodrigo Jan 10 '18 at 16:12
  • It looks like Tesseract uses the [PdfUtilities](http://tess4j.sourceforge.net/docs/docs-0.4/net/sourceforge/vietocr/PdfUtilities.html#convertPdf2Png%28java.io.File%29) class for PDF-to-image conversion, and that method takes a File, so you would need to do the conversion yourself. You can try [this](https://stackoverflow.com/questions/10862928/get-the-1st-page-of-a-pdf-as-image-from-the-byte-array-of-the-pdf) approach, which converts a PDF byte array to an image that you can then feed into tess4j. – Dmitrii Z. Jan 10 '18 at 16:24
  • OK, I will have a look. If I get good results I will post them here. Thank you! – Rodrigo Jan 12 '18 at 13:58
  • @Rodrigo you could try saving it to a temp file. That still goes to disk, but there would only ever be one file at a time. – Tinus Jackson Jan 24 '18 at 12:29
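
The approach suggested in the comments (render the PDF bytes to images in memory, then OCR each image) could be structured roughly as below. This is only a sketch using the Java standard library: `InMemoryOcr`, `PageRenderer`, `OcrEngine`, `ocr` and `decodeImage` are hypothetical names made up for illustration, not part of tess4j. The assumption is that `PageRenderer` would be backed by a PDF library such as PDFBox, and that `OcrEngine` would delegate to tess4j's `doOCR(BufferedImage)` overload. The demo in `main` uses a PNG and a stub engine so it runs without either library.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.imageio.ImageIO;

class InMemoryOcr {

    /** Hypothetical hook: turns raw document bytes into one image per page (e.g. via a PDF library). */
    interface PageRenderer {
        List<BufferedImage> render(byte[] documentBytes) throws IOException;
    }

    /** Hypothetical hook: wraps the OCR engine; an engine that accepts a BufferedImage fits here. */
    interface OcrEngine {
        String recognize(BufferedImage page) throws Exception;
    }

    /** Runs OCR page by page; nothing is written to disk by this method. */
    static String ocr(byte[] documentBytes, PageRenderer renderer, OcrEngine engine) throws Exception {
        List<String> pages = new ArrayList<>();
        for (BufferedImage page : renderer.render(documentBytes)) {
            pages.add(engine.recognize(page));
        }
        return String.join("\n", pages);
    }

    /** Decodes image bytes (PNG, JPEG, ...) straight from memory. */
    static BufferedImage decodeImage(byte[] imageBytes) throws IOException {
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
        if (image == null) {
            // ImageIO.read returns null when no registered reader understands the data.
            throw new IOException("Unsupported or corrupt image data");
        }
        return image;
    }

    public static void main(String[] args) throws Exception {
        // Build a small PNG entirely in memory to stand in for an attachment.
        BufferedImage blank = new BufferedImage(40, 20, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ImageIO.write(blank, "png", buffer);
        byte[] attachment = buffer.toByteArray();

        // Stub engine: reports the page size instead of doing real OCR.
        String result = ocr(
                attachment,
                bytes -> Collections.singletonList(decodeImage(bytes)),
                page -> page.getWidth() + "x" + page.getHeight());
        System.out.println(result);
    }
}
```

Keeping the renderer and the engine behind small interfaces means the byte-array pipeline can be exercised without a PDF or OCR library on the classpath; whether the OCR engine itself creates temporary files internally is a separate question this sketch does not answer.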

0 Answers