
I am running OCR on a series of images using tess4j as a wrapper for Tesseract from Java. Each OCR pass still takes a significant amount of time (sometimes as much as 5 seconds), and I am trying to speed it up.

I do my own preprocessing and binarization of the image, so Tesseract does not need to perform its Otsu binarization.

I have read a tutorial for iOS that describes how to skip this image-processing step, but I can't find anything equivalent for tess4j.

The tutorial is here: https://github.com/gali8/Tesseract-OCR-iOS/wiki/Tips-for-Improving-OCR-Results
"... if you've already performed your own pre-processing/thresholding [...] you will probably want to bypass the internal Tesseract thresholding step. "

Does anybody know how I could use tess4j (from Java) in a way that skips the Otsu binarization?

user3452075

1 Answer


Check the tesseract-ocr parameters list for any applicable settings. However, I have read that if you send in a binarized image, Tesseract will skip the thresholding step on it (source).

nguyenq
  • I tested with a color image and its binarized version, and there was no difference in time. I am sending the images as PNG; do you know if I should set any attribute to mark the image as monochrome? – user3452075 Oct 21 '15 at 06:52
  • The [thresholder](https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/thresholder.cpp) tests whether the image's bit depth / 8 == 0 to decide whether to threshold it. So make sure your image is 1 bpp. – nguyenq Oct 21 '15 at 23:38
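
To illustrate the 1 bpp advice above, here is a minimal sketch of one way to get a 1-bit-per-pixel image in Java before handing it to tess4j. It uses only the standard `java.awt` classes. The class name `OneBitConverter` and the method name `toOneBitPerPixel` are made up for this example, and the sketch assumes the input has already been binarized, so every pixel is pure black or pure white.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper (not part of tess4j): repacks an already-binarized
// image into Java's 1-bit-per-pixel representation.
public class OneBitConverter {

    public static BufferedImage toOneBitPerPixel(BufferedImage source) {
        // TYPE_BYTE_BINARY with the default constructor uses a 1-bit
        // black/white indexed color model.
        BufferedImage binary = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        Graphics2D g = binary.createGraphics();
        try {
            // Disable dithering so pixels map to the nearest of black or white.
            g.setRenderingHint(RenderingHints.KEY_DITHERING,
                    RenderingHints.VALUE_DITHER_DISABLE);
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return binary;
    }
}
```

The returned image reports `getColorModel().getPixelSize() == 1` and could then be passed to tess4j's `doOCR(BufferedImage)`. Whether tess4j keeps the 1-bit depth when it hands the pixels to Tesseract, and whether that gives a measurable speedup, is not confirmed in this thread, so it is worth timing both variants.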