Bypassing Tesseract preProcessing

Question

I am running OCR on a series of images using tess4j as a Java wrapper for Tesseract. The OCR step still takes a significant amount of time (sometimes as long as 5 seconds), and I am trying to speed it up.

I am doing my own preprocessing and binarization of the image, so Tesseract does not need to run its Otsu binarization.

I have read a tutorial for iOS that describes how to skip the image-processing part, but I can't find anything equivalent for tess4j.

The tutorial is here: https://github.com/gali8/Tesseract-OCR-iOS/wiki/Tips-for-Improving-OCR-Results
"... if you've already performed your own pre-processing/thresholding [...] you will probably want to bypass the internal Tesseract thresholding step. "

Does anybody know how I could use tess4j (from Java) in a way that skips the Otsu binarization?

Any news on that one? – Alexander Belokon Sep 27 '18 at 12:33

score 1 · Answer 1 · answered Oct 21 '15 at 03:35

Check the tesseract-ocr parameter list for any applicable settings. However, I have read that if you send in a binarized image, Tesseract will skip the thresholding step on that image (source).

nguyenq

I tested with a color image and its binarized version, and there was no difference in time. I am sending the images as PNG; do you know if I should set any attribute to mark the image as monochrome? – user3452075 Oct 21 '15 at 06:52
The [thresholder](https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/thresholder.cpp) tests whether the image's bit depth / 8 == 0 to decide whether to threshold or not. So make sure your image is 1 bpp. – nguyenq Oct 21 '15 at 23:38
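
To illustrate the 1 bpp advice above, here is a minimal sketch of repacking an already-binarized image into a 1-bit-per-pixel `BufferedImage` using only the JDK. The class name `OneBitImage`, the method `toOneBit`, and the luminance cutoff of 128 are assumptions made for this example and do not come from the thread or from tess4j. An image that merely looks black and white may still be stored as 24-bit RGB or 8-bit gray, which would explain seeing no timing difference.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper (not part of tess4j): repacks an image into 1 bit per pixel.
final class OneBitImage {

    /**
     * Copies src into a TYPE_BYTE_BINARY image (1 bpp, black/white palette).
     * The source is assumed to be thresholded already; the cutoff of 128 only
     * decides which side any leftover gray pixel falls on.
     */
    static BufferedImage toOneBit(BufferedImage src) {
        BufferedImage dst = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived luminance, range 0..255.
                int luma = (r * 299 + g * 587 + b * 114) / 1000;
                dst.setRGB(x, y, luma >= 128 ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        BufferedImage rgb = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        rgb.setRGB(0, 0, 0x000000); // black pixel
        rgb.setRGB(1, 0, 0xFFFFFF); // white pixel
        BufferedImage bin = toOneBit(rgb);
        System.out.println("bits per pixel: " + bin.getColorModel().getPixelSize());
    }
}
```

The result could then be handed to tess4j, for example through `doOCR(BufferedImage)`. Whether the 1-bit depth survives all the way to the native thresholder depends on how the wrapper converts the buffer, so it is worth timing both variants rather than assuming a speedup.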
