What thresholding (binarization) algorithm is used in Tesseract OCR?

Question

I am working on a project that needs accurate OCR results for images with busy backgrounds, so I am comparing two OCR engines (one of them Tesseract) to make my choice. The results are strongly affected by the pre-processing step, especially image binarization. I extracted the binarized image from the other OCR engine and passed it to Tesseract, which improved Tesseract's results by 30-40%.

I have two questions and your answers would be of much help to me:

What binarization algorithm does Tesseract use, and is it configurable?
Is there a way to extract the binarized image of Tesseract OCR so I can test the other OCR with it?

Thanks in advance :)

score 9 · Accepted Answer · answered Apr 01 '15 at 07:32

I think I have found the answers to my questions:

1- The binarization algorithm used is Otsu thresholding. You can see it here in line 179.

2- To get the binarized image, you can call a method in the Tesseract API:

PIX* thresholded = api->GetThresholdedImage(); // caller owns the image; free it with pixDestroy(&thresholded)
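To make the Otsu thresholding named in point 1 concrete, here is a minimal, self-contained sketch of the textbook algorithm on a flat buffer of 8-bit grayscale pixels. It is not Tesseract's actual code; the function names `otsu_threshold` and `binarize` and the flat-vector image layout are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Tesseract): returns the Otsu threshold t
// for 8-bit grayscale pixels. Values <= t form one class, values > t the other.
int otsu_threshold(const std::vector<std::uint8_t>& pixels) {
    assert(!pixels.empty());

    // Build the 256-bin intensity histogram.
    std::array<double, 256> hist{};
    for (std::uint8_t p : pixels) hist[p] += 1.0;

    const double total = static_cast<double>(pixels.size());
    double sum_all = 0.0;
    for (int v = 0; v < 256; ++v) sum_all += v * hist[v];

    double w0 = 0.0;        // number of pixels with value <= t
    double sum0 = 0.0;      // intensity sum of those pixels
    double best_var = -1.0; // best between-class variance seen so far
    int best_t = 0;

    for (int t = 0; t < 256; ++t) {
        w0 += hist[t];
        sum0 += t * hist[t];
        if (w0 == 0.0) continue;      // lower class still empty
        const double w1 = total - w0;
        if (w1 == 0.0) break;         // upper class empty from here on

        const double mean0 = sum0 / w0;
        const double mean1 = (sum_all - sum0) / w1;
        // Between-class variance, up to a constant factor of 1 / total^2.
        const double between = w0 * w1 * (mean0 - mean1) * (mean0 - mean1);
        if (between > best_var) {
            best_var = between;
            best_t = t;
        }
    }
    return best_t;
}

// Hypothetical helper: maps pixels above the threshold to 255 and the rest to 0.
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t>& pixels, int t) {
    std::vector<std::uint8_t> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        out[i] = pixels[i] > t ? 255 : 0;
    }
    return out;
}
```

Because a single threshold is chosen for the whole image, this works best when text and background form two well-separated intensity groups, which is often not the case for images with busy backgrounds.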

Baraa

The link above is broken; [here](https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/thresholder.cpp#L210) is a hopefully more permanent one. (BTW, the edit queue for this answer seems to be full, if someone can fix that.) – gigabot May 21 '21 at 22:17

score 6 · Answer 2 · answered Jun 12 '16 at 18:28

Otsu thresholding is a global filter. You can use a local filter to get better results. Look at Sauvola's binarization (see here) or Nick's (here). Both algorithms are improvements on Niblack's method. I used this approach to binarize my image for OCR and got better results. Good luck.
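As an illustration of the local approach, here is a minimal sketch of Sauvola's method, which computes a separate threshold for each pixel from the mean m and standard deviation s of its neighborhood: T = m * (1 + k * (s / R - 1)). The function name `sauvola_binarize`, the row-major flat-vector layout, and the default values for `radius`, `k`, and `R` are assumptions chosen for illustration, not values taken from any particular library. The window statistics are computed naively for clarity; a practical implementation would typically use integral images.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: Sauvola local thresholding on a row-major 8-bit
// grayscale image. Pixels above their local threshold become 255, others 0.
std::vector<std::uint8_t> sauvola_binarize(const std::vector<std::uint8_t>& gray,
                                           int width, int height,
                                           int radius = 7,    // assumed default
                                           double k = 0.2,    // assumed default
                                           double R = 128.0)  // assumed default
{
    assert(width > 0 && height > 0);
    assert(gray.size() == static_cast<std::size_t>(width) * height);

    std::vector<std::uint8_t> out(gray.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Window clamped to the image borders.
            const int x0 = std::max(0, x - radius);
            const int x1 = std::min(width - 1, x + radius);
            const int y0 = std::max(0, y - radius);
            const int y1 = std::min(height - 1, y + radius);

            double sum = 0.0, sum_sq = 0.0;
            for (int yy = y0; yy <= y1; ++yy) {
                for (int xx = x0; xx <= x1; ++xx) {
                    const double v = gray[static_cast<std::size_t>(yy) * width + xx];
                    sum += v;
                    sum_sq += v * v;
                }
            }
            const double n = static_cast<double>((x1 - x0 + 1) * (y1 - y0 + 1));
            const double mean = sum / n;
            const double variance = std::max(0.0, sum_sq / n - mean * mean);
            const double stddev = std::sqrt(variance);

            // Sauvola's per-pixel threshold.
            const double threshold = mean * (1.0 + k * (stddev / R - 1.0));

            const std::size_t idx = static_cast<std::size_t>(y) * width + x;
            out[idx] = gray[idx] > threshold ? 255 : 0;
        }
    }
    return out;
}
```

Because each threshold depends only on the surrounding window, a dark stroke on a bright region and a dark stroke on a dim region can both be picked out even when no single global threshold separates them. The window size and `k` usually need tuning to the text size and contrast of the images at hand.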
