What does the Tesseract OCR library require of an image to be able to accurately extract text?

Question

I am using the Tesseract library to extract text from images. The language is Vietnamese. I have two images. The first one is from a website. The second is a screenshot taken from the Wordpad program. They are shown in links below:

1

2

The first one has 95% accuracy.

Bán căn hộ tầng 5 khu tập thể Thành công Bắc, DT 28m2, gần chợ ThànhCông, số đỏ, chính chủ, giá 800 triệu.LH:A.Châu, 0979622551,0905685336

The second image is much larger, but its accuracy is only about 60%.

Bặn căn hộ tầng ậ khu tập thể Ỉhành gông Băc. llĩ 28 m2. gân chợ ĩllành Bông. sũ Ilỏ. chính l:lIlì. giá 800 lriệu. l.ll: A.BhâU, 0979622551, 0905685336

What do I need to fix in the second image to get text as accurate as from the first one?

Try re-training Tesseract for the second font. – rmtheis Jan 10 '16 at 18:01

score 0 · Answer 1 · edited May 23 '17 at 10:28


As stated by @user898678 in "image processing to improve tesseract OCR accuracy", the following operations can improve OCR accuracy:

fix DPI if needed (300 DPI is the minimum)
fix text size (e.g. 12 pt should be OK)
try to fix text lines (deskew and dewarp text)
try to fix illumination of the image (e.g. no dark parts in the image)
binarize and de-noise the image
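
The DPI, text-size and binarization items above can be sketched in plain Python. This is only an illustration and not part of the original answer: the function names (`scale_factor_for_dpi`, `text_height_px`, `otsu_threshold`, `binarize`) are hypothetical, the 96 DPI figure for a screenshot is an assumption, and Otsu's method is just one common way to choose a threshold. The image is modelled as a list of rows of 0-255 grayscale values, so no imaging library is required.

```python
# Sketch only: hypothetical helper names, image as a list of rows of 0..255 ints.

def scale_factor_for_dpi(source_dpi, target_dpi=300):
    """Factor by which to enlarge an image so it is effectively at target_dpi."""
    return target_dpi / source_dpi


def text_height_px(point_size, dpi):
    """Nominal font height in pixels (1 pt is 1/72 inch)."""
    return point_size * dpi / 72


def otsu_threshold(gray):
    """Pick the threshold that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg = 0   # number of pixels with value <= t
    sum_bg = 0      # sum of those pixel values
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray):
    """Map pixels at or below the Otsu threshold to black (0), the rest to white (255)."""
    t = otsu_threshold(gray)
    return [[0 if value <= t else 255 for value in row] for row in gray]


if __name__ == "__main__":
    # Assumption: the screenshot was captured at 96 DPI.
    print(scale_factor_for_dpi(96))   # enlargement needed to reach 300 DPI
    print(text_height_px(12, 96))     # 12 pt text height at 96 DPI
    print(text_height_px(12, 300))    # 12 pt text height at 300 DPI
    print(binarize([[30, 30, 220], [220, 40, 220]]))
```

In practice an imaging library would do the resizing and thresholding. The point here is the arithmetic: under the 96 DPI assumption a 12 pt font is about 16 px tall, whereas at 300 DPI it is 50 px tall.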


answered Jan 10 '16 at 15:39

chalasr


The font size in the second image is larger than 12 pt. The words in the image are on the same straight line. Also, I don't see any dark part of the image. I am thinking about ways to increase the boldness of the text in the second image. That might fix the problem, but I don't know how. – mai nguyen Jan 11 '16 at 04:51
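
One possible way to make the strokes bolder, as the comment above asks, is a minimum filter: each pixel takes the darkest value in its neighbourhood, which thickens dark text on a light background. The sketch below is plain Python over a list of rows of 0-255 grayscale values. The name `thicken_dark_strokes` and the `radius` parameter are hypothetical, and whether this actually improves Tesseract's result on the second image is untested.

```python
# Sketch only: hypothetical helper, image as a list of rows of 0..255 ints.

def thicken_dark_strokes(gray, radius=1):
    """Replace each pixel with the darkest value in its (2*radius+1)^2 window.

    On dark-on-light text this widens every stroke by `radius` pixels per side.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            darkest = 255
            # The window is clipped at the image borders.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    darkest = min(darkest, gray[ny][nx])
            row.append(darkest)
        out.append(row)
    return out


if __name__ == "__main__":
    # A single dark pixel in the middle of a white 5x5 page.
    page = [[255] * 5 for _ in range(5)]
    page[2][2] = 0
    for line in thicken_dark_strokes(page):
        print(line)
```

Imaging libraries offer the same operation as a built-in filter (for example a minimum filter or morphological erosion of the light background), which would be much faster on a real screenshot.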
