
I am using the Tesseract library to extract text from images. The language is Vietnamese. I have two images. The first one is from a website. The second is a screenshot taken from the Wordpad program. They are shown in links below:

[Image 1: text from a website]

[Image 2: screenshot from Wordpad]

The first one has 95% accuracy.

Bán căn hộ tầng 5 khu tập thể Thành công Bắc, DT 28m2, gần chợ ThànhCông, số đỏ, chính chủ, giá 800 triệu.LH:A.Châu, 0979622551,0905685336

The second image is much larger, but the accuracy is only about 60%.

Bặn căn hộ tầng ậ khu tập thể Ỉhành gông Băc. llĩ 28 m2. gân chợ ĩllành Bông. sũ Ilỏ. chính l:lIlì. giá 800 lriệu. l.ll: A.BhâU, 0979622551, 0905685336

What do I have to fix in the second image to get text as accurate as from the first one?

Bart Kiers
mai nguyen

1 Answer


As stated by @user898678 in "Image processing to improve Tesseract OCR accuracy", the following operations can improve OCR accuracy:

  • fix DPI (if needed); 300 DPI is the minimum
  • fix text size (e.g. 12 pt should be OK)
  • try to fix text lines (deskew and dewarp text)
  • try to fix illumination of the image (e.g. no dark parts of the image)
  • binarize and de-noise the image
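Two of the steps above (fixing DPI and binarizing) can be sketched in plain Python. This is only an illustration, not Tesseract's own preprocessing: the function names `scale_factor_for_dpi`, `otsu_threshold` and `binarize` are hypothetical, the image is assumed to be a grayscale 2D list of values in 0..255, and Otsu's method is just one common way to pick a threshold. The 96 DPI figure used in the usage note is an assumption about a typical screenshot, not something measured from the images in the question.

```python
def scale_factor_for_dpi(current_dpi, target_dpi=300):
    """Return the factor by which to upscale an image to reach target_dpi.

    Never returns less than 1.0, so images already at or above the
    target resolution are left at their original size.
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def otsu_threshold(pixels):
    """Return Otsu's threshold for a 2D list of grayscale values (0..255)."""
    # Build a histogram of pixel intensities.
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg = 0  # number of pixels at or below the candidate threshold
    sum_bg = 0     # sum of their intensities
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, t
    return best_threshold


def binarize(pixels, threshold):
    """Map values <= threshold to 0 (black) and the rest to 255 (white)."""
    return [[0 if value <= threshold else 255 for value in row]
            for row in pixels]
```

For example, assuming a screenshot captured at 96 DPI, `scale_factor_for_dpi(96)` suggests enlarging the image by a factor of about 3 before OCR; the enlarged grayscale image could then be passed through `binarize(pixels, otsu_threshold(pixels))`.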
Community
chalasr
  • The font size in the second image is larger than 12 pt. The words in the images are on the same straight line. Also, I don't see any dark parts in the image. I am thinking about ways to make the text in the second image bolder. That might fix the problem, but I don't know how. – mai nguyen Jan 11 '16 at 04:51
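One possible way to make dark text bolder, as raised in the comment above, is a morphological minimum filter: each pixel takes the darkest value in its neighbourhood, so dark strokes grow outward. The sketch below is an assumption-laden illustration, not a confirmed fix for this image: `thicken_dark_text` is a hypothetical helper, the image is assumed to be a grayscale 2D list with dark text on a light background, and whether thicker strokes actually help Tesseract here is untested.

```python
def thicken_dark_text(pixels, radius=1):
    """Grow dark regions of a grayscale image (2D list, 0..255).

    Each output pixel is the minimum of the input pixels in a
    (2 * radius + 1) square window around it, clipped at the borders.
    With dark text on a light background this makes strokes thicker.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    result = []
    for y in range(height):
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        row = []
        for x in range(width):
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            # Darkest value in the window around (x, y).
            row.append(min(pixels[yy][xx]
                           for yy in range(y0, y1)
                           for xx in range(x0, x1)))
        result.append(row)
    return result
```

A radius of 1 adds roughly one pixel to every side of a stroke; larger values can merge neighbouring letters or fill in diacritics, which matters for Vietnamese text, so small values are the safer starting point.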