
I need to run OCR on images that have gone through a digital-to-analog (interlaced video) and back-to-digital conversion and were then JPEG compressed, which leaves compression artifacts. I have not been able to identify the exact fonts used, but they are a mix of sans-serif faces, so Arial, Calibri, and Tiresias might work well as a training set. There is no way to avoid the JPEG compression. The images are text-only, white-on-black, at standard-definition resolution (720x480, deinterlaced).

An example is located here, resized to 1000%: resized image capture

I've found a preprocessing pipeline that works fairly well for Tesseract:

  1. Resize to 400-600%
  2. Blur
  3. Threshold (binarization)
  4. Erode (get thinner stroke width)

One problem is that letters like 't' and 'f' end up with a diamond shape at the cross. The process works well overall but isn't quite perfect, so I'd like to train Tesseract. My question:

How should I create the training set?

Should I try to emulate the digital-to-analog-to-digital conversion by adding a small amount of noise and then compressing with JPEG? Should I preprocess my training set in the same way as listed above? If I train with noisy, JPEG-compressed images that match my captured images, is it best to skip preprocessing on the captured images?

Additionally, any hints on getting rid of the conversion/compression artifacts without sacrificing the text would be appreciated.

BobIsNotMyName
  • If you could describe Tesseract's core algorithm a bit--even if you can only find a few buzzwords--then I may be able to suggest some training methods. I've known about Tesseract for a while, but haven't had the time to tinker with it. Have you checked out this link: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3? Found it here: http://stackoverflow.com/questions/4908919/tesseract-ocr-library-learning-font?rq=1 – Rethunk Nov 06 '13 at 01:17
