I need to run OCR on images that have gone through a digital-to-analog (interlaced video) and back-to-digital conversion, and were then JPEG-compressed (so they contain compression artifacts). I have not been able to identify the exact fonts used, but they are a mix of sans-serif faces; Arial, Calibri, and Tiresias might work well as a training set. There is no way to avoid the JPEG compression. The images are text-only, white-on-black, at standard-definition resolution (720x480, deinterlaced).
An example is located here, resized to 1000%:
I've found a preprocessing pipeline that works fairly well for Tesseract:
- Resize to 400-600%
- Blur
- Threshold (binarization)
- Erode (get thinner stroke width)
One problem is that letters like 't' and 'f' end up with a diamond shape where the strokes cross. This process works well but isn't perfect, so I'd like to train Tesseract. My question:
How should I create the training set?
Should I try to emulate the digital-to-analog-to-digital conversion by adding a small amount of noise and then compressing with JPEG? Should I preprocess my training set in the same way as listed above? If I train on noisy, JPEG-compressed images that match my captured images, is it best to skip preprocessing on the captured images?
Additionally, any hints on getting rid of the conversion/compression artifacts without sacrificing the text would be appreciated.