3

My current project involves transcribing text from PDFs into text files. I first tried feeding the image files directly into an OCR program (tesseract), and it didn't do well. The original images are basically old newspapers and have some background noise, which I am sure tesseract has problems with. So I am trying to apply some image preprocessing before feeding them into tesseract. Are there any suggestions for an open-source image preprocessing engine that fits this situation? Instructions on how to use it would be even more appreciated!

Sardonic

3 Answers

5

I've never heard of an "image preprocessing engine" for that purpose, but you can take a look at OpenCV (Open Source Computer Vision Library) and implement your own "pre-processing engine". OpenCV is a computer vision library that offers many image processing features.

One interesting thing you might want to test as a preprocessing step is applying a threshold to the image to remove noise. Anyway, I've talked about this kind of thing in this thread.

karlphillip
4

As @karlphillip mentioned, I highly doubt there's a readily available preprocessing engine for your purposes, as preprocessing techniques vary greatly with the desired result.

Some common approaches to clearing up the text in noisy images include:

1. Adaptive thresholding (Sauvola or Niblack binarization).
2. Applying a median filter of a size slightly larger than the text to get a background image, then subtracting the background from the original image (to remove larger noise like creases, stains, handwritten notes, etc.).

OpenCV has implementations of these filters/binarization methods. If you have access to published literature there's quite a bit of work on binarization of noisy documents.

Noremac
  • So once I learn how to use OpenCV I can use those implemented methods to filter the document image?? – Sardonic Mar 23 '13 at 20:37
  • Looks like I was mistaken. OpenCV doesn't have Sauvola or Niblack implementations (although there is an adaptive thresholding function which may give similar results). It does have Otsu binarization, which could work for you if there is consistent lighting across the entire image. So, in answer to your question, yes. – Noremac Mar 25 '13 at 13:57
0

Check out ScanTailor. It has pretty impressive pre-processing functionality and it is open source.

Ivar
  • 1
    Rotating, deskewing, and page splitting do not really impress me. There is MUCH more to do for OCR. In particular, converting a color image into a real black-and-white image is the important step. – Elmue Jan 09 '18 at 01:38
  • The ScanTailor project is no longer maintained, so the domains scantailor.sourceforge.net and scantailor.org are no longer available. You can still find the [archived project here](https://github.com/scantailor/scantailor). – D. S. Apr 08 '21 at 12:12