
I am writing a script to extract text from images of questions like the one given below using Tesseract. But I am talking about 1,000-10,000 images per day, so manually tweaking the methods per image won't be feasible. Is it possible to apply some general pre-processing to all the images?

This image is included to show the typical quality of the images, not to suggest that the input is handwritten.

[image: sample scanned question]

This image has some noise, blur, and so on. I have already applied Otsu thresholding and morphological closing. What other pre-processing methods can I safely apply to every image so that they improve the bad-quality images without hurting the good-quality ones?
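For reference, a minimal sketch of the baseline just described (Otsu thresholding plus closing), assuming OpenCV; the file path is a placeholder:

import cv2
import numpy as np

img = cv2.imread("question.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Otsu picks the binarization threshold automatically per image
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Morphological closing fills small gaps and holes in the strokes
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)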

In other words, can I apply a sharpening kernel to every image? If yes, what would be an average (minimally safe) filter size/strength so that it does not hurt the good-quality images? For example:

import numpy as np

# 8-neighbour kernel; the weights sum to 1, so overall brightness is preserved
sharpen_kernel_choice1 = np.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]])

# 4-neighbour variant; the centre must be 5 (not 9) for the weights to sum to 1
sharpen_kernel_choice2 = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
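Either kernel would be applied uniformly with a 2-D convolution; a minimal sketch, assuming OpenCV and the grayscale img array from above:

import cv2

# ddepth=-1 keeps the output in the same dtype as the input
sharpened = cv2.filter2D(img, -1, sharpen_kernel_choice1)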

I am also thinking about using a Wiener filter to de-blur the images, for example:

import numpy as np
from scipy.signal import convolve2d
from skimage import restoration

# Simulate degradation: uniform blur plus Gaussian noise
psf = np.ones((5, 5)) / 25  # what would be a safe kernel size?
img = convolve2d(img, psf, 'same')
img += 0.1 * img.std() * np.random.standard_normal(img.shape)  # what should this value be?

# unsupervised_wiener returns a (deconvolved image, posterior parameters) tuple
deconvolved_img, _ = restoration.unsupervised_wiener(img, psf)
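Note that the convolution and added noise above only simulate degradation; for a scan that is already blurry, the deconvolution would be applied directly with a guessed PSF. A minimal sketch, where the uniform 5x5 PSF and the blurry_img input are assumptions rather than recommendations:

import numpy as np
from skimage import img_as_float, restoration

img_f = img_as_float(blurry_img)  # hypothetical already-degraded input; floats required
psf_guess = np.ones((5, 5)) / 25  # assumed PSF; the true blur is unknown
restored, _ = restoration.unsupervised_wiener(img_f, psf_guess)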
Deshwal
  • you can try adaptive histogram equalization (CLAHE) before thresholding; see the sketch after these comments – Ziri Jul 27 '20 at 04:20
  • Does this answer your question? [image processing to improve tesseract OCR accuracy](https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy) – Night Train Jul 29 '20 at 09:04
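A minimal sketch of the CLAHE suggestion above, assuming OpenCV; the clipLimit and tileGridSize values are illustrative defaults, not tuned recommendations:

import cv2

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
equalized = clahe.apply(gray)  # gray: 8-bit single-channel image
# ...then threshold (e.g. with Otsu) on the equalized image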

1 Answer


You want to use edge-preserving filters like anisotropic diffusion. The only Python implementation I know of can be found in medpy:

from medpy.filter.smoothing import anisotropic_diffusion

# smooths noise while preserving edges
smoothed = anisotropic_diffusion(gray_image, option=3)

I tried your image and found that unsharp masking works even better, since it augments the edges. For exposure correction you should definitely use a local filter; CLAHE, as mentioned by Ziri, is the best and easiest to use that I can think of.

from skimage.exposure import equalize_adapthist
from skimage.filters import threshold_otsu, unsharp_mask

# sharpen, then local contrast enhancement (CLAHE), then Otsu binarization
img = unsharp_mask(image)
clahe = equalize_adapthist(img)
binary = clahe > threshold_otsu(clahe)
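To hand the binarized result to Tesseract, the boolean mask has to be converted to an 8-bit image first; a minimal sketch, assuming the pytesseract wrapper:

import numpy as np
import pytesseract

# Tesseract expects an 8-bit image, so scale the boolean mask to 0/255
text = pytesseract.image_to_string((binary * 255).astype(np.uint8))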

The Tesseract project also provides a page dedicated to Image Preprocessing.

Night Train
  • Thanks for the answer, but the real problem is that I want to know whether there is a generalised approach. If I only had to extract from, say, 10 images per day, I would have manually tweaked the images, but I am looking at around 1,000-10,000 images per day. So is there a generalised approach for that? – Deshwal Jul 29 '20 at 10:32
  • Yes, there are a few tutorials that basically go through the steps mentioned on the image preprocessing page of Tesseract. That's what you would usually do, but I am not sure to what extent Tesseract does them itself. If your data is tied to a specific topic, it would also be beneficial to train a network for this problem. Also, `tesserocr` seems to be a more up-to-date Python wrapper for Tesseract. And last but not least, there is an open-source project for handwritten math expressions: https://github.com/falvaro/seshat – Night Train Jul 29 '20 at 11:05
  • I think these will not work in my case. I think I should use either a denoising GAN or an autoencoder to get the best out of ML. This is just one of the types of images I have; almost every image is different, so I think I would need to train the autoencoder on around 200,000-300,000 (2-3 lakh) images. – Deshwal Jul 30 '20 at 03:11
  • 1
    Now that you're mentioning it I just stumbled upon a tool for denoising and super resolution, maybe that will be worth taking a look. It also worked quite well for making almost not readable letters easily readable, maybe it will also work for hand written letters. https://github.com/tamarott/SinGAN – Night Train Jul 30 '20 at 06:32
  • Thanks, bud! Will do that, I think. I've been advised about the GAN part by a few people. – Deshwal Jul 30 '20 at 11:19