
One of the problems I am working on is running OCR on documents. A few of the paystub documents have a highlighted line filled with dots to set off important elements like Gross Pay, Net Pay, etc.

For Reference

These dots cause erroneous OCR results: the engine reads them as ':' characters. I have tried several image-processing tools, such as ImageMagick, to remove the dots, but in each case the quality of the rest of the text degrades, resulting in poor OCR.

The ImageMagick command that I tried is:

convert mm150.jpg -kuwahara 3 mm2.jpg

I have also tried connected components, erosion with kernels, etc., but each method fails in some way.

I would like to know whether there is a method I should follow, or whether I am missing some image-processing capability.

fmw42
Mohammed Jamali
  • that resolution is ridiculous. even a cheap scanner can achieve 2000-4000 dpi with ease. your scan looks more like 100 dpi to me. the only way I see is to train some custom OCR to read those numbers on noisy background. crap in = crap out – Piglet Mar 16 '18 at 09:57
  • That was actually a screen capture, I have updated the image. It is of pretty good resolution. – Mohammed Jamali Mar 16 '18 at 10:04
  • assuming that this page is A4, US letter or similar we have 1700 pixels across 8 inches of paper. That's a resolution of 213dpi which is very poor. I mean your character lines are 1 pixel wide... but regardless of that the black dots have the same size and saturation as your characters. that makes it nearly impossible to remove that noise without removing a significant part of the characters. you'll need custom solution tailored to that problem. or make people send that stuff digitally. I mean this is 2018. – Piglet Mar 16 '18 at 10:46

1 Answer

This issue can be resolved with OpenCV's connectedComponentsWithStats function. I found the approach in this question: How do I remove the dots / noise without damaging the text?

I adapted it slightly to fit my needs. This is the code that gave me the desired output:

    import cv2
    import numpy as np
    import sys

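    # Load as grayscale; invert while thresholding so text and dots become white foreground
    img = cv2.imread(sys.argv[1], cv2.IMREAD_GRAYSCALE)
    _, blackAndWhite = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)

    # Label 4-connected foreground regions; stats holds one row per label
    nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(blackAndWhite, 4, cv2.CV_32S)
    sizes = stats[1:, cv2.CC_STAT_AREA]  # pixel area of each component, skipping background label 0
    img2 = np.zeros(labels.shape, np.uint8)

    for i in range(0, nlabels - 1):
        if sizes[i] >= 8:   # keep only regions of at least 8 pixels; smaller ones are the dots
            img2[labels == i + 1] = 255

    # Invert back to black text on a white background
    res = cv2.bitwise_not(img2)

    cv2.imwrite('res.jpg', res)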

The output file is clean, with the dotted band removed, and it gives perfect OCR results.
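
To make the area filter concrete, here is a small dependency-free sketch of the same idea on a toy binary grid. It is not how OpenCV implements connectedComponentsWithStats; the function name remove_small_components, the min_area parameter, and the toy grid are all hypothetical and only illustrate that components below a pixel-count threshold are dropped while larger strokes survive.

```python
from collections import deque


def remove_small_components(grid, min_area):
    """Return a copy of a binary grid (1 = ink, 0 = background) in which
    every 4-connected component smaller than min_area is erased."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Keep the component only if its pixel count reaches the threshold.
            if len(component) >= min_area:
                for y, x in component:
                    out[y][x] = 1
    return out


# Toy example: a 2x3 "stroke" (area 6) next to two isolated dots (area 1 each).
toy = [
    [1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
cleaned = remove_small_components(toy, min_area=4)
for row in cleaned:
    print(row)
```

The threshold of 8 used in the code above worked for my scan; for other resolutions it would likely need tuning, since the dots have to be smaller than the smallest glyph part you want to keep (the dot of an "i" or a period, for example).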


Mohammed Jamali