
I have a processed (enlarged) captcha image that looks like this:
[captcha image]

As you can see, the font size of the text is a bit larger than the width of the noisy lines.
So I need an algorithm or code to remove the noisy lines from this image.

Using the Python PIL library and the chopping algorithm shown below, I didn't get an output image that could be easily read by OCR.

Here's the Python code I tried:

import PIL.Image
import sys

# python chop.py [chop-factor] [in-file] [out-file]

chop = int(sys.argv[1])
image = PIL.Image.open(sys.argv[2]).convert('1')
width, height = image.size
data = image.load()

# Iterate through the rows.
for y in range(height):
    for x in range(width):

        # Make sure we're on a dark pixel.
        if data[x, y] > 128:
            continue

        # Keep a total of non-white contiguous pixels.
        total = 0

        # Check a sequence ranging from x to image.width.
        for c in range(x, width):

            # If the pixel is dark, add it to the total.
            if data[c, y] < 128:
                total += 1

            # If the pixel is light, stop the sequence.
            else:
                break

        # If the run is no longer than the chop factor, replace it with white.
        if total <= chop:
            for c in range(total):
                data[x + c, y] = 255

        # Note: reassigning the loop variable doesn't skip ahead in a Python
        # for loop; the run we just whitened is simply skipped by the
        # dark-pixel check above on later iterations.
        x += total


# Iterate through the columns.
for x in range(width):
    for y in range(height):

        # Make sure we're on a dark pixel.
        if data[x, y] > 128:
            continue

        # Keep a total of non-white contiguous pixels.
        total = 0

        # Check a sequence ranging from y to image.height.
        for c in range(y, height):
            # If the pixel is dark, add it to the total.
            if data[x, c] < 128:
                total += 1

            # If the pixel is light, stop the sequence.
            else:
                break

        # If the run is no longer than the chop factor, replace it with white.
        if total <= chop:
            for c in range(total):
                data[x, y + c] = 255

        # Note: reassigning the loop variable doesn't skip ahead in a Python
        # for loop; the run we just whitened is simply skipped by the
        # dark-pixel check above on later iterations.
        y += total

image.save(sys.argv[3])

So, basically, I would like to know a better algorithm or code to get rid of the noise and make the image readable by OCR (Tesseract or pytesser).

djadmin

3 Answers


To quickly get rid of most of the lines, you can turn every black pixel that has two or fewer adjacent black pixels white. That should take care of the stray lines. Then, once you are left with a set of "blocks", you can remove the smaller ones.

This is assuming the sample image has been enlarged, and the lines are only one pixel wide.
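A rough sketch of that neighbour-count idea in PIL follows; the function name and the two-neighbour threshold are mine (not from this answer), and it assumes the image is already binarized as in the question's code:

import PIL.Image

def remove_isolated_pixels(path_in, path_out, max_neighbours=2):
    image = PIL.Image.open(path_in).convert('1')
    width, height = image.size
    src = image.load()

    # Write to a copy so that whitening one pixel doesn't change the
    # neighbour counts of pixels examined later in the scan.
    result = image.copy()
    dst = result.load()

    for y in range(height):
        for x in range(width):
            # Only dark pixels are candidates for removal.
            if src[x, y] > 128:
                continue

            # Count dark pixels among the 8 surrounding neighbours.
            neighbours = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dx == 0 and dy == 0:
                        continue
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < width and 0 <= ny < height and src[nx, ny] < 128:
                        neighbours += 1

            # A dark pixel with two or fewer dark neighbours is treated
            # as part of a thin line and turned white.
            if neighbours <= max_neighbours:
                dst[x, y] = 255

    result.save(path_out)

A second pass over the surviving connected "blocks" can then drop the smaller ones, as suggested above.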

DXsmiley

You could use your own dilate and erode functions, which will remove the smallest lines. A nice implementation can be found here.
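If you would rather not write them from scratch, Pillow's built-in rank filters can stand in for erode and dilate on a dark-text-on-white image. A minimal sketch, assuming a 3x3 kernel (the file names and kernel size are illustrative):

from PIL import Image, ImageFilter

image = Image.open('captcha.png').convert('L')

# With dark text on a white background, MaxFilter shrinks dark regions
# (erosion of the strokes), so one-pixel-wide lines disappear.
eroded = image.filter(ImageFilter.MaxFilter(3))

# MinFilter then grows the surviving dark strokes back towards their
# original thickness (dilation), so the pair acts as an opening.
opened = eroded.filter(ImageFilter.MinFilter(3))

opened.save('captcha_opened.png')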

Bartlomiej Lewandowski

I personally use dilate and erode as stated above, but combine that with some basic statistics on width and height to find outliers and eliminate those lines as needed. After that, a filter that takes the minimum value within a kernel and assigns that colour to the central pixel of a temporary image (iterating over the old image), before using the temporary image in place of the original, should work. In Pillow/PIL this minimum-based step is accomplished with img.filter(ImageFilter.MinFilter(3)).

If that is not enough, it should still produce an identifiable set of glyphs for which OpenCV's contours and minimum-area rotated bounding box can be used to rotate each letter for comparison (I recommend Tesseract or a commercial OCR at this point, since they handle a ton of fonts and have extra features like clustering and cleanup).
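A rough sketch of the contour and rotated-box step with OpenCV (this assumes OpenCV 4's findContours signature; the file name and the area threshold are illustrative):

import cv2

img = cv2.imread('captcha_cleaned.png', cv2.IMREAD_GRAYSCALE)

# Invert and binarize so the glyphs are white, as findContours expects.
_, binary = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY_INV)

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

for contour in contours:
    # Drop tiny blobs that are probably leftover line fragments.
    if cv2.contourArea(contour) < 20:
        continue

    # Minimum-area rotated rectangle: centre, (width, height), angle.
    (cx, cy), (w, h), angle = cv2.minAreaRect(contour)

    # Rotate about the glyph centre so its box is upright; the upright
    # crop can then be handed to Tesseract for comparison.
    M = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)
    upright = cv2.warpAffine(binary, M, (binary.shape[1], binary.shape[0]))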

Andrew Scott Evans