Read all characters from a captcha in Python using any library

Question

I want to read the captcha characters as one string.

The captchas I want to parse follow a specific pattern. Sample images are shown below:

All of the captcha images above share the same background. Only the characters change each time, along with their positions (i.e. the characters are not laid out in a fixed direction), but the number of characters appears to be the same in every image.

I tried to read the text from these images using the pytesseract Python library. I also tried the examples available here, but none of them worked for me.

In this SO link, I found a solution that is marked as the correct answer, but it only works for numeric captchas.

Also, many people suggest first removing the noisy/blurred background from the image and only then processing it, but I don't know how to do that.

I tried the following:

from PIL import Image, ImageEnhance, ImageFilter
import pytesseract

img = Image.open("test.png")
imagetext = pytesseract.image_to_string(img)
print(imagetext)

Can anyone please point me in the right direction?

@0day- I went through the contents of the above link, but it does not seem to work for these input images. It gives blank results. — ketan, Dec 04 '18 at 11:41

Alderven · Answer 1 · 2019-01-07T14:35:43.383

Since the captcha uses a limited set of symbols (Latin letters and digits) and each symbol's shape always stays the same, you can create a symbol library like this:

Name every image after the symbol it contains, e.g. "F.png", and put it in the "lib" folder. Then run the following script against your captchas:

import cv2
import glob
import numpy
import ntpath
from PIL import Image

IMAGES = ['S3RZX.png', 'HF482.png', 'YMMR9.png']
symbols = glob.glob('lib/*.png')


def guess_captcha(image):
    image = Image.open(image).convert('RGB')  # force 3 channels (handles palette/RGBA PNGs, matches cv2.imread)
    pixels = image.load()
    size = image.size

    # Cleanup background noises from captcha
    for x in range(size[0]):
        for y in range(size[1]):
            if pixels[x, y][0] < 120:
                pixels[x, y] = (0, 0, 0)
            else:
                pixels[x, y] = (255, 255, 255)

    # Search symbols in captcha
    image = numpy.array(image)
    result = []
    for symbol in symbols:
        img_symbol = cv2.imread(symbol)
        match = cv2.matchTemplate(image, img_symbol, cv2.TM_CCOEFF_NORMED)  # documented order: (image, template)
        if len(match):
            _, quality, _, location = cv2.minMaxLoc(match)
            if quality > 0.8:
                result.append({'x': location[0], 'symbol': ntpath.basename(symbol).replace('.png', '')})
    result = sorted(result, key=lambda k: k['x'])
    return ''.join([x['symbol'] for x in result])


for img in IMAGES:
    print('{} -> {}'.format(img, guess_captcha(img)))

What I got in the end:

S3RZX.png -> S3RZX
HF482.png -> HF482
YMMR9.png -> YR9

The first two captchas were solved correctly. The last one was solved incorrectly because the "M" symbols overlap. The overlap problem is really hard to solve, so it is better to simply skip such captchas.
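
A separate limitation of the script, independent of overlapping: cv2.minMaxLoc returns only the single best position per symbol, so a symbol that occurs twice in one captcha is reported at most once. One possible workaround is to collect every position whose score passes the threshold and collapse neighbouring hits into one. The sketch below shows only that collapsing step in plain Python. The names collapse_hits and read_captcha, the min_gap value and the example scores are all made up for illustration, and this does not help when symbols overlap so much that no match passes the threshold.

```python
def collapse_hits(hits, min_gap):
    """Keep the best-scoring hit in each cluster of nearby x positions.

    hits: list of (x, score) pairs for ONE symbol.
    min_gap: assumed minimum horizontal distance between two real occurrences.
    """
    kept = []
    # Visit the strongest hits first so each cluster is represented by its peak.
    for x, score in sorted(hits, key=lambda h: h[1], reverse=True):
        if all(abs(x - kx) >= min_gap for kx, _ in kept):
            kept.append((x, score))
    return sorted(kept)


def read_captcha(hits_by_symbol, min_gap=8):
    """Order all surviving hits from left to right and join their symbols."""
    found = []
    for symbol, hits in hits_by_symbol.items():
        for x, _ in collapse_hits(hits, min_gap):
            found.append((x, symbol))
    return ''.join(symbol for _, symbol in sorted(found))


# Made-up scores for illustration; real ones could be collected with e.g.
#   ys, xs = numpy.where(match >= 0.8)
#   hits = [(int(x), float(match[y, x])) for x, y in zip(xs, ys)]
example = {
    '7': [(5, 0.99)],
    'M': [(30, 0.90), (31, 0.95), (55, 0.90)],
    'K': [(80, 0.93)],
}
print(read_captcha(example))
```

With these made-up hits the two "M" positions at x=30 and x=31 collapse into one, the hit at x=55 survives as a second "M", and the printed string is 7MMK. A suitable min_gap depends on the symbol width in your captchas and has to be tuned by hand.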

@Alderven- Can you please provide the data that needs to be put in the "lib" directory? — ketan, Jan 24 '19 at 11:50
I don't have such data - it's up to you to collect a variety of captcha images and extract the symbols from them. — Alderven, Jan 24 '19 at 11:54
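
For collecting the "lib" images, here is a minimal sketch, assuming you locate each symbol's bounding box by hand (for example in an image editor). It applies the same red-channel threshold as the cleanup loop in the answer, crops the box and saves it under the symbol's name. The function name save_symbol, its parameters and the coordinates in the usage comment are hypothetical and not part of the original answer.

```python
import os
from PIL import Image


def save_symbol(captcha_path, box, name, lib_dir='lib', threshold=120):
    """Binarize a captcha, crop one symbol and save it as <lib_dir>/<name>.png.

    box is (left, upper, right, lower) in pixels and is found by hand.
    """
    red = Image.open(captcha_path).convert('RGB').split()[0]
    # Same rule as the cleanup loop: red < threshold -> black, else white.
    bw = red.point(lambda v: 0 if v < threshold else 255).convert('RGB')
    os.makedirs(lib_dir, exist_ok=True)
    out_path = os.path.join(lib_dir, name + '.png')
    bw.crop(box).save(out_path)
    return out_path


# Hypothetical usage: the coordinates are placeholders, not measured values.
# save_symbol('HF482.png', (30, 10, 55, 45), 'F')
```

Cropping from the already binarized image keeps the library symbols in the same black-and-white form that the matching step compares against.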
