
I would like to read this captcha using pytesseract:

[image: the captcha]

I followed the advice in this question: Use pytesseract OCR to recognize text from an image

My code is:

import pytesseract
import cv2

def captcha_to_string(picture):
    image = cv2.imread(picture)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3,3), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    # Morph open to remove noise and invert image
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
    invert = 255 - opening

    cv2.imwrite('thresh.jpg', thresh)
    cv2.imwrite('opening.jpg', opening)
    cv2.imwrite('invert.jpg', invert)

    # Perform text extraction
    text = pytesseract.image_to_string(invert, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    return text

But my code returns 8\n\x0c, which is not the text in the captcha.

This is what thresh looks like:

[image: thresh]

This is what opening looks like:

[image: opening]

This is what invert looks like:

[image: invert]

How can I improve the captcha_to_string function so that it reads the captcha properly? Thanks a lot.

mozway
vojtam

1 Answer


You are on the right track. Removing the noise (the small black spots in the inverted image) looks like the key to extracting the text successfully.

FYI, the pytesseract configuration string only makes the result worse, so I removed it.
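
A likely reason the configuration hurts: --psm 10 tells Tesseract to treat the image as a single character, and the whitelist allows digits only, while this captcha has several characters. If a configuration string is still wanted, a single-line mode (--psm 7) with an alphanumeric whitelist might be a starting point. This is an untested assumption for this captcha, and build_tesseract_config is a hypothetical helper, not part of pytesseract.

```python
import string


def build_tesseract_config(psm: int, whitelist: str = "") -> str:
    """Assemble a Tesseract config string (hypothetical helper for illustration)."""
    parts = [f"--psm {psm}"]
    if whitelist:
        # tessedit_char_whitelist restricts which characters Tesseract may output.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Assumption: the captcha is one line of uppercase letters and digits.
config = build_tesseract_config(7, string.ascii_uppercase + string.digits)
# It would then be passed as: pytesseract.image_to_string(img, config=config)
```

Whether this beats the default settings depends on the image, so it is worth comparing both on a few samples.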

My approach is as follows:

import pytesseract
import cv2
import matplotlib.pyplot as plt
import numpy as np 

def remove_noise(img,threshold):
    """
    remove salt-and-pepper noise in a binary image
    """
    filtered_img = np.zeros_like(img)
    labels,stats= cv2.connectedComponentsWithStats(img.astype(np.uint8),connectivity=8)[1:3]

    label_areas = stats[1:, cv2.CC_STAT_AREA]
    for i,label_area in enumerate(label_areas):
        if label_area > threshold:
            filtered_img[labels==i+1] = 1
    return filtered_img


def preprocess(img_path):
    """
    convert the grayscale captcha image to a clean binary image
    """
    img = cv2.imread(img_path,0)
    blur = cv2.GaussianBlur(img, (3,3), 0)

    thresh = cv2.threshold(blur, 150, 255, cv2.THRESH_BINARY_INV)[1]

    filtered_img = 255-remove_noise(thresh,20)*255

    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    erosion = cv2.erode(filtered_img,kernel,iterations = 1)
    return erosion

def extract_letters(img):
    # Default Tesseract settings; the config string from the question is intentionally omitted.
    text = pytesseract.image_to_string(img)
    return text


img = preprocess('captcha.jpg')

text=extract_letters(img)
print(text)

plt.imshow(img,'gray')
plt.show()

This is the processed image.

[image: binary captcha image]
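
To make the idea behind remove_noise easier to follow, here is a dependency-free sketch of the same area filter on a tiny grid. It is only an illustration: remove_small_components and the demo grid are made up for this example, and the real code relies on cv2.connectedComponentsWithStats instead of a hand-written flood fill.

```python
from collections import deque


def remove_small_components(grid, min_area):
    """Keep only 8-connected groups of 1-pixels whose area exceeds min_area."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0 or seen[r][c]:
                continue
            # Flood-fill one component and collect its pixels.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Same rule as remove_noise: keep the component only if it is large enough.
            if len(component) > min_area:
                for y, x in component:
                    out[y][x] = 1
    return out


# Two isolated "noise" pixels and one 2x2 block standing in for a character stroke.
demo = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
cleaned = remove_small_components(demo, min_area=2)
```

Only the 2x2 block survives in cleaned; both single pixels are dropped because their area does not exceed the threshold.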

The script returns 18L9R.
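
The raw string from pytesseract usually carries trailing whitespace, such as the newline and form feed (\n\x0c) visible in the question's output. A minimal sketch for tidying it up follows; clean_ocr_text is a hypothetical helper, and the allowed alphabet (ASCII letters and digits) is an assumption about this captcha.

```python
import string

# Assumption: the captcha only contains ASCII letters and digits.
ALLOWED = set(string.ascii_letters + string.digits)


def clean_ocr_text(raw: str) -> str:
    """Drop surrounding whitespace (including form feed) and any character outside ALLOWED."""
    return "".join(ch for ch in raw.strip() if ch in ALLOWED)


result = clean_ocr_text("18L9R\n\x0c")
```

Filtering by an explicit alphabet also discards stray punctuation that OCR sometimes produces from leftover noise.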

Prefect