how to improve pytesseract arguments to work properly

Question

I would like to read this captcha using pytesseract:

I follow the advice here: Use pytesseract OCR to recognize text from an image

My code is:

import pytesseract
import cv2

def captcha_to_string(picture):
    image = cv2.imread(picture)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3,3), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    # Morph open to remove noise and invert image
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
    invert = 255 - opening

    cv2.imwrite('thresh.jpg', thresh)
    cv2.imwrite('opening.jpg', opening)
    cv2.imwrite('invert.jpg', invert)

    # Perform text extraction
    text = pytesseract.image_to_string(invert, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    return text

But my code returns 8\n\x0c which is nonsense.

This is how thresh looks like:

This is how opening looks like:

This is how invert looks like:

Can you help me, how can I improve captcha_to_string function to read the captcha properly? Thanks a lot.

score 2 · Accepted Answer · answered Nov 03 '21 at 18:29

You are on the right way. Removing the noise (small black spots in the inverted image) looks like the way to extract the text successfully.

FYI, the configuration of pytessearct makes the outcome worse only. So, I removed it.

My approach is as follows:

import pytesseract
import cv2
import matplotlib.pyplot as plt
import numpy as np 

def remove_noise(img,threshold):
    """
    remove salt-and-pepper noise in a binary image
    """
    filtered_img = np.zeros_like(img)
    labels,stats= cv2.connectedComponentsWithStats(img.astype(np.uint8),connectivity=8)[1:3]

    label_areas = stats[1:, cv2.CC_STAT_AREA]
    for i,label_area in enumerate(label_areas):
        if label_area > threshold:
            filtered_img[labels==i+1] = 1
    return filtered_img


def preprocess(img_path):
    """
    convert the grayscale captcha image to a clean binary image
    """
    img = cv2.imread(img_path,0)
    blur = cv2.GaussianBlur(img, (3,3), 0)

    thresh = cv2.threshold(blur, 150, 255, cv2.THRESH_BINARY_INV)[1]

    filtered_img = 255-remove_noise(thresh,20)*255

    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    erosion = cv2.erode(filtered_img,kernel,iterations = 1)
    return erosion

def extract_letters(img):
    text = pytesseract.image_to_string(img)#,config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

    return text


img = preprocess('captcha.jpg')

text=extract_letters(img)
print(text)

plt.imshow(img,'gray')
plt.show()

This is the processed image.

And, the script returns 18L9R.

how to improve pytesseract arguments to work properly

1 Answers1