Pytesseract unable to recognize characters in binary images

Question

Using various methods, I have transformed an image captcha so that it looks roughly like this:

However, when I run Pytesseract OCR on it, the package is unable to identify any characters. I think this is due to the line above the letters.

script.py

 cv2.imwrite(filename, imgOP)
 text = pytesseract.image_to_string(Image.open(filename))

The console shows no output for this image.

However, when I tried another image (given below), I got this output:

PGKQKf

This is wrong as well, again because of the line above the letter T.

I have used various techniques to clean the images, such as erosion, dilation, and the Probabilistic Hough Transform (result given below).

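# Hough Line Transform
import cv2
import numpy as np

img = cv2.imread('Output1.png')
edges = cv2.Canny(img, 1000, 1500)
minLineLength = 0
maxLineGap = 10000000000
# Pass minLineLength and maxLineGap by keyword: the fifth positional
# argument of HoughLinesP is the output array `lines`, not minLineLength.
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 15,
                        minLineLength=minLineLength, maxLineGap=maxLineGap)
# HoughLinesP returns None when no line is found.
if lines is not None:
    for line in lines:
        for x1, y1, x2, y2 in line:
            # Draw over each detected line in white.
            cv2.line(img, (x1, y1), (x2, y2), (255, 255, 255), 2)

cv2.imwrite('houghlines3.jpg', img)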

After the transformation, the image looks roughly like this:

No other combination of values for minLineLength and maxLineGap works either.
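For comparison, here is a minimal sketch of one alternative to Hough-based removal: scanning each row of the binarized image and blanking any horizontal run of ink longer than a threshold. It assumes the stray line is roughly horizontal and clearly longer than any single letter stroke, which may not hold for every captcha. The function name `remove_long_horizontal_runs` and the `min_run` parameter are hypothetical, chosen only for this illustration, and the image is assumed to be a 2-D NumPy array with ink as 1 and background as 0.

```python
import numpy as np


def remove_long_horizontal_runs(binary, min_run=20):
    """Blank out horizontal runs of ink that are at least min_run pixels long.

    binary: 2-D array with ink == 1 and background == 0 (an assumption).
    Returns a cleaned copy; the input is left untouched.
    """
    cleaned = binary.copy()
    for y, row in enumerate(binary):
        # Pad with zeros so every run has both a rising and a falling edge.
        padded = np.concatenate(([0], row.astype(np.int8), [0]))
        diff = np.diff(padded)
        starts = np.flatnonzero(diff == 1)   # first ink pixel of each run
        ends = np.flatnonzero(diff == -1)    # one past the last ink pixel
        for s, e in zip(starts, ends):
            if e - s >= min_run:
                cleaned[y, s:e] = 0
    return cleaned


# Synthetic example: a full-width line above a small blob.
img = np.zeros((10, 40), dtype=np.uint8)
img[2, :] = 1          # stray line spanning the full width
img[4:9, 5:10] = 1     # 5-pixel-wide blob standing in for a letter
cleaned = remove_long_horizontal_runs(img, min_run=20)
print(cleaned[2].sum(), cleaned[4:9, 5:10].sum())  # 0 25
```

Note that this also erases the pixels where the line crosses a letter, and it would remove a genuine letter stroke wider than `min_run`, so the threshold has to be tuned per image set.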

How should I proceed? I have looked into various techniques for making Tesseract more accurate, but I am unsure which one to use.

Other than Tesseract, are there any other techniques that could be applied to get the desired results?

I had thought of creating a mask: using an online tool, I converted the image into 0s and 1s (given below). However, how do I go about using it to identify the characters?
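One possible way to use such a 0/1 mask, sketched here under assumptions: treat it as a 2-D NumPy array (1 for ink, 0 for background), find the column ranges that contain ink, and cut the mask into one crop per character. Each crop can then be converted back to an 8-bit image and passed to an OCR engine or a classifier. The helper names `split_characters` and `mask_to_image` are hypothetical, and the approach only separates characters that do not touch or overlap horizontally.

```python
import numpy as np


def split_characters(mask):
    """Return (start, end) column ranges that contain ink in a 0/1 mask.

    mask: 2-D array, 1 where there is ink, 0 elsewhere (an assumption).
    The end index is exclusive, so mask[:, start:end] is one glyph crop.
    """
    has_ink = mask.any(axis=0).astype(np.int8)
    # Pad with zeros so runs touching the border still get both edges.
    padded = np.concatenate(([0], has_ink, [0]))
    diff = np.diff(padded)
    starts = np.flatnonzero(diff == 1)
    ends = np.flatnonzero(diff == -1)
    return list(zip(starts.tolist(), ends.tolist()))


def mask_to_image(mask):
    """Convert a 0/1 mask to an 8-bit image with black text on white."""
    return np.where(mask > 0, 0, 255).astype(np.uint8)


# Synthetic mask with two separate blobs standing in for two characters.
mask = np.zeros((6, 12), dtype=np.uint8)
mask[1:5, 1:3] = 1
mask[1:5, 6:9] = 1
ranges = split_characters(mask)
print(ranges)  # [(1, 3), (6, 9)]
crops = [mask_to_image(mask[:, s:e]) for s, e in ranges]
```

Each crop in `crops` is a plain `uint8` array, so it could be handed to `pytesseract.image_to_string` (for instance with Tesseract's single-character page segmentation mode, `--psm 10`) or to a separately trained classifier; whether that improves recognition on these particular captchas is untested here.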

I don't understand what the problem is. Just delete the line above the "T" and try again, then post the result Tesseract gives you. In short, do more preprocessing on the image and clean it. This is easy. - lucians, May 07 '18 at 19:19
@Link When you say "clean it", what do you mean? As mentioned in the question, I have used various methods such as erosion and dilation (with different iterations and kernels). These remove the line, but then pytesseract cannot identify the image. I want a method that lets pytesseract identify the characters with 100% accuracy. Sometimes a 5 is also mistaken for an S, so I am asking what methods could be added. - rut_0_1, May 10 '18 at 09:53
You should try MNIST (ONLY FOR NUMBERS) at this point. Tesseract is not perfect. Another solution could be to train Tesseract, but I don't know how to do this. With the MNIST dataset you can achieve good accuracy on numbers. Also, the images you posted can be cleaned 100% by deleting lines and points (check HoughLines and blob removal; look at some of the questions I have asked about this). - lucians, May 10 '18 at 10:25
@Link I have used this question of yours for reference: https://stackoverflow.com/questions/46472713/improve-houghlines-for-horizontal-lines-detect-python-opencv, but I am still not getting the desired results (see the updated question). I will look up training data sets for pytesseract and MNIST. - rut_0_1, May 13 '18 at 06:19

