Text Detection of Labels using PyTesseract

Question

I am building a label detection tool that automatically identifies images by their equipment number (19-V1083AI) and sorts them alphabetically. After identifying the contours of the equipment label, I use the pytesseract library to convert the image to a string. Although the code runs without errors, it never outputs the equipment number. It's my first time using the pytesseract library and the goodFeaturesToTrack function. Any help would be greatly appreciated!

Original Image

import numpy as np
import cv2
import imutils  # resize image
import pytesseract # convert img to string
from matplotlib import pyplot as plt
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Read the image file
image = cv2.imread('Car Images/s3.JPG')

# Resize the image - change width to 500
image = imutils.resize(image, width=500)


# Display the original image
cv2.imshow("Original Image", image)
cv2.waitKey(0)

# BGR to grayscale conversion (cv2.imread loads images in BGR order)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2.imshow("1 - Grayscale Conversion", gray)
cv2.waitKey(0)

# Noise removal with a bilateral filter (removes noise while preserving edges)
gray = cv2.bilateralFilter(gray, 11, 17, 17)
cv2.imshow("2 - Bilateral Filter", gray)
cv2.waitKey(0)


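# Detect up to 60 corners (quality level 0.001, minimum distance 10 px)
corners = cv2.goodFeaturesToTrack(gray, 60, 0.001, 10)

# Convert corner coordinates to integers (np.int0 is unavailable in NumPy 2.x)
corners = corners.astype(int)

# Mark each detected corner on the image
for i in corners:
    x, y = i.ravel()
    cv2.circle(image, (int(x), int(y)), 0, 255, -1)
    coord = np.where(np.all(image == (255, 0, 0), axis=-1))
plt.imshow(image)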

# Use tesseract to convert the image into a string
text = pytesseract.image_to_string(image, lang='eng')
print("Equipment Number is:", text)


plt.show()

Output Image2

Note: It worked for one of the images but not for the others: Output Image2

The comments on your other question https://stackoverflow.com/q/61309123/42346 seem useful. Did you consider those? – mechanical_meat Apr 27 '20 at 03:32
Yeah, I did, but the issue was that I got the code to correctly identify one of the images (similar to the others) while the others were still unsuccessful. So I believe the above code does work correctly, and it's a really minor problem that has something to do with how the matplotlib library works! – RR3327 Apr 27 '20 at 04:11
Ah, ok. I've been trying to get it to work, and the circle drawing part seems like it's not helping... – mechanical_meat Apr 27 '20 at 04:16
Appreciate it @mechanical_meat! I was working with different code before this (https://stackoverflow.com/questions/61203364/find-contours-based-on-edges/61204168#61204168), but then a comment on that question proposed a much simpler solution, so I am trying to work with that. – RR3327 Apr 27 '20 at 04:24
@mechanical_meat I added the picture the code worked with to the description above! I don't see any major differences between the two pictures that would stop the code from working for all test cases. Thanks again for all the help! – RR3327 Apr 27 '20 at 04:30
If you were to find a bunch of text in the image, would you be amenable to using a regular expression to get just the kind of text you're looking for? – mechanical_meat Apr 27 '20 at 04:48
Absolutely! That shouldn't be a problem. Could you expand a little bit on what you mean by regular expression? – RR3327 Apr 27 '20 at 04:51
So, you'd match a pattern. In this case something *like*: two digits, a hyphen, a letter, four digits, a space, and two letters. – mechanical_meat Apr 27 '20 at 04:56
Have a look here: https://www.researchgate.net/publication/255564283_Stroke_Width_Transform — , Apr 27 '20 at 06:10

mechanical_meat · Answer 1 · 2020-04-27T05:13:48.543

I found that a particular PyTesseract configuration option finds your text, along with some noise. The configuration options are explained here: https://stackoverflow.com/a/44632770/42346

For this task I chose --psm 11: "Sparse text. Find as much text as possible in no particular order."

Since PyTesseract returns more "text" in this mode, you can use a regular expression to filter out the noise.

This particular regular expression looks for two digits, a hyphen, five word characters (letters, digits, or underscores), a space, and then two more word characters. It can be adjusted to your equipment number format as necessary, but I'm reasonably confident this is a good solution because nothing else in the returned text resembles the equipment number.

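import re
import cv2
import pytesseract

image = cv2.imread('Fv0oe.jpg')
# --psm 11: sparse text, find as much text as possible in no particular order
text = pytesseract.image_to_string(image, lang='eng', config='--psm 11')

# Keep only the lines that look like an equipment number
for line in text.split('\n'):
    if re.match(r'^\d{2}-\w{5} \w{2}$', line):
        print(line)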

Result (with no image processing necessary):

19-V1083 AI
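
The filtering step can also be tried on its own, without Tesseract installed. The sketch below is only an illustration: `sample_text` is made up to mimic noisy OCR output (it is not real output from the image), and `find_equipment_numbers` is a hypothetical helper name. It strips each line before matching, on the assumption that OCR output may carry stray whitespace, and uses `re.fullmatch` so the whole line has to fit the pattern.

```python
import re

# Same pattern as above: two digits, a hyphen, five word characters,
# a space, two word characters.
EQUIPMENT_RE = re.compile(r'\d{2}-\w{5} \w{2}')


def find_equipment_numbers(text):
    """Return every line of `text` that is exactly an equipment number."""
    matches = []
    for line in text.splitlines():
        candidate = line.strip()
        if EQUIPMENT_RE.fullmatch(candidate):
            matches.append(candidate)
    return matches


# Made-up stand-in for noisy OCR output; not real Tesseract output.
sample_text = "\n".join([
    "~ |",
    "  19-V1083 AI  ",
    "",
    "WARNING",
    "19-V1083",
])

print(find_equipment_numbers(sample_text))
```

If the equipment numbers always follow the stricter shape described in the comments (a letter, then four digits, then two letters), the pattern could be tightened to something like `\d{2}-[A-Z]\d{4} [A-Z]{2}`, though that is an assumption about the label format.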

edited Apr 27 '20 at 05:13

answered Apr 27 '20 at 05:03

mechanical_meat


Thanks a lot for the detailed description along with the code! I wasn't familiar with the configuration option within PyTesseract. Converting the original image to grayscale and adding a bilateral filter improved detection for the majority of the images. However, some of the equipment numbers couldn't be detected at all. I am looking into those pictures to see if something is wrong with them, but if you could advise on what might be wrong, that would be great as well. – RR3327 Apr 27 '20 at 22:33
I'm honestly not at all sure why it cannot find the text with the default configuration. One thing is that the photo is taken from an angle rather than straight on. If you're going to look at the photography part of the process, maybe you can determine whether a better angle can be used to take the photos. – mechanical_meat Apr 27 '20 at 23:28
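
One way to narrow down which images fail, and under which configuration, might be to run the same image through several page segmentation modes and compare what comes back. The sketch below is an illustration under stated assumptions, not a tested recipe: `try_psm_modes`, `fake_ocr`, and the `ocr` parameter are hypothetical names, and the mode list (3, 6, 11) is only a starting point. The OCR call is passed in as a function so the loop can be exercised without Tesseract; with real images, `ocr` would be `pytesseract.image_to_string`.

```python
import re

EQUIPMENT_RE = re.compile(r'\d{2}-\w{5} \w{2}')


def try_psm_modes(image, ocr, modes=(3, 6, 11)):
    """Run `ocr` once per page segmentation mode.

    `ocr` is any callable accepting ocr(image, config=...), for example
    pytesseract.image_to_string. Returns {mode: [matching lines]}.
    """
    results = {}
    for mode in modes:
        text = ocr(image, config=f'--psm {mode}')
        results[mode] = [
            line.strip()
            for line in text.splitlines()
            if EQUIPMENT_RE.fullmatch(line.strip())
        ]
    return results


def fake_ocr(image, config=''):
    # Stand-in for pytesseract.image_to_string; it pretends that only
    # sparse-text mode (--psm 11) finds the label.
    return "noise\n19-V1083 AI\n" if config == '--psm 11' else "noise\n"


print(try_psm_modes(None, fake_ocr))
```

With a real image loaded through `cv2.imread`, the call would look like `try_psm_modes(image, pytesseract.image_to_string)`. An image for which every mode returns an empty list is probably worth re-examining at the photography or preprocessing stage.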
