Using Tesseract-OCR in Python to get number from images

Question

I have thousands of images of a scale, and I would like to extract the scale reading from each one. However, Tesseract returns wrong values. I have tried several ways of preprocessing the image but keep running into the same issue. As I understand it, after defining the region of interest, the image has to be converted to white text on a black background. I am new to Python and have tried a few functions to do this, without success. I would appreciate any help. The image is at the following link, as I couldn't upload it here because it is larger than 2 MiB: https://mega.nz/file/fZMUDRbL#tg4Tc2VmGMMdEpnZzt7blxZjVLdlhMci9jll0FLnIGI

import cv2
import pytesseract
import matplotlib.pyplot as plt
import numpy as np
import imutils

## Reading Image File
Filename = 'C:\\Users\\Abdullah\\Desktop\\Scale Reading\\'   #File Path For Images
IName = 'Disk_Test_1_09_07-00000_0.tif'   # Image Name
Image = cv2.imread(Filename + IName,0)


## Image Processing
Image_Crop = Image[1680:1890, 550:1240]   # Define Region of Interest of the image
#cv2.imshow("cropped", Image_Crop)         # Show Cropped Image
#cv2.waitKey(0)                           # Show Cropped Image
Mask = Image_Crop > 10                    # Threshold: True where the pixel value exceeds 10
Mask = np.array(Mask, dtype=np.uint8)
plt.imshow(Mask, alpha=1) # Set Opacity (Max 1)
ret,Binary = cv2.threshold(Mask,0,255,cv2.THRESH_BINARY)
#plt.imshow(Image_Crop, cmap="gray")          # Transform Image to Gray
#plt.show()
plt.imshow(Binary,'gray',vmin=0,vmax=255)
plt.show()


## Number Recognition
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # Path to the Tesseract-OCR executable
data = pytesseract.image_to_string(Binary, lang='eng',config='--psm 6')
print(data)

Here is the image after processing

[image]

It is better to use black text on a white background. See the Tesseract documentation: [Improving the quality of the output](https://tesseract-ocr.github.io/tessdoc/ImproveQuality). You can also run `tesseract.exe --help`, `tesseract.exe --help-extra`, or `tesseract.exe --help-psm` directly in the console to see all the options you can use in `config=`. – furas, Jul 16 '21 at 01:29
I have just tried black text on a white background, but I still run into the same issue: the number is recognized incorrectly. I updated the code above, and the number now appears clearly, but Tesseract returns random characters. – Abdullah, Jul 16 '21 at 04:16
You could add the processed image to the question. `tesseract` may have problems when the text is too small or too big, so you may have to resize the image and/or change it to 300 DPI (dots per inch). And if you know where the text can be, you could crop the image. – furas, Jul 16 '21 at 04:27
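
The suggestions in the comments (black text on a white background, resizing, cropping) could be sketched roughly as below. This is only an illustration, not a tested fix for the linked image. The helper names `prepare_for_ocr` and `read_number` are hypothetical, as are the values for `threshold`, `scale` and `border`. It also assumes the digits are brighter than the background in the cropped region. The `--psm 7` mode (single text line) and the `tessedit_char_whitelist` setting are standard Tesseract options, but whether the whitelist is honoured depends on the Tesseract version and OCR engine in use, and including `.` in it assumes the reading may contain a decimal point.

```python
import numpy as np


def prepare_for_ocr(gray, threshold=10, scale=2, border=20):
    """Return a black-on-white uint8 image built from a grayscale crop.

    Assumes the digits are brighter than the background; threshold,
    scale and border are illustrative values, not tuned ones.
    """
    gray = np.asarray(gray, dtype=np.uint8)
    # Pixels brighter than the threshold are treated as text and set to
    # black (0); everything else becomes white background (255).
    binary = np.where(gray > threshold, 0, 255).astype(np.uint8)
    # Nearest-neighbour upscaling by an integer factor.
    binary = np.repeat(np.repeat(binary, scale, axis=0), scale, axis=1)
    # White margin so the digits do not touch the image edge.
    return np.pad(binary, border, mode="constant", constant_values=255)


def read_number(image):
    """Run Tesseract on a prepared image, restricted to digits and a dot."""
    # Imported here so the preprocessing above can be used without it;
    # requires pytesseract and a Tesseract installation.
    import pytesseract

    config = "--psm 7 -c tessedit_char_whitelist=0123456789."
    return pytesseract.image_to_string(image, config=config).strip()
```

With the question's variables, this would be used as `read_number(prepare_for_ocr(Image_Crop))`. If OpenCV is preferred, `cv2.resize` with an interpolation mode such as `cv2.INTER_CUBIC` could replace the `np.repeat` upscaling.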

