
I have been working on a project that involves extracting text from an image. From my research, Tesseract is one of the best libraries available, so I decided to use it along with OpenCV. OpenCV is needed for image manipulation.

I have been playing a lot with the Tesseract engine, and it does not seem to be giving me the expected results. I have attached the image as a reference. The output I got is:

1] =501 [

Instead, the expected output is:

TM10-50%L

What I have done so far:

  • Removing noise
  • Adaptive thresholding
  • Sending it to the Tesseract OCR engine

Are there any other suggestions to improve the algorithm?

Thanks in advance.

Snippet of the code:

import cv2
import sys
import pytesseract

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('Usage: python ocr_simple.py image.jpg')
        sys.exit(1)

    # Read image path from command line and load as grayscale
    imPath = sys.argv[1]
    gray = cv2.imread(imPath, cv2.IMREAD_GRAYSCALE)
    # Blur to suppress noise
    blur = cv2.GaussianBlur(gray, (9,9), 0)
    # Binarize with adaptive thresholding
    thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 5, 3)
    text = pytesseract.image_to_string(thresh)
    print(text)

Images attached. The first image is the original image: [Original image]

The second image is what has been fed to Tesseract: [Input to Tesseract]

  • Please attach the sample image! – Markus May 10 '22 at 13:41
  • You need to preprocess the image before throwing it into OCR, with the goal of getting the text to extract in black in the foreground and the background in white. – nathancy May 11 '22 at 01:31
  • @Markus, Images attached. Thanks for pointing it out. – srikanth2016 May 11 '22 at 06:53
  • @nathancy Hi man, much thanks for your suggestions. I will definitely go over them. Meanwhile, I have attached a couple of images, the original and the one being fed to `Tesseract`. Could you share your suggestions on this as well? – srikanth2016 May 11 '22 at 06:54
  • @nathancy We are already preprocessing the image before feeding it to Tesseract. I attached the image for reference. Do you see any problem with that as well? – srikanth2016 May 11 '22 at 06:55
  • @srikanth2016 you need to extract the ROI then perform OCR on the ROI. Tesseract has no idea what all the other noise contours are so you have to filter them all out first. See my answer – nathancy May 11 '22 at 07:20

1 Answer


Before performing OCR on an image, it's important to preprocess the image. The idea is to obtain a processed image where the text to extract is in black with the background in white. For this specific image, we need to obtain the ROI before we can OCR.

To do this, we can convert to grayscale, apply a slight Gaussian blur, then adaptive threshold to obtain a binary image. From here, we can apply morphological closing to merge individual letters together. Next we find contours, filter by contour area, and extract the ROI. We perform text extraction using the --psm 6 configuration option to assume a single uniform block of text. Take a look here for more options.
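Since pytesseract forwards the config string verbatim to the Tesseract CLI, other options can be combined with --psm 6. Below is a minimal sketch assuming the expected strings use a known character set; the whitelist and the file path are hypothetical, and note that tessedit_char_whitelist was ignored by the LSTM engine in some Tesseract 4.0 builds:

import cv2
import pytesseract

# Hypothetical: restrict recognition to characters that can appear in
# labels like "TM10-50%L", in addition to treating the ROI as a single
# uniform block of text (--psm 6).
roi = cv2.imread('roi.png')  # stand-in path for the extracted ROI
config = '--psm 6 -c tessedit_char_whitelist=TM0123456789-=%L'
print(pytesseract.image_to_string(roi, lang='eng', config=config))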


Detected ROI


Extracted ROI


Result from Pytesseract OCR

TM10=50%L

Code

import cv2
import pytesseract

# Point pytesseract to the local Tesseract binary (Windows-style path; adjust for your system)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Grayscale, Gaussian blur, Adaptive threshold
image = cv2.imread('1.jpg')
original = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 5, 5)

# Perform morph close to merge letters together
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=3)

# Find contours, filter by contour area, extract the ROI
cnts, _ = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]
for c in cnts:
    area = cv2.contourArea(c)
    if 1800 < area < 2500:  # area bounds tuned for this specific image
        x,y,w,h = cv2.boundingRect(c)
        ROI = original[y:y+h, x:x+w]
        cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), 3)

# Perform text extraction (assumes a contour passed the area filter, so ROI is defined)
ROI = cv2.GaussianBlur(ROI, (3,3), 0)
data = pytesseract.image_to_string(ROI, lang='eng', config='--psm 6')
print(data)

cv2.imshow('ROI', ROI)
cv2.imshow('close', close)
cv2.imshow('image', image)
cv2.waitKey()
  • It sounds like a great answer. I am wondering if the area filtering would break in case there are many such texts in the picture. Something like: `TM 10-30%L` at the top somewhere, `TM 10-40%L` in the middle somewhere, `TM 10-50%L` at the end. I can clarify my question if it is not self-explanatory. – Hemant Bhargava May 11 '22 at 09:19
  • @HemantBhargava It may; the answer was designed for this specific image. To make it more robust, you could add in aspect-ratio filtering as well. There's no single solution that would work for all cases using simple image processing techniques; you would have to train your own custom deep/machine learning model to handle all cases. – nathancy May 11 '22 at 09:44
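As a concrete illustration of the aspect-ratio filtering mentioned in the comment above, here is a minimal sketch extending the answer's contour loop. The area bounds and the 2.0 ratio threshold are hypothetical values that would need tuning per image, and the input path is a stand-in:

import cv2

# Sketch: filter candidate text regions by bounding-box aspect ratio in
# addition to contour area. All threshold values are illustrative only.
image = cv2.imread('1.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 5, 5)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=3)

cnts, _ = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]
rois = []
for c in cnts:
    area = cv2.contourArea(c)
    x, y, w, h = cv2.boundingRect(c)
    aspect_ratio = w / float(h)
    # Labels like "TM 10-50%L" are much wider than tall, so keep only
    # wide contours of a plausible size (hypothetical bounds).
    if 1000 < area < 4000 and aspect_ratio > 2.0:
        rois.append(image[y:y+h, x:x+w])
# Each ROI in rois can then be passed to pytesseract individually.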