I have an image with multiple red rectangles from text detection, and the detection output is good.

I'm using https://github.com/autonise/CRAFT-Remade for text detection.

Original:

[original image]

My image:

[image with red detection rectangles]

I tried to extract the text inside each rectangle with pytesseract, but without success. Output result:

r
2
aseeaaei
ae

How can we extract the text from this image accurately?

Part of the code:

import os

import cv2
import numpy as np


def saveResult(img_file, img, boxes, dirname='./result/', verticals=None, texts=None):
    """ save text detection result one by one
    Args:
        img_file (str): image file name
        img (array): raw image context
        boxes (array): array of detected boxes
            Shape: [num_detections, 4] for BB output / [num_detections, 8] for QUAD output
    Return:
        None
    """
    img = np.array(img)

    # make result file list
    filename, file_ext = os.path.splitext(os.path.basename(img_file))

    # result directory
    res_file = dirname + "res_" + filename + '.txt'
    res_img_file = dirname + "res_" + filename + '.jpg'

    if not os.path.isdir(dirname):
        os.mkdir(dirname)

    with open(res_file, 'w') as f:
        for i, box in enumerate(boxes):
            poly = np.array(box).astype(np.int32).reshape((-1))
            strResult = ','.join([str(p) for p in poly]) + '\r\n'
            f.write(strResult)

            poly = poly.reshape(-1, 2)
            cv2.polylines(img, [poly.reshape((-1, 1, 2))], True, color=(0, 0, 255), thickness=2)  # HERE: the red rectangles are drawn on this line
            ptColor = (0, 255, 255)
            if verticals is not None:
                if verticals[i]:
                    ptColor = (255, 0, 0)

            if texts is not None:
                font = cv2.FONT_HERSHEY_SIMPLEX
                font_scale = 0.5
                cv2.putText(img, "{}".format(texts[i]), (poly[0][0]+1, poly[0][1]+1), font, font_scale, (0, 0, 0), thickness=1)
                cv2.putText(img, "{}".format(texts[i]), tuple(poly[0]), font, font_scale, (0, 255, 255), thickness=1)

    # Save result image
    cv2.imwrite(res_img_file, img)

After your comment, here's the result:

[modified image]

And the Tesseract result is good for a first test, but not accurate:

400
300
200
“2615
1950
24
16
  • Could you add your original image? You most likely have to perform preprocessing on the image to get good output results. The foreground text should be black and the background white – nathancy Oct 24 '19 at 20:02
  • @nathancy Of course, here's the original image. How can I do this? – Kate Oct 24 '19 at 20:11
  • Since you already have the bounding box of each number, you can invert the ROI, then threshold so the text is black and the background white. From there, throw each ROI into Pytesseract (a minimal sketch of this appears after the comments). If it's possible, you should also add the code you use to generate the red rectangles – nathancy Oct 24 '19 at 20:41
  • @nathancy, post edited with the result after your comment! I don't know why Tesseract doesn't extract it correctly, but black text is the right way :) – Kate Oct 24 '19 at 21:14
  • Your output image looks quite faded, so Pytesseract might have trouble. Think about it: if a human can barely identify the text, how can the computer identify it? It seems like the result is better, but you may have to use a different configuration setting. Take a look at my answer below – nathancy Oct 24 '19 at 21:31
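
As a minimal sketch of the ROI approach suggested in the comments (assuming the same boxes array that saveResult() receives; the helper name ocr_boxes is hypothetical), each box can be cropped, inverted and thresholded, then passed to Pytesseract individually:

import cv2
import numpy as np
import pytesseract

def ocr_boxes(img_file, boxes):
    """Sketch: crop each detected box, invert + threshold, then OCR it."""
    gray = cv2.cvtColor(cv2.imread(img_file), cv2.COLOR_BGR2GRAY)
    results = []
    for box in boxes:
        # Each box is assumed to be the same 4-point polygon saveResult() receives
        poly = np.array(box).astype(np.int32).reshape(-1, 2)
        x, y, w, h = cv2.boundingRect(poly)
        roi = gray[y:y+h, x:x+w]
        # Otsu's threshold with inversion: light text on a dark background
        # becomes black text on white, which Tesseract handles best
        roi = cv2.threshold(roi, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
        # --psm 7 treats each crop as a single line of text
        results.append(pytesseract.image_to_string(roi, config='--psm 7').strip())
    return results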

1 Answer


When using Pytesseract to extract text, preprocessing the image is extremely important. In general, we want to preprocess the image so that the text to extract is black and the background is white. To do this, we can use Otsu's threshold to obtain a binary image, then perform morphological operations to filter and remove noise. Here's a pipeline:

  • Convert image to grayscale and resize
  • Otsu's threshold for binary image
  • Invert image and perform morphological operations
  • Find contours
  • Filter using contour approximation, aspect ratio, and contour area
  • Remove unwanted noise
  • Perform text recognition

After converting to grayscale, we resize the image using imutils.resize(), then apply Otsu's threshold to obtain a binary image. The image is now only black and white, but there is still unwanted noise.

From here we invert the image and perform morphological operations with a horizontal kernel. This step merges the text into single contours, so we can filter out the unwanted lines and small blobs.

Now we find contours and filter using a combination of contour approximation, aspect ratio, and contour area to isolate the unwanted sections. The removed noise is highlighted in green.

Now that the noise is removed, we invert the image again so the desired text is black, then perform text extraction. I've also noticed that adding a slight blur enhances recognition. Here's the cleaned image we perform text extraction on.

We give Pytesseract the --psm 6 configuration since we want to treat the image as a uniform block of text. Here's the result from Pytesseract:

6745 63 6 10.50
2245 21 18 17
525 4 22 0.18
400 4 a 0.50
300 3 4 0.75
200 2 3 0.22
2575 24 3 0.77
1950 ii 12 133

The output isn't perfect, but it's close. You can experiment with additional Tesseract configuration settings.

import cv2
import pytesseract
import imutils

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Resize, grayscale, Otsu's threshold
image = cv2.imread('1.png')
image = imutils.resize(image, width=500)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Invert image and perform morphological operations
inverted = 255 - thresh
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15,3))
close = cv2.morphologyEx(inverted, cv2.MORPH_CLOSE, kernel, iterations=1)

# Find contours and filter using aspect ratio and area
cnts = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    area = cv2.contourArea(c)
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.01 * peri, True)
    x,y,w,h = cv2.boundingRect(approx)
    aspect_ratio = w / float(h)
    if (aspect_ratio >= 2.5 or area < 75):
        cv2.drawContours(thresh, [c], -1, (255,255,255), -1)

# Blur and perform text extraction
thresh = cv2.GaussianBlur(thresh, (3,3), 0)
data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')
print(data)

cv2.imshow('close', close)
cv2.imshow('thresh', thresh)
cv2.waitKey()
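
If --psm 6 isn't accurate enough, one quick experiment (a sketch, not part of the answer itself, assuming the cleaned thresh image from the code above) is to compare a few page segmentation modes:

# Compare several page segmentation modes on the cleaned image
# 6 = uniform block of text, 7 = single text line,
# 11 = sparse text, 13 = raw line (bypasses layout analysis)
for psm in (6, 7, 11, 13):
    result = pytesseract.image_to_string(thresh, config='--psm {}'.format(psm))
    print('--psm {}:'.format(psm))
    print(result)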
  • Wow, interesting approach! Thank you, the last image is clean and the bold font is perfect for extraction. – Kate Oct 24 '19 at 21:47
  • In the tessdata directory, there is another directory called configs, which contains predefined config parameter scripts to ease your recognition tasks. One of the scripts, called digits, limits your recognition character set to digits only, so it would tremendously help with digits-only recognition. – Erdogan Kurtur Oct 26 '20 at 20:55
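
A minimal sketch of the digits-only idea from the comment above, assuming the cleaned thresh image from the answer; the whitelist variant is an alternative in case the digits config file isn't present in your tessdata install:

# Use the predefined 'digits' config file shipped in tessdata/configs
data = pytesseract.image_to_string(thresh, config='--psm 6 digits')

# Alternative: restrict the character set explicitly with a whitelist
data = pytesseract.image_to_string(
    thresh, config='--psm 6 -c tessedit_char_whitelist=0123456789.')
print(data)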