Using YOLO or other image recognition techniques to identify all alphanumeric text present in images

Question

I have multiple images diagram, all of which contains labels as alphanumeric characters instead of just the text label itself. I want my YOLO model to identify all the numbers & alphanumeric characters present in it.

How can I train my YOLO model to do the same. The dataset can be found here. https://drive.google.com/open?id=1iEkGcreFaBIJqUdAADDXJbUrSj99bvoi

For example : see the bounding boxes. I want YOLO to detect wherever the text are present. However currently its not necessary to identify the text inside it.

Also the same needs to be done for these type of images

The images can be downloaded here

This is what I have tried using opencv but it does not work for all the images in the dataset.

import cv2
import numpy as np
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Users\HPO2KOR\AppData\Local\Tesseract-OCR\tesseract.exe"

image = cv2.imread(r'C:\Users\HPO2KOR\Desktop\Work\venv\Patent\PARTICULATE DETECTOR\PD4.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
clean = thresh.copy()

horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15,1))
detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(clean, [c], -1, 0, 3)

vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,30))
detect_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(detect_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(clean, [c], -1, 0, 3)

cnts = cv2.findContours(clean, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    area = cv2.contourArea(c)
    if area < 100:
        cv2.drawContours(clean, [c], -1, 0, 3)
    elif area > 1000:
        cv2.drawContours(clean, [c], -1, 0, -1)
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.02 * peri, True)
    x,y,w,h = cv2.boundingRect(c)
    if len(approx) == 4:
        cv2.rectangle(clean, (x, y), (x + w, y + h), 0, -1)

open_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2,2))
opening = cv2.morphologyEx(clean, cv2.MORPH_OPEN, open_kernel, iterations=2)
close_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,2))
close = cv2.morphologyEx(opening, cv2.MORPH_CLOSE, close_kernel, iterations=4)
cnts = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    area = cv2.contourArea(c)
    if area > 500:
        ROI = image[y:y+h, x:x+w]
        ROI = cv2.GaussianBlur(ROI, (3,3), 0)
        data = pytesseract.image_to_string(ROI, lang='eng',config='--psm 6')
        if data.isalnum():
            cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), 2)
            print(data)

cv2.imwrite('image.png', image)
cv2.imwrite('clean.png', clean)
cv2.imwrite('close.png', close)
cv2.imwrite('opening.png', opening)
cv2.waitKey()

Is there any model or any opencv technique or some pre trained model that can do the same for me? I just need the bounding boxes around all the alphanumeric characters present in the images. After that I need to identify whats present in it. However the second part is not important currently.

that does not work for all the images – Feb 25 '20 at 04:24 — , Feb 25 '20 at 04:24

score 8 · Accepted Answer · answered Feb 26 '20 at 02:48

A possible approach is to use the EAST (Efficient and Accurate Scene Text) deep learning text detector based on Zhou et al.’s 2017 paper, EAST: An Efficient and Accurate Scene Text Detector. The model was originally trained for detecting text in natural scene images but it may be possible to apply it on diagram images. EAST is quite robust and is capable of detecting blurred or reflective text. Here is a modified version of Adrian Rosebrock's implementation of EAST. Instead of applying the text detector directly on the image, we can try to remove as much non-text objects on the image before performing text detection. The idea is to remove horizontal lines, vertical lines, and non-text contours (curves, diagonals, circular shapes) before applying detection. Here's the results with some of your images:

Input -> Non-text contours to remove in green

Result

Other images

The pretrained frozen_east_text_detection.pb model necessary to perform text detection can be found here. Although the model catches most of the text, the results are not 100% accurate and has occasional false positives probably due to how it was trained on natural scene images. To obtain more accurate results you would probably have to train your own custom model. But if you want a decent out-of-the-box solution then this should work you. Check out Adrian's OpenCV Text Detection (EAST text detector) blog post for a more comprehensive explanation of the EAST text detector.

Code

from imutils.object_detection import non_max_suppression
import numpy as np
import cv2

def EAST_text_detector(original, image, confidence=0.25):
    # Set the new width and height and determine the changed ratio
    (h, W) = image.shape[:2]
    (newW, newH) = (640, 640)
    rW = W / float(newW)
    rH = h / float(newH)

    # Resize the image and grab the new image dimensions
    image = cv2.resize(image, (newW, newH))
    (h, W) = image.shape[:2]

    # Define the two output layer names for the EAST detector model that
    # we are interested -- the first is the output probabilities and the
    # second can be used to derive the bounding box coordinates of text
    layerNames = [
        "feature_fusion/Conv_7/Sigmoid",
        "feature_fusion/concat_3"]

    net = cv2.dnn.readNet('frozen_east_text_detection.pb')

    # Construct a blob from the image and then perform a forward pass of
    # the model to obtain the two output layer sets
    blob = cv2.dnn.blobFromImage(image, 1.0, (W, h), (123.68, 116.78, 103.94), swapRB=True, crop=False)
    net.setInput(blob)
    (scores, geometry) = net.forward(layerNames)

    # Grab the number of rows and columns from the scores volume, then
    # initialize our set of bounding box rectangles and corresponding
    # confidence scores
    (numRows, numCols) = scores.shape[2:4]
    rects = []
    confidences = []

    # Loop over the number of rows
    for y in range(0, numRows):
        # Extract the scores (probabilities), followed by the geometrical
        # data used to derive potential bounding box coordinates that
        # surround text
        scoresData = scores[0, 0, y]
        xData0 = geometry[0, 0, y]
        xData1 = geometry[0, 1, y]
        xData2 = geometry[0, 2, y]
        xData3 = geometry[0, 3, y]
        anglesData = geometry[0, 4, y]

        # Loop over the number of columns
        for x in range(0, numCols):
            # If our score does not have sufficient probability, ignore it
            if scoresData[x] < confidence:
                continue

            # Compute the offset factor as our resulting feature maps will
            # be 4x smaller than the input image
            (offsetX, offsetY) = (x * 4.0, y * 4.0)

            # Extract the rotation angle for the prediction and then
            # compute the sin and cosine
            angle = anglesData[x]
            cos = np.cos(angle)
            sin = np.sin(angle)

            # Use the geometry volume to derive the width and height of
            # the bounding box
            h = xData0[x] + xData2[x]
            w = xData1[x] + xData3[x]

            # Compute both the starting and ending (x, y)-coordinates for
            # the text prediction bounding box
            endX = int(offsetX + (cos * xData1[x]) + (sin * xData2[x]))
            endY = int(offsetY - (sin * xData1[x]) + (cos * xData2[x]))
            startX = int(endX - w)
            startY = int(endY - h)

            # Add the bounding box coordinates and probability score to
            # our respective lists
            rects.append((startX, startY, endX, endY))
            confidences.append(scoresData[x])

    # Apply non-maxima suppression to suppress weak, overlapping bounding
    # boxes
    boxes = non_max_suppression(np.array(rects), probs=confidences)

    # Loop over the bounding boxes
    for (startX, startY, endX, endY) in boxes:
        # Scale the bounding box coordinates based on the respective
        # ratios
        startX = int(startX * rW)
        startY = int(startY * rH)
        endX = int(endX * rW)
        endY = int(endY * rH)

        # Draw the bounding box on the image
        cv2.rectangle(original, (startX, startY), (endX, endY), (36, 255, 12), 2)
    return original

# Convert to grayscale and Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
clean = thresh.copy()

# Remove horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15,1))
detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(clean, [c], -1, 0, 3)

# Remove vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,30))
detect_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(detect_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(clean, [c], -1, 0, 3)

# Remove non-text contours (curves, diagonals, circlar shapes)
cnts = cv2.findContours(clean, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    area = cv2.contourArea(c)
    if area > 1500:
        cv2.drawContours(clean, [c], -1, 0, -1)
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.02 * peri, True)
    x,y,w,h = cv2.boundingRect(c)
    if len(approx) == 4:
        cv2.rectangle(clean, (x, y), (x + w, y + h), 0, -1)

# Bitwise-and with original image to remove contours
filtered = cv2.bitwise_and(image, image, mask=clean)
filtered[clean==0] = (255,255,255)

# Perform EAST text detection
result = EAST_text_detector(image, filtered)

cv2.imshow('filtered', filtered)
cv2.imshow('result', result)
cv2.waitKey()

To this day I'm still surprised the number of people that show up with extremely similar CV questions in a matter of days. It almost look like guys from the same image processing class are seeking for help to finish their homework or just looking for someone to do the homework for them. It's a really bizarre "coincidence". — karlphillip, Feb 26 '20 at 21:40
@karlphillip Maybe this question looks familiar because the OP posted it about a week ago. He pretty much wants a CTRL+C, CTRL+V answer that covers all his cases right out of the box, so, I guess you might see this very same question again in a couple of weeks! — stateMachine, Feb 27 '20 at 01:56

score 6 · Answer 2 · answered Feb 28 '20 at 13:35

6

For convenience sake I'd like to add the package keras_ocr. It can easily be installed with pip, and is based on the CRAFT text detector, which is a bit newer than the EAST detector if I'm not wrong.

Next to the detection it already does some OCR too! The results are as seen below, see this as an alternative, maybe easier to implement, than the accepted answer.

answered Feb 28 '20 at 13:35

Victor Sonck

396
3
5

Hi victor, does it work for at least 70% of my images? – Feb 28 '20 at 13:37
You have not included labels in your dataset. So I can't really tell you how much % of the images it works on, if I don't have a way to verify if it worked by comparing it to a label. However, it's a pip package, so it should be easy enough for you to run it on your dataset and see for yourself :) – Victor Sonck Feb 29 '20 at 09:44

Yuri Feldman · Answer 3 · 2020-02-24T08:45:07.253

What you're describing appears to be OCR (Optical character recognition). One OCR engine I know of is tesseract, although there is also this one from IBM and others.

As YOLO was originally trained for a very different task, to use it for localizing text will likely require to retrain it from scratch. One could try to use existing packages (adapted to your specific setting) for ground truth (although it is worth to remember that the model would generally be only at most as good as the ground truth). Or, perhaps more easily, generate synthetic data for training (i.e. add text in positions you choose to existing drawings then train to localize it).

Alternatively, if all of your target images are structured similar to the above, one could try to create ground truth using classic CV heuristics as you did above to separate/segment out symbols, followed by classification using a CNN trained on MNIST or similar to determine if a given blob contains a symbol.

For the case you do opt for YOLO - there are existing implementations in python, e.g. I had some experience with this one - should be fairly straightforward to set up training with your own ground truth.

Finally, if using YOLO or CNN is not a goal in itself but rather only the solution, any of the above "ground truth" could be used directly as a solution, and not for training a model.

Hope I understood your question correctly

If you can give the code for the same, as this question contains bounty — , Feb 24 '20 at 07:55
the task is to ultimately get the text but I am first trying to identify all the alphanumeric characters in it then use OCR for the same once identified — , Feb 24 '20 at 07:56
None of what I proposed is really an out-of-the-box solution, and algorithmic code would not be short or simple I think, so I'll leave it at the level of ideas :-). p.s. thanks for the upvote! — Yuri Feldman, Feb 24 '20 at 08:38

Using YOLO or other image recognition techniques to identify all alphanumeric text present in images

3 Answers3

Linked