
I've been trying to clean up images for OCR, specifically to remove the lines:

[input image: text with lines drawn through it]

I need to remove these lines so I can further process the image in some cases. I'm getting pretty close, but a lot of the time the threshold takes away too much of the text:

    import cv2

    img = cv2.imread('1.png', 0)  # adaptiveThreshold expects a single-channel image

    copy = img.copy()
    blur = cv2.GaussianBlur(copy, (9, 9), 0)
    thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 30)

    # Dilate to merge nearby components, then filter contours by area
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
    dilate = cv2.dilate(thresh, kernel, iterations=2)

    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]

    for c in cnts:
        area = cv2.contourArea(c)
        if area > 300:
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(copy, (x, y), (x + w, y + h), (36, 255, 12), 3)

Edit: Additionally, hard-coded constants will not work if the font changes. Is there a generic way to do this?

K41F4r
  • Some of these lines, or fragments of them, have the same characteristics as legitimate text, and it will be difficult to remove them without spoiling valid text. If that applies here, you might exploit the fact that they are longer than characters and somewhat isolated, so a first step could be to estimate the size and spacing of the characters. –  Dec 03 '19 at 15:50
  • @YvesDaoust How would one go about finding the closeness of characters? (since filtering purely on size gets mixed up with the characters a lot of the time) – K41F4r Dec 05 '19 at 11:02
    You could find, for every blob, the distance to its closest neighbor. Then by histogram analysis of the distances, you would find a threshold between "close" and "apart" (something like the mode of the distribution), or between "surrounded" and "isolated". –  Dec 05 '19 at 11:24
  • In case of multiple small lines near each other wouldn't their closest neighbor be the other small line? Would calculating average distance to all other blobs be too costly? – K41F4r Dec 05 '19 at 13:01
  • "wouldn't their closest neighbor be the other small line?": good objection, your Honor. In fact a bunch of close short segments do not differ from legit text, though in a completely unlikely arrangement. You may have to regroup the fragments of broken lines. I am not sure that the average distance to all would rescue you. –  Dec 05 '19 at 13:13
  • Another option is to classify the pieces as line-fragment-like and character-like and discard the line-fragment-like that don't have close character-like neighbors. You can also perform OCR blindly on all fragments and use the recognition results combined to distance to discard clutter. –  Dec 05 '19 at 13:18
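
The nearest-neighbour idea from these comments can be sketched roughly as follows. This is my illustration, not code from the discussion: in practice the centroids would come from something like `cv2.connectedComponentsWithStats` on the thresholded image; here they are synthetic so the snippet is self-contained, and the `3 * median` split is an arbitrary assumed heuristic.

```python
import numpy as np

def nearest_neighbor_distances(centroids):
    """For each blob centroid, the distance to its closest other centroid."""
    diff = centroids[:, None, :] - centroids[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)          # ignore distance to self
    return d.min(axis=1)

# Synthetic centroids: four evenly spaced "characters" and two far-off
# "line fragments" standing in for isolated noise
chars = np.array([[0, 0], [10, 0], [20, 0], [30, 0]], dtype=float)
noise = np.array([[200, 200], [400, 50]], dtype=float)
centroids = np.vstack([chars, noise])

nnd = nearest_neighbor_distances(centroids)
# Split "close" from "isolated" with a multiple of the typical gap
threshold = 3 * np.median(nnd)
isolated = nnd > threshold               # True only for the two fragments
```

As the comments note, this breaks down when line fragments cluster together, since their nearest neighbour is then another fragment; the histogram-analysis step would be needed to pick the threshold more carefully.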

1 Answer


Here's an idea. We break this problem up into several steps:

  1. Determine average rectangular contour area. We threshold then find contours and filter using the bounding rectangle area of the contour. The reason we do this is because of the observation that any typical character will only be so big whereas large noise will span a larger rectangular area. We then determine the average area.

  2. Remove large outlier contours. We iterate through contours again and remove the large contours if they are 5x larger than the average contour area by filling in the contour. Instead of using a fixed threshold area, we use this dynamic threshold for more robustness.

  3. Dilate with a vertical kernel to connect characters. The idea is take advantage of the observation that characters are aligned in columns. By dilating with a vertical kernel we connect the text together so noise will not be included in this combined contour.

  4. Remove small noise. Now that the text to keep is connected, we find contours and remove any contours smaller than 4x the average contour area.

  5. Bitwise-and to reconstruct image. Since we only have desired contours to keep on our mask, we bitwise-and to preserve the text and get our result.
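
The vertical dilation in step 3 can be illustrated without OpenCV. For `cv2.MORPH_RECT` the structuring element is just an all-ones array, so a naive hand-rolled dilation (illustrative only, far slower than `cv2.dilate`) shows why a tall kernel bridges vertically stacked blobs while a wide one does not:

```python
import numpy as np

def naive_dilate(img, kernel):
    """Toy binary dilation: a pixel becomes 1 if the kernel window
    centred on it overlaps any foreground pixel. Illustrative only."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            if (padded[y:y + kh, x:x + kw] & kernel).any():
                out[y, x] = 1
    return out

# Two blobs stacked in one column with a 2-pixel gap, like two characters
col = np.zeros((9, 3), dtype=np.uint8)
col[1:3, 1] = 1   # upper blob
col[5:7, 1] = 1   # lower blob

tall = np.ones((5, 1), dtype=np.uint8)   # vertical kernel (cv2 size (1, 5))
wide = np.ones((1, 5), dtype=np.uint8)   # horizontal kernel, for contrast

merged = naive_dilate(col, tall)         # gap is bridged into one blob
still_split = naive_dilate(col, wide)    # gap remains
```

Note that `cv2.getStructuringElement` takes its size as (width, height), so the `(2, 5)` kernel in the code below is 2 wide and 5 tall.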


Here's a visualization of the process:

We apply Otsu's threshold to obtain a binary image, then find contours to determine the average rectangular contour area. From here we remove the large outlier contours (highlighted in green) by filling them in:

[binary image after Otsu's threshold] [large outlier contours removed]

Next we construct a vertical kernel and dilate to connect the characters. This step connects all the desired text to keep and isolates the noise into individual blobs.

[characters connected by vertical dilation]

Now we find contours and filter on contour area to remove the small noise:

[small noise removed]

Here are all the removed noise particles highlighted in green:

[removed noise particles in green]

Result:

[final cleaned image]

Code

import cv2

# Load image, grayscale, and Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Determine average contour area
average_area = [] 
cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    area = w * h
    average_area.append(area)

average = sum(average_area) / max(len(average_area), 1)  # guard against a blank image

# Remove large lines if contour area is 5x bigger than average contour area
cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    area = w * h
    if area > average * 5:  
        cv2.drawContours(thresh, [c], -1, (0,0,0), -1)

# Dilate with vertical kernel to connect characters
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2,5))
dilate = cv2.dilate(thresh, kernel, iterations=3)

# Remove small noise if contour area is smaller than 4x average
cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    area = cv2.contourArea(c)
    if area < average * 4:
        cv2.drawContours(dilate, [c], -1, (0,0,0), -1)

# Bitwise mask with input image
result = cv2.bitwise_and(image, image, mask=dilate)
result[dilate==0] = (255,255,255)

cv2.imshow('result', result)
cv2.imshow('dilate', dilate)
cv2.imshow('thresh', thresh)
cv2.waitKey()

Note: Traditional image processing is limited to thresholding, morphological operations, and contour filtering (contour approximation, area, aspect ratio, or blob detection). Since input images can vary based on character text size, finding a singular solution is quite difficult. You may want to look into training your own classifier with machine/deep learning for a dynamic solution.
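
One of the comments below mentions trying a median/IQR approach. As a sketch of that variation (my illustration, not the answer's method, with hand-picked areas), a Tukey-style fence on the bounding-rect areas is less sensitive than a plain mean, which the long lines themselves inflate:

```python
import numpy as np

def area_fences(areas, k=1.5):
    """Tukey-style fences on bounding-rect areas: values outside
    [q1 - k*iqr, q3 + k*iqr] are treated as noise or line fragments."""
    q1, q3 = np.percentile(areas, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Mostly character-sized areas, plus one long line and one speck
areas = [110, 120, 125, 130, 135, 140, 150, 2400, 4]
lo, hi = area_fences(areas)
keep = [a for a in areas if lo <= a <= hi]   # characters survive, outliers go
```

As the comment thread shows, even this gave mixed results in practice, so it is a starting point rather than a fix.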

nathancy
    In case of a bigger font wouldn't this delete text too? – K41F4r Dec 03 '19 at 23:31
  • Yes it may, so you would have to adjust the threshold area value. For a more dynamic approach an idea is to determine the average character area then use that as the threshold – nathancy Dec 04 '19 at 00:01
  • Seems to be too specific to the example, using the average area will still delete the text a lot of the time which worsens the result for OCR – K41F4r Dec 05 '19 at 10:52
  • Do you have another example input image you could add to the post? – nathancy Dec 05 '19 at 20:47
  • @C493d check the update, it now uses dynamic character area to determine which noise contours to remove so it should work with images where the characters have larger font. – nathancy Dec 05 '19 at 22:00
  • I tried the same (median, IQR) with mixed results. The changes you made delete text when taking these examples one at a time. – K41F4r Dec 05 '19 at 23:46
    Finding a solution that works in all situations using traditional image processing techniques is quite difficult. You may want to look into training your own classifier using deep learning. Good luck! – nathancy Dec 06 '19 at 00:55