How to remove background noise in image without damaging text?

Question

I am preprocessing images for Tesseract OCR. I need help getting rid of the background noise without damaging the text.

Example input image

I have tried median blurring and removing small connected components (as suggested in "How do I remove the dots / noise without damaging the text?"). The problem with connected components is that some noise blobs are larger than the minus sign, so I cannot remove them by size without also removing the minus sign. Any suggestions on how to move forward?

You can try to apply the open morphological transformation: erosion followed by dilation using a kernel with 1s in the middle rows and zeros at top and bottom (the minus sign will not be removed because the transformation is "applied horizontally"). Read this: https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_morphological_ops/py_morphological_ops.html#morphological-ops — Victor Ruiz, Jul 28 '19 at 13:09
Thanks for the quick response! I found a way using erosion followed by removal of small connected components — ssam54932, Jul 28 '19 at 21:26

nathancy · Answer 1 · 2019-07-30T02:13:25.777

Since your image is only black and white, you can filter it with simple thresholding and morphological transformations. If your input image were not black and white, you could first smooth it with a blur such as cv2.medianBlur() or cv2.GaussianBlur() as a preprocessing step. Then you could perform morphological operations with various kernel sizes, or construct custom kernels with cv2.getStructuringElement(). Generally, a larger kernel (7x7 or 9x9) removes more noise but also removes more of the desired detail than a smaller kernel (3x3 or 5x5). The trade-off is between how much noise you remove and how much detail you preserve. Take a look at this answer for colored captchas.

Threshold

Morph open

Invert image for Tesseract

Result

-63 164

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

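# Load as grayscale, then threshold and invert so the text is white on black
image = cv2.imread('1.png', cv2.IMREAD_GRAYSCALE)
thresh = cv2.threshold(image, 150, 255, cv2.THRESH_BINARY_INV)[1]

# Morphological opening removes white specks that cannot contain the 5x5 kernel
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)

# Invert back to black text on a white background for Tesseract
result = 255 - opening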
cv2.imshow('thresh', thresh)
cv2.imshow('opening', opening)
cv2.imshow('result', result)

print(pytesseract.image_to_string(result))
cv2.waitKey()
