I am preprocessing images for Tesseract OCR and need to remove the background noise without damaging the text.

Example input image

This is an example image

I have tried median blurring and removing small connected components (see "How do I remove the dots / noise without damaging the text?"). The problem with connected components is that some of the noise forms components larger than the minus sign, so I cannot remove that noise without also removing the minus sign. Any suggestions on how to move forward?

nathancy
ssam54932
  • You can try the morphological opening transformation (erosion followed by dilation) with a kernel that has 1s in the middle rows and 0s in the top and bottom rows. The minus sign will not be removed because the transformation is applied horizontally. Read this: https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_morphological_ops/py_morphological_ops.html#morphological-ops – Victor Ruiz Jul 28 '19 at 13:09
  • Try different kernel sizes, for instance 5x5, 7x7, or 9x9. – Victor Ruiz Jul 28 '19 at 13:12
  • Thanks for the quick response! I found a way using erosion followed by removal of small connected components – ssam54932 Jul 28 '19 at 21:26

1 Answer

Since your image is already black and white, simple thresholding followed by a morphological transformation is enough to filter it. If the input were not black and white, you could first smooth it with a blur such as cv2.medianBlur() or cv2.GaussianBlur() as a preprocessing step. You can then apply morphological operations with various kernel sizes, or construct custom kernels with cv2.getStructuringElement(). Generally, a larger kernel (7x7 or 9x9) removes more noise but also removes more of the detail you want to keep, while a smaller kernel (3x3 or 5x5) preserves detail but leaves more noise behind. Take a look at this answer for colored captchas.


Threshold


Morph open


Invert image for Tesseract


Result

-63 164

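import cv2
import pytesseract

# Path to the Tesseract executable (Windows install location; adjust for your system)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load as grayscale, then inverse-threshold so text and noise become white on black
image = cv2.imread('1.png', cv2.IMREAD_GRAYSCALE)
thresh = cv2.threshold(image, 150, 255, cv2.THRESH_BINARY_INV)[1]

# Morphological opening (erosion then dilation) removes white regions
# too small to contain the 5x5 kernel
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)

# Invert back to black text on a white background for Tesseract
result = 255 - opening
cv2.imshow('thresh', thresh)
cv2.imshow('opening', opening)
cv2.imshow('result', result)

print(pytesseract.image_to_string(result))
cv2.waitKey()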
nathancy