
[input image: scanned text peppered with small black dots]

I'm processing the images with OpenCV and Python, and I need to remove the dots/noise from the image.
I tried dilation, which made the dots smaller, but it also damaged the text. I also tried a loop of dilating twice and eroding once, but this did not give satisfactory results either.
Is there some other way I can achieve this?
Thank you :)

EDIT:
I'm new to image processing. My current code is as follows:

import cv2
import numpy as np

image = cv2.imread(file)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
kernel = np.ones((2, 2), np.uint8)
gray = cv2.GaussianBlur(gray, (5, 5), 0)   # blurred twice to suppress the dots
gray = cv2.GaussianBlur(gray, (5, 5), 0)
gray = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
gray = cv2.erode(gray, kernel, iterations=1)
gray = cv2.dilate(gray, kernel, iterations=1)
cv2.imwrite(file.split('.')[0] + "_process.TIF", gray)

EDIT 2:
I tried median blurring, and it has solved 90% of the issue. I had been using Gaussian blurring all this while.
Thank you
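
For reference, a minimal sketch of that swap, replacing the two Gaussian blurs with a single median blur (the 3x3 aperture is an assumption to tune; cv2.medianBlur requires an odd ksize):

import cv2

gray = cv2.imread(file, cv2.IMREAD_GRAYSCALE)  # 'file' as in the snippet above
# the median of each neighborhood discards isolated extreme pixels outright,
# so salt-and-pepper dots vanish instead of being smeared across the page
gray = cv2.medianBlur(gray, 3)
gray = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)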

  • Maybe median filter? – api55 Feb 08 '18 at 09:15
  • @api55 Damn. That made a humongous change! (There is a little left though.) I've been using Gaussian blurring all this while. Thanks! – Praveen Kumar Feb 08 '18 at 09:43
  • A median filter is usually good for this kind of noise, where the surrounding pixels are white (salt-and-pepper noise). The little noise that is left you will probably have to filter in other ways, like eroding/dilating (a small sketch follows this thread); just remember that blurring (with a Gaussian) may make the points bigger, won't have a good effect, and will also blur the letters. – api55 Feb 08 '18 at 09:58
  • Maybe you should read a book on image processing fundamentals... at least the first few chapters... otherwise you just waste your time trying to solve problems the wrong way. – Piglet Feb 08 '18 at 10:15
  • @Piglet Will do! – Praveen Kumar Feb 08 '18 at 11:25
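
Following api55's erode/dilate suggestion above, a hedged sketch of a morphological opening for the leftover specks (the 2x2 kernel and the file name are assumptions; size the kernel just under the stroke width of the text):

import cv2
import numpy as np

binary = cv2.imread('thresholded_page.png', cv2.IMREAD_GRAYSCALE)  # hypothetical file

# opening (erode then dilate) removes bright specks smaller than the kernel;
# this page is white with dark dots, so invert first to make the dots and text
# the bright foreground, then invert back afterwards
inv = cv2.bitwise_not(binary)
kernel = np.ones((2, 2), np.uint8)
cleaned = cv2.bitwise_not(cv2.morphologyEx(inv, cv2.MORPH_OPEN, kernel))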

1 Answer


How about removing small connected components using connectedComponentsWithStats?

import cv2
import numpy as np

img = cv2.imread('path_to_your_image', 0)
# invert so the dark text and dots become the bright foreground to label
_, blackAndWhite = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)

nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(blackAndWhite, None, None, None, 8, cv2.CV_32S)
sizes = stats[1:, -1]  # get the CC_STAT_AREA of each component, skipping the background label 0
img2 = np.zeros(labels.shape, np.uint8)

for i in range(0, nlabels - 1):
    if sizes[i] >= 50:   # keep only components big enough to be text; filters the small dotted regions
        img2[labels == i + 1] = 255

res = cv2.bitwise_not(img2)  # invert back to dark text on a white background

cv2.imwrite('res.png', res)

[result image: the same text with the dots removed]

And here is a C++ example:

Mat invBinarized;

// invert so the dark text and dots become the bright foreground to label
threshold(inputImage, invBinarized, 127, 255, THRESH_BINARY_INV);
Mat labels, stats, centroids;

auto nlabels = connectedComponentsWithStats(invBinarized, labels, stats, centroids, 8, CV_32S, CCL_WU);

Mat imageWithoutDots(inputImage.rows, inputImage.cols, CV_8UC1, Scalar(0));
for (int i = 1; i < nlabels; i++) {                   // label 0 is the background, skip it
    if (stats.at<int>(i, CC_STAT_AREA) >= 50) {       // keep only components big enough to be text
        for (int j = 0; j < (int)imageWithoutDots.total(); j++) {
            if (labels.at<int>(j) == i) {
                imageWithoutDots.data[j] = 255;
            }
        }
    }
}
cv::bitwise_not(imageWithoutDots, imageWithoutDots);  // invert back to dark text on white

EDIT:
See also

OpenCV documentation for connectedComponentsWithStats

How to use openCV's connected components with stats in python

Example from Learning OpenCV 3

Dmitrii Z.
  • @DmitriiZ works brilliantly, but is extremely slow. Do you know of a more numpy-esque way (no for loop) that might be faster, or should I ask it as a question? (A loop-free sketch follows this thread.) – jtlz2 Aug 01 '19 at 17:38
  • @jtlz2 I couldn't find any reasonable way to make the Python code much faster; maybe you can try your luck on codereview.stackexchange.com, or maybe even here, I guess there should be a way to make that question on-topic. That said, I wouldn't use a connectedComponentsWithStats-based approach on a non-scaled image if performance really matters, because you are in fact recreating the whole image. – Dmitrii Z. Aug 05 '19 at 11:04
  • @DmitriiZ. Thanks for the reply. I spent a while on it; numba was the only way I found to speed it up in the end, i.e. to parallelize, since all routes required npix * npix * nregions operations. Thanks for the quick response, much appreciated. – jtlz2 Aug 05 '19 at 15:15
  • It works for my image file of text, but it's slow! I also had to add a flag to the read: image1 = cv.imread(fname, cv.IMREAD_GRAYSCALE). Perhaps because my image is black text on a white background. – QuentinJS Jan 19 '22 at 23:51
  • @QuentinJS These days people tend to train neural networks to do cleaning tasks, which perform reasonably well. – Dmitrii Z. Jan 20 '22 at 12:24
  • Can you provide any suggestions/links? I have thousands of legal docs and it can take up to 30 min per document. Most docs are pretty clean, but some are littered with small black dots. I can pose it as a new question. – QuentinJS Jan 20 '22 at 14:44
  • @QuentinJS Unless you're fluent in data analysis, I suggest hiring a freelancer. This is not an easy task and heavily depends on your dataset. – Dmitrii Z. Jan 20 '22 at 14:56
  • Unfortunately, all I have is me. Overall the PDF -> text process works; just the noise is an issue. – QuentinJS Jan 20 '22 at 15:18
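
For later readers chasing jtlz2's speed question above, here is one loop-free variant: my own sketch, not from the answer, which treats the label map as an index into a per-label lookup table, replacing the per-label boolean-mask passes with a single fancy-indexing pass. The names and the 50-pixel threshold mirror the Python snippet in the answer.

import cv2
import numpy as np

img = cv2.imread('path_to_your_image', 0)
_, blackAndWhite = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)
nlabels, labels, stats, _ = cv2.connectedComponentsWithStats(blackAndWhite, None, None, None, 8, cv2.CV_32S)

# output value per label id: the background (label 0) stays black,
# components at or above the area threshold become white
lut = np.zeros(nlabels, np.uint8)
lut[1:][stats[1:, cv2.CC_STAT_AREA] >= 50] = 255

# one vectorized lookup over the label map instead of a Python loop per label
res = cv2.bitwise_not(lut[labels])
cv2.imwrite('res.png', res)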