How do I remove highlight from a printed textbook in Python?

Question

I'm working on extracting the highlighted text from a textbook. I have already located the highlight and extracted the text inside it. To deal with the highlight, I converted the image to grayscale and used Otsu thresholding to remove the highlighted background color. This works well when the highlight is a light color such as yellow or green, but when the highlight is a dark color the thresholding fails and I get a black background covering most of the text, which hinders OCR.

I have tried normalising the brightness but it does not seem to work.

What I need is some way to determine the foreground and background color and then remove the background color. Or I need some way to dynamically threshold the image to get black text and white background.

        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        normalized_gray = cv2.equalizeHist(gray)
        (thresh, processed_image) = cv2.threshold(normalized_gray, 127, 255, cv2.THRESH_OTSU)

The test image: https://ibb.co/856YtMx

Some test result:

When I run equalizeHist before thresholding: https://ibb.co/HT0jpKW

When I run equalizeHist after thresholding: https://ibb.co/ZXSz97J

When I use a binary threshold, the text is washed out: https://ibb.co/DLXywXz

user898678 · Answer 1 · 2019-08-22T19:57:56.393

0

Something like this should work:

import cv2
import numpy as np

image = cv2.imread('photo-2019-08-12-12-44-59.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# adjust contrast
gray_contract = cv2.multiply(gray, 1.5)

# create a kernel for the erode
kernel = np.ones((2, 2), np.uint8)
img_eroded = cv2.erode(gray_contract, kernel, iterations=1)

# binarize with otsu
(thresh, otsu) = cv2.threshold(img_eroded, 127, 255,
                               cv2.THRESH_BINARY+cv2.THRESH_OTSU)

You can also have a look at the post "How to remove shadow from scanned images using OpenCV".

edited Aug 22 '19 at 19:57

answered Aug 22 '19 at 19:39


You should include a before image if you are going to include an after image. – Dan D. Aug 22 '19 at 21:08
@DanD.: I don't understand what you are trying to say with your comment. – user898678 Aug 23 '19 at 06:49
The contrast adjustment gray_contract = cv2.multiply(gray, 1.5) got rid of the background for this single image only, because 1.5 is a static value. Is there a way to set this value dynamically? For example, I need about 2.2 for this image: https://ibb.co/Tbx78B3 – tired_coder Aug 23 '19 at 06:51
It is easier to tell if a transform does the right thing if one can see both the input and output. – Dan D. Aug 23 '19 at 13:44
@DanD.: I used the input (the test image) provided in the question. I see no reason to post it here again. – user898678 Aug 23 '19 at 18:12
@tired_coder: Try a routine that normalizes the image and corrects uneven illumination. For example, have a look at the leptonica program livre_adapt (https://github.com/DanBloomberg/leptonica/blob/master/prog/livre_adapt.c). Here is an example of what can be achieved with it: https://ibb.co/mFMzTW6 – user898678 Aug 25 '19 at 18:22
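One possible way to choose the multiplier per image, as asked in the comments above, is to derive it from the image itself. This is a rough sketch and not what livre_adapt does: it assumes the crop is mostly highlighted background, so that the median gray level approximates the highlight color. The helper names auto_gain and apply_gain and the synthetic crop are hypothetical.

```python
import numpy as np


def auto_gain(gray, percentile=50.0):
    """Return a multiplier that maps the chosen brightness percentile to 255."""
    level = float(np.percentile(gray, percentile))
    return 255.0 / max(level, 1.0)   # guard against division by zero


def apply_gain(gray, gain):
    """Scale a uint8 image, round, and saturate the result to the 0..255 range."""
    scaled = np.rint(gray.astype(np.float32) * gain)
    return np.clip(scaled, 0, 255).astype(np.uint8)


# Synthetic highlighted crop: background at gray level 116, text strokes at 30.
crop = np.full((40, 100), 116, np.uint8)
crop[15:25, 10:90:6] = 30

gain = auto_gain(crop)              # 255 / 116, roughly 2.2
stretched = apply_gain(crop, gain)  # background becomes white, text stays dark
```

If the crop also contains a lot of unhighlighted paper, the median no longer tracks the highlight, and a local method is likely the better choice.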

lucians · Answer 2 · 2019-09-04T17:38:20.470

Adaptive threshold is what you need here.

Below is the code that produced my output; the parameters can be fine-tuned.

import cv2
import numpy as np

img = cv2.imread("high.jpg")

img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

gaus = cv2.adaptiveThreshold(img_gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 20)

cv2.imshow("Gaussian", gaus)
cv2.waitKey(0)

cv2.imwrite('output.png', gaus)

UPDATE

For the second image you posted, I changed the parameters of the adaptiveThreshold function:

gaus = cv2.adaptiveThreshold(img_gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 8)
