I'm working on extracting the highlighted text from a textbook. I have already located the highlights and extracted the text inside them. To deal with the highlight color, I convert the image to grayscale and use Otsu thresholding to remove the highlighted background. This works well when the highlight is a light color such as yellow or green, but when the highlight is a dark color the thresholding fails and I get a black background covering most of the text, which hinders the OCR.
I have tried normalising the brightness, but it does not seem to help.
What I need is either a way to determine the foreground and background colors and then remove the background color, or a way to threshold the image dynamically so that I end up with black text on a white background.
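import cv2
# My current attempt (image is a BGR array):
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
normalized_gray = cv2.equalizeHist(gray)
# With THRESH_OTSU the threshold is computed automatically, so the value passed (0) is ignored.
thresh, processed_image = cv2.threshold(normalized_gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)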
The test image: https://ibb.co/856YtMx
Some test results:
When I run equalizeHist before thresholding: https://ibb.co/HT0jpKW
When I run equalizeHist after thresholding: https://ibb.co/ZXSz97J
When I use a plain binary threshold, the text is wiped out: https://ibb.co/DLXywXz