I'm trying to do an Arabic OCR using Tesseract, but the OCR doesn't work unless the letters are filled with black color. How do I fill the gaps after Canny edge detection?

Here is a sample image and sample code: [image]

import tesserocr
from PIL import Image
import pytesseract
import matplotlib.pyplot as plt
import cv2
import imutils
import numpy as np

image = cv2.imread(r'c:\ahmed\test3.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

gray = cv2.bilateralFilter(gray,30,40,40)
#gray = cv2.GaussianBlur(gray,(1,1), 0)
gray = cv2.fastNlMeansDenoising(gray, None, 4, 7, 21)

image = cv2.adaptiveThreshold(gray,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,11,2)
k = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 1))

blur = cv2.medianBlur(image,3)
erode = cv2.erode(blur, k)
dilat = cv2.dilate(erode,k)
cv2.imshow("gray", dilat)

#cv2.imshow("dilation", img_dilation)
#thresh = cv2.Canny(thresh, 70, 200)

#crop_img = gray[215:215+315, 783:783+684]
#cv2.imshow("cropped", crop_img)

#resize = imutils.resize(blur, width = 460)
#cv2.imshow("resize", resize)

text = pytesseract.image_to_string(dilat, lang='ara')
print(text)
with open(r"c:\ahmed\file.txt", "w", encoding="utf-8") as myfile:
    myfile.write(text)
cv2.waitKey(0)

Result: [image]

This is a sample image that won't work with either thresholding or Canny.

Cris Luengo
chris burgees

1 Answer

In this case, because the text is black, it is best to simply find all the black pixels.

One very simple way to accomplish this using NumPy is as follows:

import matplotlib.pyplot as pp
import numpy as np

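# Load the image as an array of shape (height, width, 3)
image = pp.imread(r'/home/cris/tmp/Zuv3p.jpg')
# True where all three channels are below 100, i.e. where the pixel is dark
bin = np.all(image<100, axis=2)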

This finds all pixels where all three channels are below a value of 100. I picked the threshold of 100 more or less arbitrarily; there are probably better ways to pick a threshold. :)
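
To make the per-channel test concrete, here is a minimal sketch on a tiny synthetic image. The pixel values and the name `mask` are invented for illustration; only `np.all(image < 100, axis=2)` comes from the answer. It assumes 8-bit values in the range 0 to 255. Matplotlib's `imread` returns floats in [0, 1] for PNG files, in which case a threshold of 100 would need rescaling.

```python
import numpy as np

# Synthetic 2x2 RGB image (uint8); the values are made up for this example.
image = np.array(
    [[[10, 20, 30], [250, 250, 250]],   # dark pixel, white pixel
     [[90, 99, 120], [0, 0, 0]]],       # blue channel too bright, black pixel
    dtype=np.uint8,
)

# Compare every channel against the threshold, then require that all
# three channels of a pixel pass (a logical AND along the channel axis).
mask = np.all(image < 100, axis=2)
print(mask)
```

The pixel `[90, 99, 120]` is rejected because its third channel is not below 100, even though the other two are.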


Notes:

1- When working with color input, converting to a gray-value image as the first step is usually a bad idea. This throws away a lot of information. Sometimes it's appropriate, but in this case it is better not to.

2- Edge detection is really nice, but is usually the wrong approach. Use edge detection when you need to find edges. Use something else when you don't want just the edges.
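
Note 1 can be illustrated with a small sketch. The two pixel values below are hypothetical, chosen so that a saturated green and a mid gray end up with the same gray value under the luma weights OpenCV documents for its gray conversion. The helper names `to_gray` and `looks_green`, and the limits inside `looks_green`, are assumptions made for this example only.

```python
import numpy as np

# Luma weights for R, G, B as documented for OpenCV's RGB/BGR-to-gray conversion
weights = np.array([0.299, 0.587, 0.114])

# Hypothetical pixel values, invented for this illustration (RGB order)
green_text = np.array([0, 150, 0], dtype=np.uint8)
gray_backdrop = np.array([88, 88, 88], dtype=np.uint8)

def to_gray(pixel):
    """Weighted sum of the channels, rounded to the nearest integer."""
    return int(round(float(pixel @ weights)))

def looks_green(pixel):
    """Per-channel test: strong green, weak red and blue (assumed limits)."""
    r, g, b = (int(v) for v in pixel)
    return g > 120 and r < 100 and b < 100

print(to_gray(green_text), to_gray(gray_backdrop))          # same gray value
print(looks_green(green_text), looks_green(gray_backdrop))  # color still differs
```

After gray conversion the two pixels cannot be told apart by any threshold, while a test on the separate channels still separates them.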


Edit: If for some reason np.all complains about the data type (it doesn't for me), you should be able to convert its input to the right type:

bin = np.all(np.array(image<100, dtype=bool), axis=2)

or maybe

bin = np.all(np.array(image<100, dtype=np.uint8), axis=2)
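
Another possibility, offered as an untested guess about the asker's setup: the error quoted in the comments ("mat data type = 0 is not supported") resembles what OpenCV functions report when handed a boolean array, and `np.all` returns a boolean array regardless of the input type. If that is the cause, converting the result to 8-bit before display or OCR may help. The sketch below uses a synthetic mask in place of `bin`; the name `ocr_input` is made up here.

```python
import numpy as np

# Stand-in for the boolean result of np.all(image < 100, axis=2)
mask = np.array([[True, False],
                 [False, True]])

# Build an 8-bit image with black text on a white background:
# text pixels (True) become 0, everything else becomes 255.
ocr_input = np.where(mask, 0, 255).astype(np.uint8)
print(ocr_input.dtype, ocr_input.tolist())
```

An array like `ocr_input` has the type that `cv2.imshow` and `pytesseract.image_to_string` in the question's code are given after thresholding, so it could presumably be passed to them in the same way.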
Cris Luengo
  • Yes, but there is green text above which needs to be captured by OCR too... Can you write a full pipeline? – chris burgees Nov 26 '18 at 21:15
  • @chrisburgees: I'm sure you can take it from here. Find some thresholds that find pixels of the appropriate color. – Cris Luengo Nov 26 '18 at 21:16
  • but how would I use thresholding without converting to grayvalue ? – chris burgees Nov 26 '18 at 21:17
  • please try to propose a full pipeline with PyDip :) – chris burgees Nov 26 '18 at 21:19
  • The trick is to threshold each channel independently. Each channel can be seen as a grey-value image. Then you combine the threshold results. That is what I did here: each channel is thresholded with `<100`, then I combine the channels with logical AND (which is what `np.all` does). – Cris Luengo Nov 26 '18 at 21:23
  • I get "TypeError: mat data type = 0 is not supported" when running the np.all. Second question: how would I find the best threshold? What if the image is blurry, or there is a flash reflection? – chris burgees Nov 26 '18 at 21:28
  • @chrisburgees: I don't see this error. I'm using Python3, Matplotlib and NumPy as installed with Ubuntu 16, which is likely several years old. – Cris Luengo Nov 26 '18 at 21:33
  • @chrisburgees: Finding a threshold is not trivial, and especially if you need it to work across many different lighting conditions. I don't have good suggestions there for you, sorry. Blurriness would not affect the threshold you select, but will affect the quality of the OCR. You should try to avoid that. You should certainly also try to avoid reflections. You can detect those because you get pixels that are over-exposed. Hopefully you can reject the image if the imaging conditions are bad. – Cris Luengo Nov 26 '18 at 21:36
  • Can you fix the syntax of np.all ? I want to imshow it please ? – chris burgees Nov 26 '18 at 21:40
  • @chrisburgees: Edited the answer with my best guess for how to fix your error. I don't get this error, so cannot test the solution. The code as written works for me. – Cris Luengo Nov 26 '18 at 21:50
  • YOUR SOLUTION WORKS AMAZINGLY! Thanks so much! But I have one issue: I can't get the threshold of the green text – chris burgees Nov 26 '18 at 23:12
  • Can you tell me methods for finding the best threshold ? Forget about many different lighting conditions ? – chris burgees Dec 04 '18 at 14:22
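
The suggestion in the comments (threshold each channel independently, then combine the results) could be extended to the green text roughly as follows. This is only a sketch on a synthetic image: the pixel values and the green limits are assumptions that would need tuning on the real image, and the channel order is RGB as returned by Matplotlib (OpenCV's `imread` returns BGR instead).

```python
import numpy as np

# Synthetic 1x4 RGB image; all values are made up for this example.
image = np.array([[[20, 20, 20],       # black text
                   [30, 160, 40],      # green text
                   [240, 240, 240],    # white background
                   [120, 120, 120]]],  # gray background
                 dtype=np.uint8)

r, g, b = image[..., 0], image[..., 1], image[..., 2]

# Black text: all three channels dark (the threshold used in the answer).
black = np.all(image < 100, axis=2)

# Green text: green channel bright, red and blue dark (assumed limits).
green = (r < 100) & (g > 120) & (b < 100)

# A pixel counts as text if it matches either color.
text = black | green
print(text)
```

Each comparison is an ordinary gray-value threshold on one channel; `&` combines them with a logical AND and `|` merges the two color classes into one mask.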