Python OCR: ignore signatures in documents

Question

I'm trying to do OCR of a scanned document which has handwritten signatures in it. See the image below.

My question is simple: is there a way to extract the names of the people with OCR while ignoring the signatures? When I run Tesseract OCR, it fails to retrieve the names. I tried grayscaling, blurring, and thresholding with the code below, but without luck. Any suggestions?

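import cv2

image = cv2.imread(file_path)  # file_path: path to the scanned document
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
image = cv2.GaussianBlur(image, (5, 5), 0)
# Otsu picks the threshold automatically; BINARY_INV makes the ink white on black
image = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]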

Sandipan Dey · Answer 1 · 2018-08-29T09:02:19.420

You can use scikit-image's Gaussian filter to blur the thin lines first (with an appropriate sigma), then binarize the image (e.g., with a thresholding function), and then apply morphological operations (such as remove_small_objects or opening with an appropriate structuring element) to remove most of the signature. After that, try classifying the characters with a sliding window (assuming the classifier has already been trained on blurred characters like those in the test image). The following shows an example.

import matplotlib.pyplot as plt
from skimage.color import rgb2gray
from skimage.filters import gaussian, threshold_minimum
from skimage.io import imread
from skimage.morphology import binary_opening, square

# Blur so that thin signature strokes fade relative to the thicker printed text
im = gaussian(rgb2gray(imread('lettersig.jpg')), sigma=2)
# Binarize; the comparison already yields a boolean array
thresh = threshold_minimum(im)
im = im > thresh
# Morphological opening with a 3x3 square to clean up small artifacts
im1 = binary_opening(im, square(3))
plt.figure(figsize=(20, 20))
plt.imshow(im1)
plt.axis('off')
plt.show()

[EDIT]: Use Deep Learning Models

Another option is to pose this as an object detection problem in which the letters are the objects. We can use deep learning: CNN, RNN, or Fast R-CNN models (with TensorFlow/Keras) for object detection, or a YOLO model (refer to this article on car detection with a YOLO model).

in that case we need to train the classifier with noisier images — Sandipan Dey, Aug 28 '18 at 18:45

score 1 · Answer 2 · answered Aug 27 '18 at 19:34

I assume the input pictures are grayscale; otherwise, the different color of the signature ink might help distinguish it from the printed text.

The problem here is that your training set, I guess, contains almost only 'normal' letters without the disturbance of a signature, so naturally the classifier won't work on letters with signature ink on them. One way forward is to extend the training set with letters of this type. Of course, extracting and labeling these letters one by one is quite a job.

You can use real letters with different signatures on them, but it might also be possible to generate similar letters artificially. You just need different letters with different snippets of signatures overlaid on them. This process could be automated.
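As a rough illustration of automating this, the sketch below pastes a signature-like stroke at a random position over a letter image. Everything here is an assumption for illustration: the function name overlay_ink is hypothetical, both images are taken to be uint8 grayscale with dark ink on white paper, the stroke snippet must be no larger than the letter image, and the arrays are synthetic placeholders rather than real data.

```python
import numpy as np


def overlay_ink(letter: np.ndarray, stroke: np.ndarray,
                rng: np.random.Generator) -> np.ndarray:
    """Paste a stroke snippet at a random position over a letter image.

    Taking the per-pixel minimum keeps the darker of the two inks,
    which mimics a pen line drawn across printed text.
    """
    h, w = letter.shape
    sh, sw = stroke.shape
    top = int(rng.integers(0, h - sh + 1))
    left = int(rng.integers(0, w - sw + 1))
    out = letter.copy()
    region = out[top:top + sh, left:left + sw]
    out[top:top + sh, left:left + sw] = np.minimum(region, stroke)
    return out


# Placeholder "letter": a black vertical bar on white paper.
letter = np.full((32, 32), 255, dtype=np.uint8)
letter[8:24, 14:18] = 0

# Placeholder "signature snippet": a thin gray horizontal line.
stroke = np.full((5, 20), 255, dtype=np.uint8)
stroke[2, :] = 60

rng = np.random.default_rng(0)
augmented = [overlay_ink(letter, stroke, rng) for _ in range(4)]
```

Each augmented image keeps the label of the original letter, so the labeling effort is limited to the clean letters and the signature snippets.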

score 1 · Answer 3 · answered Aug 28 '18 at 07:24

You can try preprocessing the image with morphological operations.

You can try opening to remove the thin lines of the signature. The problem is that it may remove the punctuation as well.

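import cv2

image = cv2.imread(file_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Note: MORPH_OPEN removes thin *bright* structures, so for dark ink on light
# paper, invert the image first (e.g. cv2.bitwise_not) or use cv2.MORPH_CLOSE
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (5, 5))
image = cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)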

You may have to alter the kernel size or shape. Just try different combinations.

K P · Answer 4 · score 0 · 2018-08-28T10:21:47.297

You can try other OCR providers for the same task, for example Google Cloud Vision: https://cloud.google.com/vision/. You can upload an image there and try it for free.

The API returns a response from which you can extract the text you need. Documentation for extracting that text is on the same webpage.

Check out my own answer from when I faced the same problem; it will help you fetch that text: Convert Google Vision API response to JSON
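As a rough sketch of the extraction step, the snippet below pulls the detected text out of a JSON reply shaped like the Vision REST response, where each entry under "responses" carries a "textAnnotations" list whose first element holds the full detected text. The helper name extract_full_text is hypothetical, and the sample reply with its names is made up for illustration, not real API output.

```python
import json


def extract_full_text(response_json: str) -> str:
    """Return the full detected text from a Vision-style JSON reply."""
    payload = json.loads(response_json)
    texts = []
    for item in payload.get("responses", []):
        annotations = item.get("textAnnotations", [])
        if annotations:
            # The first annotation contains the whole detected text;
            # the following ones are the individual words.
            texts.append(annotations[0].get("description", ""))
    return "\n".join(texts)


# Made-up sample reply (assumption: same shape as a real response).
sample = json.dumps({
    "responses": [{
        "textAnnotations": [
            {"description": "John Smith\nJane Doe\n"},
            {"description": "John"},
            {"description": "Smith"},
            {"description": "Jane"},
            {"description": "Doe"},
        ]
    }]
})

full_text = extract_full_text(sample)
names = full_text.splitlines()
```

Whether the service actually ignores the signatures on a given scan still has to be checked on the real document.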

edited Aug 28 '18 at 10:21

answered Aug 28 '18 at 07:14


Tesseract OCR is not broken and still maintained. You just have to know what it can do and what not. Tesseract OCR needs a good segmented image where the characters are visible without noise. The last commit to the Tesseract repo is not even a day old... – Mailerdaimon Aug 28 '18 at 07:29
How does this answer the question at all? Care to explain? – Jeru Luke Aug 29 '18 at 10:54
He asked if there is any way to still extract text using OCR while ignoring the signature, so I gave him one. – K P Aug 29 '18 at 12:19
