
I've gotten access to a lot of reports which are filled out by hand. One of the columns in the report contains a timestamp, which I would like to attempt to identify without going through each report manually.

I am playing with the idea of splitting the times, e.g. 00:30, into four digits, and running these through a classifier trained on MNIST to identify the actual timestamps.

When I manually extract the four digits in Photoshop and run these through an MNIST classifier, it works perfectly. But so far I haven't been able to figure out how to programmatically split the number sequences into single digits. I tried different types of contour finding in OpenCV, but it didn't work very reliably.

Any suggestions?

I've added a screenshot of some of the relevant columns in the reports.

Henrik Lied
  • What's the column you want to work on? – lucians May 14 '18 at 10:29
    Since the four digits lie between two horizontal lines, I'd suggest using a line detector and then extracting the area corresponding to the quadrilateral made by the combination. Then you can try running MNIST on these areas. Since you haven't posted examples of your manually cropped images or your programming language, it's difficult to suggest the code. – Rick M. May 14 '18 at 10:43
  • @Rick, I'm of the impression that most classifiers trained on MNIST only accept single digits as input. Am I wrong in that assertion? I do most of my work in Python, so if you have any examples or hints as to what I should look at, please let me know! – Henrik Lied May 14 '18 at 12:18
  • @Link Sorry if the included screenshot was difficult to understand. It's simply a collage of three different reports, but the same column. Just wanted to give you guys as many examples of the handwriting as I could. – Henrik Lied May 14 '18 at 12:20
  • Is the image you posted the original, or did you resize it? Because it's all in very poor resolution... – lucians May 15 '18 at 07:59

2 Answers


Breaking up text into individual characters is not as easy as it sounds at first. You can try to find some rules and manipulate the image by them, but there will be just too many exceptions. For example, you can try to find disjoint marks, but the fourth one in your image, 0715, has its "5" broken up into three pieces, and the 9th one, 17.00, has the two zeros overlapping.

You are very lucky with the horizontal lines - at least it's easy to separate different entries. But you have to come up with a lot of ideas related to semi-fixed character width, a "soft" disjointness rule, etc.
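One way to make such a "soft" rule concrete (a rough sketch, not from my project; plain numpy on a toy binarized strip where ink pixels are 1): split wherever the vertical projection profile drops to zero, i.e. on empty columns. Touching digits like the two zeros will still come out as one segment, which is exactly where the exceptions start.

```python
import numpy as np

def split_by_projection(binary):
    """Split a binarized line image (ink = 1) into character slices
    wherever a run of empty columns appears."""
    ink_per_col = binary.sum(axis=0)      # vertical projection profile
    is_gap = ink_per_col == 0             # columns with no ink at all
    segments, start = [], None
    for x, gap in enumerate(is_gap):
        if not gap and start is None:
            start = x                     # a character begins
        elif gap and start is not None:
            segments.append((start, x))   # a character ends
            start = None
    if start is not None:
        segments.append((start, binary.shape[1]))
    return segments

# toy example: two "digits" of ink separated by empty columns
img = np.zeros((5, 12), dtype=np.uint8)
img[:, 1:4] = 1    # first blob, columns 1-3
img[:, 7:10] = 1   # second blob, columns 7-9
print(split_by_projection(img))   # [(1, 4), (7, 10)]
```

In practice you'd threshold on "almost empty" columns instead of exactly zero, which is where the soft disjointness rule comes in.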

I did a project like that two years ago and we ended up using an external open source library called Tesseract. Here's an article on recognizing Roman numerals with it, reaching about 90% accuracy. You might also want to look into the Lipi Toolkit, but I have no experience with that.

You might also want to consider just training a network to recognize the four digits at once. So the input would be the whole field with the four handwritten digits and the output would be the four numbers, and you let the network sort out where the characters are. If you have enough training data, that's probably the easiest approach.

EDIT: Inspired by @Link's answer, I just came up with this idea you can give a try. Once you've extracted the area between the two lines, trim the image to get rid of the white space all around. Then make an educated guess about how big the characters are - maybe use the height of the area? Then create a sliding window over the image and run the recognition all the way across. There will most likely be four peaks, which would correspond to the four digits.
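The peak-picking part of that idea could look something like this (a hedged sketch; the scores array is a stand-in for the classifier's per-position confidence, and `digit_peaks` is a name I made up):

```python
import numpy as np

def digit_peaks(scores, window=2):
    """Given a per-position 'digit-likeness' score from a classifier
    slid across the strip, keep positions that are local maxima
    within +/- window columns and above the mean score."""
    peaks = []
    thresh = scores.mean()
    for x in range(len(scores)):
        lo, hi = max(0, x - window), min(len(scores), x + window + 1)
        if scores[x] >= thresh and scores[x] == scores[lo:hi].max():
            peaks.append(x)
    return peaks

# fake confidence curve with four bumps, as if the strip held 'hh:mm'
scores = np.array([0, 1, 5, 1, 0, 4, 1, 0, 6, 1, 0, 5, 0], dtype=float)
print(digit_peaks(scores))  # [2, 5, 8, 11]
```

With a real classifier you'd run the window at each column offset, take the max class probability as the score, and then crop around each peak.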

Peter Szoldan
  • Isn't it a bit difficult to do the last step you proposed? I'm asking because I'm interested in the process you'd suggest. Thanks. – lucians May 14 '18 at 13:18
  • Well it will require significant effort either way. The last one's difficulty is getting training data. If you can convince the people who gave you the sheets to type in a few thousand of those sheets, you're probably okay. If not, you should probably go with one of the libraries I mentioned above. – Peter Szoldan May 14 '18 at 14:18
  • Would doing image augmentation on the digits solve/reduce the problem? – lucians May 14 '18 at 15:07
  • Reduce yes, solve no. You can't unlink the two zeros, for example, can't think of any augmentation that would do that. BTW, added an idea to my answer inspired by your comment and answer. – Peter Szoldan May 14 '18 at 15:15
0

I would do something like this (no code as long as it is just an idea, you could test it to see if works):

  1. Extract each area for each group of numbers as Rick M. suggested above. So you will have many Kl [hour] rectangles as images.

  2. For each of these rectangles, extract each ROI (using OpenCV's contours feature). Delete the Kl part if you don't need it (you can get the ROI's dimensions with img.shape, and they all have more or less the same dimensions).

  3. Extract all digits using the same script used above. You can take a look at my questions/answers to find some pieces of code which do this. You will have a problem with the underline in some cases. Search about this on SO; there are a few solutions complete with code.

  4. Now, about splitting up. We know the ROIs are in hour format, so hh:mm (or 4 digits). A simple (and very rudimentary) solution for splitting characters which are attached to each other would be to split the ROI that has 2 digits inside in half. It's a crude solution, but it should perform well in your case because at most two digits are attached.

  5. Some digits will come out with "missing pieces". This can be mitigated by using some erosion/dilation/skeletonization.
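Step 4 above can be sketched like this (just an illustration, assuming bounding boxes as (x, y, w, h) tuples and a known expected digit width; `split_wide_rois` is a name I made up):

```python
import numpy as np

def split_wide_rois(boxes, expected_w):
    """Any bounding box much wider than one expected digit is assumed
    to hold two touching digits and is split down the middle."""
    out = []
    for (x, y, w, h) in boxes:
        if w > 1.5 * expected_w:
            half = w // 2
            out.append((x, y, half, h))          # left digit
            out.append((x + half, y, w - half, h))  # right digit
        else:
            out.append((x, y, w, h))
    return out

# three boxes: two single digits and one pair of touching zeros
boxes = [(0, 0, 20, 30), (25, 0, 20, 30), (50, 0, 40, 30)]
print(split_wide_rois(boxes, expected_w=20))
# [(0, 0, 20, 30), (25, 0, 20, 30), (50, 0, 20, 30), (70, 0, 20, 30)]
```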

Here you don't have letters, only numbers so MNIST should work well (not perfect, keep this in mind).

In short, extracting the data is not the hard task, but recognizing the digits will make you sweat a bit.

I hope I can provide some code to show the steps above as soon as possible.

EDIT - code

This is some code I made. Final output is this:

[output image]

The code works 100% with this image, so if something doesn't work for you, check your folders/paths/module installations.

Hope this helped.

import cv2
import numpy as np

# 1 - remove the vertical line on the left

img = cv2.imread('image.jpg', 0)
# gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(img, 100, 150, apertureSize=5)

lines = cv2.HoughLines(edges, 1, np.pi / 50, 50)
for rho, theta in lines[0]:
    a = np.cos(theta)
    b = np.sin(theta)
    x0 = a * rho
    y0 = b * rho
    x1 = int(x0 + 1000 * (-b))
    y1 = int(y0 + 1000 * (a))
    x2 = int(x0 - 1000 * (-b))
    y2 = int(y0 - 1000 * (a))

    cv2.line(img, (x1, y1), (x2, y2), (255, 255, 255), 10)

cv2.imshow('marked', img)
cv2.waitKey(0)
cv2.imwrite('image.png', img)


# 2 - remove horizontal lines

img = cv2.imread("image.png")
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_orig = cv2.imread("image.png")

img = cv2.bitwise_not(img)
th2 = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 15, -2)
cv2.imshow("th2", th2)
cv2.waitKey(0)
cv2.destroyAllWindows()

horizontal = th2
rows, cols = horizontal.shape

# inverse the image, so that lines are black for masking
horizontal_inv = cv2.bitwise_not(horizontal)
# perform bitwise_and to mask the lines with provided mask
masked_img = cv2.bitwise_and(img, img, mask=horizontal_inv)
# reverse the image back to normal
masked_img_inv = cv2.bitwise_not(masked_img)
cv2.imshow("masked img", masked_img_inv)
cv2.waitKey(0)
cv2.destroyAllWindows()

horizontalsize = int(cols / 30)
horizontalStructure = cv2.getStructuringElement(cv2.MORPH_RECT, (horizontalsize, 1))
horizontal = cv2.erode(horizontal, horizontalStructure, anchor=(-1, -1))
horizontal = cv2.dilate(horizontal, horizontalStructure, anchor=(-1, -1))
cv2.imshow("horizontal", horizontal)
cv2.waitKey(0)
cv2.destroyAllWindows()

# step1
edges = cv2.adaptiveThreshold(horizontal, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 3, -2)
cv2.imshow("edges", edges)
cv2.waitKey(0)
cv2.destroyAllWindows()

# step2
kernel = np.ones((1, 2), dtype="uint8")
dilated = cv2.dilate(edges, kernel)
cv2.imshow("dilated", dilated)
cv2.waitKey(0)
cv2.destroyAllWindows()

# [-2:] keeps this working in both OpenCV 3.x (3 return values) and 4.x (2)
ctrs, hier = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]

# sort contours
sorted_ctrs = sorted(ctrs, key=lambda ctr: cv2.boundingRect(ctr)[0])

for i, ctr in enumerate(sorted_ctrs):
    # Get bounding box
    x, y, w, h = cv2.boundingRect(ctr)

    # Getting ROI
    roi = img[y:y + h, x:x + w]

    # show ROI
    rect = cv2.rectangle(img_orig, (x, y), (x + w, y + h), (255, 255, 255), -1)

cv2.imshow('areas', rect)
cv2.waitKey(0)

cv2.imwrite('no_lines.png', rect)


# 3 - detect and extract ROI's

image = cv2.imread('no_lines.png')
cv2.imshow('i', image)
cv2.waitKey(0)

# grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2.imshow('gray', gray)
cv2.waitKey(0)

# binary
ret, thresh = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY_INV)
cv2.imshow('thresh', thresh)
cv2.waitKey(0)

# dilation
kernel = np.ones((8, 45), np.uint8)  # values set for this image only - need to change for different images
img_dilation = cv2.dilate(thresh, kernel, iterations=1)
cv2.imshow('dilated', img_dilation)
cv2.waitKey(0)

# find contours
# [-2:] keeps this working in both OpenCV 3.x (3 return values) and 4.x (2)
ctrs, hier = cv2.findContours(img_dilation.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]

# sort contours
sorted_ctrs = sorted(ctrs, key=lambda ctr: cv2.boundingRect(ctr)[0])

for i, ctr in enumerate(sorted_ctrs):
    # Get bounding box
    x, y, w, h = cv2.boundingRect(ctr)

    # Getting ROI
    roi = image[y:y + h, x:x + w]

    # show ROI
    # cv2.imshow('segment no:'+str(i),roi)
    cv2.rectangle(image, (x, y), (x + w, y + h), (255, 255, 255), 1)
    # cv2.waitKey(0)

    # save only the ROIs which contain valid information
    if h > 20 and w > 75:
        cv2.imwrite('roi/{}.png'.format(i), roi)  # the 'roi' folder must already exist

cv2.imshow('marked areas', image)
cv2.waitKey(0)

These are the next steps:

  1. Understand what I write ;). It's the most important step.

  2. Using pieces of the code above (especially step 3), you can delete the remaining Kl in the extracted images.

  3. Create a folder for each image and extract the digits.

  4. Using MNIST, recognize each digit.
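For step 4, a rough sketch of getting each extracted ROI into the shape MNIST classifiers expect: 28x28, digit centered, floats in [0, 1]. This uses a plain numpy nearest-neighbour resize to stay dependency-free; in practice you'd use cv2.resize. The function name is mine, not from any library:

```python
import numpy as np

def to_mnist_input(roi):
    """Rescale a cropped digit (white ink on black, uint8) to the
    28x28 centered float format MNIST-trained models expect."""
    h, w = roi.shape
    # fit the digit into a 20x20 box, as in the original MNIST pipeline
    scale = 20.0 / max(h, w)
    nh = max(1, int(round(h * scale)))
    nw = max(1, int(round(w * scale)))
    # nearest-neighbour resize: pick source rows/columns by index
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    small = roi[np.ix_(rows, cols)].astype(np.float32) / 255.0
    # paste into the middle of a 28x28 black canvas
    canvas = np.zeros((28, 28), dtype=np.float32)
    top, left = (28 - nh) // 2, (28 - nw) // 2
    canvas[top:top + nh, left:left + nw] = small
    return canvas

digit = np.full((50, 30), 255, dtype=np.uint8)  # fake white-on-black digit
x = to_mnist_input(digit)
print(x.shape, x.max())  # (28, 28) 1.0
```

Then feed `x` (reshaped/batched as your model expects) to the MNIST classifier. Note that classifiers trained on MNIST assume white digits on a black background, so invert your thresholded ROI if needed.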

lucians