
I'm creating an OCR application. It extracts handwritten characters from a boxed section in a scanned or photographed printed form, and reads them using a CNN.

It successfully extracts characters using contours, but in some cases stray lines are also picked up as contours. These lines seem to be either plain noise or leftover pixels from the box border after the boxed section is cropped. The boxed section itself is cropped using contours.

Basically, it works when the form is scanned with a good scanner and saved in PNG format. Otherwise, it doesn't work as well. I need it to handle JPEG files and low-quality cameras/scanners too.

This is then more of a question of what possible techniques I can use theoretically.

I'd like to either remove the lines, or make the code ignore them.

I've tried:

  • "padding" the cropped boxed section by a negative number n, so that it instead removes n pixels from each side. This can't be pushed too far, though, as it also eats into the pixels of the characters.
  • using the morphological operation "close". Changing the kernel size makes almost no significant difference, though.
  • implementing a character-area to boxed-section-area ratio check. If the ratio of a retrieved contour's area to the boxed section's area is not within a set range, the contour is ignored.
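For reference, the area-ratio check in the last bullet can be sketched roughly as below. This is a minimal illustration, not my exact code: the function name `filter_by_area_ratio` and the `min_ratio`/`max_ratio` values are placeholders, and the areas are assumed to come from something like `cv2.contourArea`.

```python
def filter_by_area_ratio(contour_areas, box_area, min_ratio=0.01, max_ratio=0.6):
    """Return the indices of contours whose area, relative to the boxed
    section's area, lies inside [min_ratio, max_ratio].

    The thresholds are illustrative assumptions and would need tuning.
    """
    if box_area <= 0:
        raise ValueError("box_area must be positive")

    kept = []
    for index, area in enumerate(contour_areas):
        ratio = area / box_area
        # Using a ratio instead of an absolute area keeps the check
        # independent of the scan resolution.
        if min_ratio <= ratio <= max_ratio:
            kept.append(index)
    return kept


# Example: a speck, a plausible character, and a contour covering most of the box.
print(filter_by_area_ratio([50, 1500, 9000], box_area=10000))
```

One limitation of this check on its own: a long, thin strip can cover about as much area as a small character, so area alone may not separate the two.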

Here's what it looks like:

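[Images 1, 3, 4, 5: cropped boxed sections with the detected contours outlined in grey and labeled by index]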

The grey parts outline the detected contours. The numbers indicate the index of each contour, in the order they were detected. Notice that strips of lines are detected too. I want to get rid of these.

Besides the lines interfering with the model and making it spout nonsense while trying to interpret them, in some cases they also seem to cause this error:

ValueError: cannot reshape array of size 339 into shape (1,28,28,1)

Maybe I'll start with investigating this in the meantime.
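A note on that error: a (1,28,28,1) input needs 784 values, and 339 = 3 × 113, which would be consistent with a thin 3-pixel strip reaching the model without being resized. That is only an inference from the number. Below is a minimal sketch of padding a crop to a square and resizing it before the reshape. The name `to_model_input` is hypothetical, it assumes a 2-D grayscale crop with a zero (black) background, and it uses plain NumPy nearest-neighbour sampling where `cv2.resize` would normally be used.

```python
import numpy as np


def to_model_input(crop, size=28):
    """Pad a 2-D grayscale crop to a square, resize it to size x size with
    nearest-neighbour sampling, and reshape it to (1, size, size, 1)."""
    crop = np.asarray(crop, dtype=np.float32)
    if crop.ndim != 2 or crop.size == 0:
        raise ValueError("expected a non-empty 2-D grayscale crop")

    h, w = crop.shape
    side = max(h, w)

    # Centre the crop on a square canvas (assumes background pixels are 0)
    # so the aspect ratio of the character is preserved.
    square = np.zeros((side, side), dtype=np.float32)
    top, left = (side - h) // 2, (side - w) // 2
    square[top:top + h, left:left + w] = crop

    # Nearest-neighbour resize: pick `size` evenly spaced rows and columns.
    idx = (np.arange(size) * side / size).astype(int)
    resized = square[np.ix_(idx, idx)]

    return resized.reshape(1, size, size, 1)


# A 3 x 113 strip (339 pixels) no longer breaks the reshape.
print(to_model_input(np.ones((3, 113))).shape)
```

This only stops the crash. A strip that gets this far will still be classified as some character, so it still has to be filtered out beforehand.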

mashedpotatoes
  • Here are two potential solutions: 1) Preprocess the input image to remove random noise with morphological operations before extracting characters, so the noise will not be present. If you can't clean the input images, then 2) you can use contour filtering to filter the noise out and only extract the desired characters. One way is to use contour approximation to ensure that a "valid" contour has a length of four (a rectangle or square), which weeds out the strips of lines – nathancy Oct 08 '19 at 01:39
  • On 1: I've implemented [this](https://stackoverflow.com/questions/42065405/remove-noise-from-threshold-image-opencv-python) as preprocessing with the morphological operation "close" after. On 2: please clarify, I have contour approximation but it's for identifying boxed sections only, how would it help identify characters when most of them are not really rectangular? – mashedpotatoes Oct 08 '19 at 04:50
  • Ah, thanks. I misunderstood your earlier comment; what I did was use approxPolyDP and ignored contours with <= 2 points. It helped ignore most but there are still some pesky ones that get through and it sometimes ignores characters like "1" too. Adjusting the line segment length helped. I think your approach may have this problem too, though. Lemme try. I also tried [removing the lines using morphological operations](https://stackoverflow.com/questions/46274961/removing-horizontal-lines-in-image-opencv-python-matplotlib?noredirect=1#comment79515464_46274961) – mashedpotatoes Oct 09 '19 at 11:36
  • Another filter you could add would be to filter using contour area. If the contour is smaller than some minimum then ignore it. Similarly if area was greater than some maximum, then ignore it. Removing the line using morphological operations is more suited for uniform line segments that have the same width. In your images, they seem to be blobs that are horizontal in nature – nathancy Oct 09 '19 at 19:51
  • I already did something similar to that, as mentioned in the post. I used the ratio of the contour area to the containing contour's area, so it is more robust when the image is distorted, resized, or has a different resolution due to scanner quality. – mashedpotatoes Oct 10 '19 at 10:32
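
Following up on the comments above, one more filter that might be worth trying is the shape of each contour's bounding box instead of its area or point count. The sketch below is an assumption-laden illustration, not a tested fix: `is_line_like` and its thresholds are made up for the example, and boxes are assumed to be `(x, y, w, h)` tuples such as those returned by `cv2.boundingRect`. It rejects thin horizontal strips anywhere, and thin vertical strips only when they touch the left or right edge of the crop, so a handwritten "1" in the middle of the box is kept.

```python
def is_line_like(bbox, crop_w, crop_h, max_thickness_frac=0.08, min_elongation=4.0):
    """Heuristically decide whether a bounding box looks like a stray line.

    bbox is (x, y, w, h); crop_w and crop_h are the size of the cropped
    boxed section. Both thresholds are illustrative assumptions.
    """
    x, y, w, h = bbox
    if w <= 0 or h <= 0:
        return True  # degenerate box, nothing to read

    # Thin and wide: a horizontal strip, e.g. leftover top/bottom border.
    horizontal_strip = (
        h <= max_thickness_frac * crop_h and w >= min_elongation * h
    )

    # Thin and tall is also what a "1" looks like, so only reject it when
    # it hugs the left or right edge of the crop.
    touches_side = x == 0 or x + w >= crop_w
    vertical_edge_strip = (
        w <= max_thickness_frac * crop_w
        and h >= min_elongation * w
        and touches_side
    )

    return horizontal_strip or vertical_edge_strip


# In a 100 x 100 crop: a horizontal strip, a "1"-like stroke, a left-border strip.
for box in [(0, 2, 90, 3), (45, 20, 6, 60), (0, 0, 3, 100)]:
    print(box, is_line_like(box, crop_w=100, crop_h=100))
```

Expressing the thresholds as fractions of the crop size follows the same reasoning as the area ratio: it should behave similarly across resolutions. A character written right against the side border could still be rejected by the vertical rule.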
