I'm creating an OCR application. It extracts handwritten characters from a boxed section in a scanned or photographed printed form, and reads them using a CNN.
It successfully extracts characters using contours, but in some cases stray lines are picked up as contours too. These lines seem to be either plain noise or pixels left over from cropping the boxed section. The boxed section itself is cropped using contours.
Basically, it works when the form is scanned with a good scanner and saved as PNG. Otherwise, it doesn't work as well. I need it to handle JPEG files and poor-quality cameras and scanners too.
So this is more a question of which techniques I could use, in theory.
I'd like to either remove the lines or make the code ignore them.
I've tried:
- "Padding" the cropped boxed section by a negative number n, so that it removes n pixels from each side instead. This can't be pushed far, though, as it also eats into the pixels of the characters.
- Using the morphological "close" operation. Changing the kernel size makes almost no difference, though.
- Implementing a check on the ratio of contour area to boxed-section area. If a contour's ratio is not within the allowed range, it's ignored.
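For reference, the area-ratio check from the last bullet can be sketched roughly as below. This is a minimal illustration, not my actual code: `keep_box`, the threshold values and the sample boxes are all made up, and the boxes are assumed to be `(x, y, w, h)` tuples such as those returned by `cv2.boundingRect`. The extra aspect-ratio test is an assumption that is not part of what I tried; it is there because a long thin strip can still pass an area-only test.

```python
def keep_box(box, section_area, min_ratio=0.005, max_ratio=0.5, max_aspect=6.0):
    """Return True if a bounding box (x, y, w, h) plausibly holds a character.

    All thresholds are hypothetical and would need tuning on real forms.
    """
    x, y, w, h = box
    if w <= 0 or h <= 0:
        return False
    # Area-ratio test: box area relative to the whole boxed section.
    ratio = (w * h) / section_area
    if not (min_ratio <= ratio <= max_ratio):
        return False
    # Assumed extra test: reject very elongated boxes (thin strips).
    aspect = max(w, h) / min(w, h)
    return aspect <= max_aspect


# Hypothetical 400x100 boxed section with two characters and two strips.
section_w, section_h = 400, 100
boxes = [(10, 20, 30, 50), (0, 0, 400, 3), (395, 0, 3, 100), (60, 25, 28, 45)]
kept = [b for b in boxes if keep_box(b, section_w * section_h)]
print(kept)  # [(10, 20, 30, 50), (60, 25, 28, 45)]
```

Both strips here pass the area test and are rejected only by the aspect-ratio test. The catch is that narrow characters such as "1" or "I" can also be quite elongated, so `max_aspect` cannot be set too low.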
Here's what it looks like:
The grey parts outline the detected contours. The numbers indicate the index of each contour, in the order they were detected. Notice that strips of lines are detected too. I want to get rid of these.
Besides interfering with the model and making it spout nonsense when it tries to interpret them, the lines also seem to cause this error in some cases:
ValueError: cannot reshape array of size 339 into shape (1,28,28,1)
Maybe I'll start with investigating this in the meantime.
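A note on the error itself: a `(1, 28, 28, 1)` array needs exactly 28 * 28 = 784 values, so a crop with 339 pixels cannot be reshaped into it. 339 = 3 * 113, which would be consistent with a thin strip (for example 3 x 113 pixels) reaching the model without being resized first, though I haven't confirmed that. Below is a minimal sketch of a guard that always produces the right shape. `to_model_input` is a hypothetical helper, it assumes a 2-D grayscale crop with a zero-valued background, and it uses plain NumPy nearest-neighbour sampling where a real pipeline would more likely call `cv2.resize`.

```python
import numpy as np


def to_model_input(crop, size=28):
    """Pad a 2-D grayscale crop to a square, resample it to size x size,
    and reshape it to (1, size, size, 1)."""
    crop = np.asarray(crop)
    if crop.ndim != 2 or crop.size == 0:
        raise ValueError("expected a non-empty 2-D grayscale crop")
    h, w = crop.shape
    side = max(h, w)
    # Centre the crop on a square canvas so the aspect ratio is preserved.
    # Assumption: background pixels are 0 (ink is the bright value).
    canvas = np.zeros((side, side), dtype=crop.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = crop
    # Nearest-neighbour resampling: pick `size` evenly spaced rows and columns.
    idx = np.arange(size) * side // size
    resized = canvas[np.ix_(idx, idx)]
    return resized.reshape(1, size, size, 1)


strip = np.ones((3, 113), dtype=np.uint8)  # 3 * 113 = 339 pixels
batch = to_model_input(strip)
print(batch.shape)  # (1, 28, 28, 1)
```

This only stops the crash. A strip resized this way would still be fed to the CNN and misread, so it would have to be filtered out beforehand.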