HSV color removal/dropout of form fields

Question

I'm writing a system to drop out certain field borders from a form image. The fields may contain handwriting, which I need to preserve even where it crosses the field border.

I have 2 images: 1 color image (converted to the HSV colorspace) and 1 black/white image that line up pixel for pixel (both are produced by a scanner).

I would like to remove (pluck) the field border pixels from the black and white image, given the colors in the color image.

I have an advantage in that I know a priori the exact location of each field, and the widths/heights of the field border lines.

My current implementation does the following for each field. It scans the field border on the color image and calculates an average HSV value for that border. Since I know exactly where the field border is, I only visit "field border" pixels; I may also visit a few handwriting pixels where they cross the border, but the idea is that they won't skew the average very much. Once I have an "average" HSV value for the field border, I scan the border again and, for each pixel, compute the following delta function:

[image: delta function formula]

If the delta value between the "current" pixel and the average HSV is less than 0.07 (found empirically), I set the pixel to white (the colors are close together); otherwise I keep the pixel as black.
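The two passes described above can be sketched roughly as follows. This is only an illustration: the actual delta formula was posted as an image and is not reproduced here, so `delta` below is a stand-in (an RMS distance over H, S and V with wrap-around hue), and `hue_diff`, `mean_hsv` and `drop_out` are hypothetical helper names. All components are assumed to be scaled to [0, 1], and hue is averaged as an angle, which may differ from a plain arithmetic average. The 0.07 threshold was tuned for the original formula and would likely need re-tuning for any substitute.

```python
import math

def hue_diff(h1, h2):
    """Shortest distance between two hues on the [0, 1) circle."""
    d = abs(h1 - h2) % 1.0
    return min(d, 1.0 - d)

def mean_hsv(pixels):
    """Average (h, s, v) over the border pixels; hue is averaged as an angle."""
    n = len(pixels)
    x = sum(math.cos(2 * math.pi * p[0]) for p in pixels)
    y = sum(math.sin(2 * math.pi * p[0]) for p in pixels)
    h = (math.atan2(y, x) / (2 * math.pi)) % 1.0
    s = sum(p[1] for p in pixels) / n
    v = sum(p[2] for p in pixels) / n
    return (h, s, v)

def delta(p, q):
    """Stand-in distance (an assumption, NOT the formula from the question)."""
    dh = hue_diff(p[0], q[0]) * 2.0  # scale so the largest hue gap is 1
    ds = p[1] - q[1]
    dv = p[2] - q[2]
    return math.sqrt((dh * dh + ds * ds + dv * dv) / 3.0)

def drop_out(border_hsv, threshold=0.07):
    """Pass 1: average the border. Pass 2: flag pixels close to that average.

    Returns a list of booleans; True means "set this pixel to white".
    """
    avg = mean_hsv(border_hsv)
    return [delta(p, avg) < threshold for p in border_hsv]
```

With mostly uniform border pixels and a single dark handwriting pixel, the handwriting pixel barely shifts the average and ends up far from it, so it is kept.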

Here are some examples of a field:

1. Color image: [image]
2. Black & white image, not dropped out: [image]
3. Dropped-out black & white image where saturation is not used in the equation: [image]
4. Actual dropped-out black & white image with the formula used in full (using all 3 components H, S & V): [image]

The formula I'm using to get the 3rd (dropped-out) image is the above formula with the saturation term left out (I was just playing around with things).
This is obviously not sensitive enough to color variations, but the full formula is very sensitive to saturation changes. These are mainly caused by JPEG compression artifacts within the image (example artifacts):

[image: example JPEG compression artifacts]

I think the 4th example is the best because it's very sensitive to color variations, so you're less likely to remove handwriting. The problem is that you're more prone to keep border pixels because of slight color differences caused by scanning or compression artifacts.

What are your thoughts on alleviating some of the color (saturation) variation that occurs within the field border? Would histograms help, perhaps with some quantization to reduce the number of bins?

I'd like to hear any ideas people have.

Thank you.

Have you tried applying any mean or median filtering to your image? These filters might reduce some of the noise/compression artifacts. — Max Allan, Apr 25 '13 at 17:35
One classic answer to the noise problem is [graph cuts](http://en.wikipedia.org/wiki/Graph_cuts_in_computer_vision). — David Eisenstat, Apr 28 '13 at 17:43
For this particular example, you could easily use just luminance as a threshold. If it's below a luminance of about 50%, it's user input, otherwise it's the form. Is there more to your input than what you have here? (And since you have HSV, you could probably substitute V instead of luminance.) — user1118321, May 03 '13 at 22:20

score 0 · Answer 1 · answered Apr 25 '13 at 18:01

You might get some good results if you apply machine learning techniques to this problem.

For instance, if you want to label every pixel in your image as either field border or not field border, you could hand-label the pixels in a few images, compute a set of features (you are currently only using color, but I think oriented gradients might give good results as well), and feed everything into a support vector machine (SVM).

OpenCV provides implementations of SVMs and gradient-based features (descriptors) if you are familiar with C++ or Python.

Alternatively, MATLAB provides code to train SVMs and compute gradient features as well.

score 0 · Answer 2 · answered May 01 '13 at 02:19

I'm not sure that I fully understand your priorities here: the third image looks pretty good to me (much better than the fourth). I do notice that the bottom of the first "S" has a gap.

In any case, as you know the positions of the borders and are scanning those pixels, I suggest compiling statistics on H, S and V for them. For S and V, I suppose you could just calculate the mean and standard deviation. Hue is trickier because angles wrap around and because hue can be undefined; you could quantize it and find the mode (or a window-weighted mode). You could do the same for the non-white contents of the boxes, so that you can quantify the nature of pen strokes vs. box pixels. To narrow your distributions, you could discard any pixels that fall outside x SD as outliers and recalculate the mean and SD. From there, you could classify each pixel according to which of the two distributions it falls closer to.
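A minimal sketch of those statistics, assuming (h, s, v) tuples scaled to [0, 1]. The function names, the 36-bin hue quantization, the 2 SD cut-off, the `hue_tol` scale and the `eps` floor on the SD are all illustrative choices, not values from the answer, and "closer" is approximated here by a sum of squared normalized distances.

```python
import statistics
from collections import Counter

def hue_diff(h1, h2):
    """Shortest distance between two hues on the [0, 1) circle."""
    d = abs(h1 - h2) % 1.0
    return min(d, 1.0 - d)

def hue_mode(hues, bins=36):
    """Quantize hues into bins and return the centre of the fullest bin."""
    counts = Counter(int(h * bins) % bins for h in hues)
    best_bin, _ = counts.most_common(1)[0]
    return (best_bin + 0.5) / bins

def trimmed_stats(values, k=2.0):
    """Mean and SD, recomputed once after discarding values beyond k SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    kept = [v for v in values if abs(v - mean) <= k * sd] or values
    return statistics.fmean(kept), statistics.pstdev(kept)

def fit(pixels):
    """Summarize one class of pixels (border or pen strokes)."""
    return {
        "h": hue_mode([p[0] for p in pixels]),
        "s": trimmed_stats([p[1] for p in pixels]),
        "v": trimmed_stats([p[2] for p in pixels]),
    }

def score(pixel, model, hue_tol=0.05, eps=0.02):
    """Lower means closer to the model; eps keeps a tiny SD from dominating."""
    total = (hue_diff(pixel[0], model["h"]) / hue_tol) ** 2
    for key, value in (("s", pixel[1]), ("v", pixel[2])):
        mean, sd = model[key]
        total += ((value - mean) / max(sd, eps)) ** 2
    return total

def classify(pixel, border_model, pen_model):
    """Pick whichever class model the pixel is closer to."""
    if score(pixel, border_model) <= score(pixel, pen_model):
        return "border"
    return "pen"
```

In use, `fit` would be called once on the scanned border pixels and once on the non-white pixels inside the box, and `classify` would then be applied to each border pixel.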

Optimizations to that would include:

Ignore the H component for low saturations.
When unsure, bias towards border if near known border locations.
When unsure, run a second pass that biases towards pen strokes if there are neighbouring pixels classified already as pen strokes.
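One way these optimizations might look in code, as a sketch only. It assumes a first pass has already labelled each pixel "border", "pen" or "unsure", and that a boolean mask marks pixels near known border locations. The names `hue_term` and `resolve_unsure`, the `min_sat` and `hue_tol` values, and the choice to check pen neighbours before falling back to the border bias are assumptions for illustration.

```python
def hue_term(h, s, model_h, min_sat=0.15, hue_tol=0.05):
    """Hue contribution to a distance score; ignored when saturation is low."""
    if s < min_sat:
        return 0.0  # hue is unreliable for near-grey pixels
    d = abs(h - model_h) % 1.0
    d = min(d, 1.0 - d)  # wrap-around hue distance
    return (d / hue_tol) ** 2

def resolve_unsure(labels, near_border):
    """Second pass over a 2-D grid of "border" / "pen" / "unsure" labels.

    An unsure pixel becomes "pen" if any of its 8 neighbours was labelled
    "pen" in the first pass; otherwise it becomes "border" if it lies near
    a known border location. Anything else is left as "unsure".
    """
    rows, cols = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != "unsure":
                continue
            has_pen_neighbour = any(
                labels[rr][cc] == "pen"
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            )
            if has_pen_neighbour:
                out[r][c] = "pen"
            elif near_border[r][c]:
                out[r][c] = "border"
    return out
```

Only first-pass labels are consulted, so the result does not depend on scan order; propagating "pen" labels iteratively would be a more aggressive variant.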
