Last week I was asked to build an application that lets a blind man fill out PDF documents programmatically. The problem is that when the fields in a document aren't labeled correctly, he cannot put his signature and other information in the correct place.

My first approach was to attempt to read the document using iTextSharp and then insert his signature into the field which was most likely to be the signature box:

public string[] MassFieldEdit(IDictionary<string, string> userData, string originalDocument, string edittedDocument, bool flatten)
        {
            PdfReader reader = new PdfReader(originalDocument);
            reader.SelectPages("1-" + reader.NumberOfPages.ToString());
            using (PdfStamper stamper = new PdfStamper(reader, new FileStream(edittedDocument, FileMode.Create)))
            {
                AcroFields form = stamper.AcroFields;
                ICollection<string> fieldKeys = form.Fields.Keys;
                List<string> leftover = new List<string>(fieldKeys);
                foreach (string fieldKey in fieldKeys)
                {
                    foreach (KeyValuePair<string, string> s in userData)
                    {
                        //Replace Form field with my custom data
                        if (fieldKey.ToLower().Contains(s.Key.ToLower()))
                        {
                            form.SetField(fieldKey, s.Value);
                            leftover.Remove(fieldKey);
                        }
                    }
                }
                //The below will make sure the fields are not editable in
                //the output PDF.
                stamper.FormFlattening = flatten;
                return leftover.ToArray();
            }
        }

This takes a dictionary whose keys are words or phrases and checks each key against the PDF field names. If a field name contains the key (ignoring case), the corresponding value is inserted into that field. The names of the fields that matched nothing are returned.
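To show the matching rule on its own, here is a minimal sketch without iTextSharp. The class name `FieldMatchDemo`, the field names, and the user data are all made up for illustration; `IndexOf` with `OrdinalIgnoreCase` stands in for the `ToLower().Contains(...)` test and behaves roughly the same for plain English keys.

```csharp
using System;
using System.Collections.Generic;

public static class FieldMatchDemo
{
    // Maps field name -> value for every field whose name contains a key,
    // ignoring case. If several keys match one field, the last one wins,
    // which mirrors the nested loops in MassFieldEdit.
    public static Dictionary<string, string> Match(
        IEnumerable<string> fieldNames, IDictionary<string, string> userData)
    {
        var result = new Dictionary<string, string>();
        foreach (string field in fieldNames)
        {
            foreach (KeyValuePair<string, string> entry in userData)
            {
                if (field.IndexOf(entry.Key, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    result[field] = entry.Value;
                }
            }
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical field names; real forms vary widely.
        var fields = new[] { "Signature", "Applicant Name", "Bank Name", "topmostSubform[0].f1_3" };
        var userData = new Dictionary<string, string>
        {
            { "signature", "Jane Doe" },
            { "name", "Jane Doe" }
        };
        foreach (KeyValuePair<string, string> pair in Match(fields, userData))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

With this sample data, "Bank Name" also receives the user's name because it contains "name", while the opaquely named last field receives nothing. Those are the two failure modes of substring matching on field names.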

The signature box before my program edits it.

The signature box after.

The problem I have now is that when no field exists, the page may still have "sign here" printed right next to a dotted line, but there is no way to insert text onto that line without knowing exactly where it is. My user can't select the dotted line himself, because that would defeat the point of the program.

I have looked at a number of previous questions and answers, including:

I need a way to detect the signature line and then insert his name onto it with more certainty than taking pot shots at field names, both when a correctly labeled field exists and when the signature line may be no more than a line of text that says "sign here".

TylerH
Kris
  • You are lucky if you get any actual fields and not just a scanned image. But what is the source of the forms? File an ADA complaint and get the forms fixed. – Garr Godfrey Aug 07 '17 at 01:53
  • Some of the examples he has sent me include the TWC Substitute W-9 and Direct Deposit Form and other official documents. I'm looking for a programmatic solution that will work for any type of PDF that contains a signature line, though. I have considered an optical character recognition approach, but I'd like to know if there is a solution already available before I go down that route. – Kris Aug 07 '17 at 02:01

1 Answer

The robust solution (aka "hard work solution")

  1. Implement IEventListener (an iText 7 interface)
  2. Use IEventListener to get notified of text rendering instructions, and line drawing operations
  3. Rendering instructions do not always appear in logical (reading) order. Fix that by implementing a comparator for these objects
  4. Sort according to comparator
  5. Use language detection to determine the language (n-gram approach is simple, but should suffice)
  6. Dictionary attack. Look for all occurrences of words that signify "sign here" in whatever language the document is written in (hence step 5)
  7. In case of multiple candidates, or no candidates, use the line rendering instructions to look for a likely candidate for the infamous "dotted line"
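
Steps 3, 4, 6, and 7 can be sketched without touching the PDF library at all. The block below is only an illustration under stated assumptions: `TextChunk` and `Rule` are hypothetical holders for what the listener would collect (text with its baseline start, and horizontal lines), the phrase list is English only (step 5 is skipped), and whitespace is stripped before matching so that a phrase split across chunks is still found, at the cost of occasional false positives.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical holder for one text rendering event: the text and the start
// of its baseline in PDF user space (y grows upward).
public sealed class TextChunk
{
    public readonly string Text;
    public readonly float X, Y;
    public TextChunk(string text, float x, float y) { Text = text; X = x; Y = y; }
}

// Hypothetical holder for one horizontal line found among the drawing operations.
public sealed class Rule
{
    public readonly float X1, X2, Y;
    public Rule(float x1, float x2, float y) { X1 = x1; X2 = x2; Y = y; }
}

// Steps 3-4: top-to-bottom, then left-to-right. Baselines are bucketed into
// rows so the ordering stays consistent; a raw "within tolerance" test is
// not transitive and can confuse the sort.
public sealed class ReadingOrderComparer : IComparer<TextChunk>
{
    private readonly float _rowHeight;
    public ReadingOrderComparer(float rowHeight) { _rowHeight = rowHeight; }

    public int Row(TextChunk c) { return (int)Math.Round(c.Y / _rowHeight); }

    public int Compare(TextChunk a, TextChunk b)
    {
        int byRow = Row(b).CompareTo(Row(a)); // larger y (higher on the page) first
        return byRow != 0 ? byRow : a.X.CompareTo(b.X);
    }
}

public static class SignatureLineFinder
{
    // Step 6: phrases to look for. Assumed English; extend per detected language.
    private static readonly string[] Phrases = { "sign here", "signature" };

    private static string Squash(string s)
    {
        return new string(s.Where(ch => !char.IsWhiteSpace(ch)).ToArray()).ToLowerInvariant();
    }

    // Returns the rule vertically closest to a row containing a phrase, or null.
    public static Rule Find(List<TextChunk> chunks, List<Rule> rules, float rowHeight)
    {
        var comparer = new ReadingOrderComparer(rowHeight);
        var sorted = new List<TextChunk>(chunks);
        sorted.Sort(comparer);

        Rule best = null;
        float bestDistance = float.MaxValue;
        foreach (var row in sorted.GroupBy(c => comparer.Row(c)))
        {
            string text = Squash(string.Concat(row.Select(c => c.Text)));
            if (!Phrases.Any(p => text.Contains(Squash(p)))) continue;

            float labelY = row.First().Y;
            // Step 7: prefer the rule vertically nearest to the label.
            foreach (Rule rule in rules)
            {
                float distance = Math.Abs(rule.Y - labelY);
                if (distance < bestDistance) { best = rule; bestDistance = distance; }
            }
        }
        return best;
    }

    public static void Main()
    {
        var chunks = new List<TextChunk>
        {
            new TextChunk("here", 110f, 100f),               // out of reading order on purpose
            new TextChunk("Direct Deposit Form", 72f, 700f),
            new TextChunk("Sign", 72f, 100.4f)
        };
        var rules = new List<Rule>
        {
            new Rule(72f, 540f, 690f),  // decorative rule under the title
            new Rule(150f, 400f, 98f)   // the signature line
        };
        Rule hit = Find(chunks, rules, 4f);
        Console.WriteLine(hit == null ? "no candidate" : "write at x=" + hit.X1 + ", y=" + hit.Y);
    }
}
```

For the sample data, the rule at y = 98 is chosen because it sits closest to the row that reads "Sign here". A real implementation would probably also weigh horizontal distance and line length, and would need tuning of the row height per document.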

This approach is not easy, but there is a lot of research into recognizing structural elements in PDF files. In particular, a Google Scholar search turns up plenty of helpful articles in which people have tried detecting tables, lists, paragraphs, etc.

Joris Schellekens