
I would like to make substantive annotations to a PDF. In my particular case, this PDF will contain payroll data, but the data don't tend to be discretely tabular. If they were, the annotations wouldn't really matter.

What I want to annotate are the pieces of data that my parsing algorithms will target, as a reference for those algorithms. For instance, an employee's metadata--name, SSN, account numbers, hours, pay rates, etc.--will be distributed consistently within a single document (mostly) but differently by document source, and sometimes differently within a document source (e.g., Payroll Company X may move data fields around a little for different clients, or as they continue iterating on formatting). Annotating will allow for planning the parsing model in advance, as well as serving as a reference. I would make note of the data of interest, whatever name I'd give it in the parsing data model, its relative position on the page, etc. I'm thinking of a numbered grid with labels at the ends of the gridlines, plus some ungridded callouts. These documents can be complex. See the mockup below.

mockup of annotation grid
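
To make the idea concrete, the annotations described above could also be kept as plain data next to the drawing, which makes them easier to refactor. This is only a minimal sketch, not an existing tool: `FieldAnnotation`, `grid_cell`, and the 10-by-20 grid are hypothetical names and values chosen for illustration, and positions are assumed to be stored as fractions of the page width and height.

```python
from dataclasses import dataclass
import string


@dataclass(frozen=True)
class FieldAnnotation:
    """One annotated piece of data on a page (hypothetical structure)."""
    model_name: str  # name of the field in the parsing data model
    label: str       # text of the label as it appears on the page
    x: float         # relative horizontal position: 0.0 = left, 1.0 = right
    y: float         # relative vertical position: 0.0 = top, 1.0 = bottom


def grid_cell(x: float, y: float, cols: int = 10, rows: int = 20) -> str:
    """Map a relative position to a grid label such as 'C7'.

    Columns are lettered A, B, C, ... and rows are numbered from 1,
    matching a grid whose labels sit at the ends of the gridlines.
    """
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("x and y must be between 0.0 and 1.0")
    if not (1 <= cols <= 26) or rows < 1:
        raise ValueError("cols must be 1..26 and rows must be at least 1")
    # Clamp so that a position exactly on the right or bottom edge
    # falls into the last cell instead of one past it.
    col = min(int(x * cols), cols - 1)
    row = min(int(y * rows), rows - 1)
    return f"{string.ascii_uppercase[col]}{row + 1}"


# Example: an SSN field a quarter of the way across and a third of the way down.
ssn = FieldAnnotation(model_name="employee_ssn", label="SSN", x=0.25, y=0.33)
ssn_cell = grid_cell(ssn.x, ssn.y)
```

Storing relative rather than absolute positions is a design choice here: it keeps the notes usable if the same layout turns up at a different page size.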

Trying to mark up a printed document gets messy quickly, and doesn't allow for refactoring. I made some variously successful attempts using Adobe Acrobat Pro, which has anemic-at-best annotation capability (I am happy to be mistaken). Using Inkscape worked far better, but was still kludgy. I expect Illustrator or any other general-purpose vector application will be similar. I don't have access to Visio but have used its online competitors, like Lucidchart and Draw.io, and functionally they're okay but I can't use them with documents containing PII. I looked at PDF Annotator and Okular, and their annotation engines are more geared for highlighting digital text than diagramming.

Is there some type of application I'm overlooking that would make this easier to achieve? It's entirely possible using a vector-illustration application will be the best fit, but maybe it'd be better to convert the PDF to another document format more amenable to this sort of diagramming.

References: I've read the following SO questions, which are variably related but don't seem to really answer my need:

- Systematically annotate a PDF
- Annotating Adobe Reader PDFs with math symbols

Daniel Black

1 Answer


Depending on how the grid lines are laid out in your PDFs, an OCR library might be useful if it includes automated table detection. For example, the LEAD OCR Engine from the LEADTOOLS Recognition library (the one I am familiar with, since I work for the vendor of this toolkit) has an option to detect tables and draw recognition zones using the AutoDetectCells method.

So a form with gridlines like this: Grid

Would be recognized through table detection like this: Grid-Table-Detected

The zone coordinates can then be used to draw the annotations and used to extract the information within.
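
As a rough illustration of using zone coordinates for extraction, the sketch below collects the words whose centers fall inside a zone. It does not use the LEADTOOLS API: `Rect`, `Word`, and `text_in_zone` are hypothetical names, and the word bounding boxes are assumed to come from whatever OCR or text-extraction step you use, in the same coordinate system as the zones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page coordinates (origin at top-left)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains_point(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Word:
    """A recognized word and its bounding box (assumed OCR output)."""
    text: str
    rect: Rect


def text_in_zone(zone: Rect, words: list[Word]) -> str:
    """Join the words whose centers lie inside the zone, in reading order."""
    inside = [
        w for w in words
        if zone.contains_point((w.rect.left + w.rect.right) / 2,
                               (w.rect.top + w.rect.bottom) / 2)
    ]
    # Simplifying assumption: words on one line share the same top edge,
    # so sorting by (top, left) gives top-to-bottom, left-to-right order.
    inside.sort(key=lambda w: (w.rect.top, w.rect.left))
    return " ".join(w.text for w in inside)


# Example with made-up coordinates: a name zone and three recognized words.
name_zone = Rect(left=100, top=50, right=300, bottom=70)
page_words = [
    Word("Doe", Rect(155, 52, 190, 66)),
    Word("Jane", Rect(110, 52, 150, 66)),
    Word("Hours", Rect(400, 52, 450, 66)),
]
employee_name = text_in_zone(name_zone, page_words)
```

The same zone rectangles could be reused to draw the annotation boxes, so the drawing and the extraction stay in sync.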

That said, since you ultimately want to parse the PDFs and extract the employee information from them, you could consider a more direct approach. A recognition library like LEADTOOLS commonly uses detection like this implicitly to extract the text needed. For example, with this library you can define a master form template for each major variation of the document you expect, then use those templates to automatically recognize the form and extract the requested fields using OCR and automatic detection like what was described above.
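
To give a feel for the template idea without tying it to any particular product, here is a minimal sketch that picks a template for a page by counting anchor phrases. This is a deliberately simple stand-in, not how LEADTOOLS performs form recognition: `FormTemplate`, `match_template`, the anchor phrases, and the field zones are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormTemplate:
    """One known layout variation of a payroll document (hypothetical)."""
    name: str
    anchors: tuple[str, ...]  # phrases expected somewhere on the page
    # field name -> (left, top, right, bottom) zone in page coordinates
    field_zones: dict[str, tuple[float, float, float, float]]


def match_template(page_text: str,
                   templates: list[FormTemplate]) -> Optional[FormTemplate]:
    """Return the template with the most anchor phrases found in the text.

    Matching is case-insensitive. Returns None when no anchor matches;
    on a tie, the earliest template in the list wins.
    """
    text = page_text.lower()
    best, best_score = None, 0
    for template in templates:
        score = sum(1 for a in template.anchors if a.lower() in text)
        if score > best_score:
            best, best_score = template, score
    return best


# Example: two made-up layout variations and a page of extracted text.
templates = [
    FormTemplate(
        name="company_x_v1",
        anchors=("Payroll Company X", "Earnings Statement"),
        field_zones={"employee_name": (100, 50, 300, 70)},
    ),
    FormTemplate(
        name="company_y_v1",
        anchors=("Company Y Payroll Services", "Pay Stub"),
        field_zones={"employee_name": (60, 120, 260, 140)},
    ),
]
chosen = match_template("PAYROLL COMPANY X   Earnings Statement   Employee ID",
                        templates)
```

Once a template is chosen, its field zones tell you where to read each value, which is the role your annotated grid would play on paper.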

If this sounds like an approach you could consider, you can look here for more details on this.