Cause for Inconsistent Extractions of PDF Annotations with iText(Sharp)

Question

Scenario:

I have an application that uses iTextSharp to scour PDF files for hyperlinks.

Hyperlinks in PDFs are a sub-type of an "annotation object" in the file structure, so my code essentially (1) reads a file, (2) loops through pages, (3) gets the annotations collection for the page, and (4) extracts the hyperlink annotations for the page.

Issue

Sometimes the "pdf dictionary" object representing a given page does not have a collection of annotations (no /ANNOTS key), so attempts at getting such a collection return null. This is an issue because it happens now and then on pages that have plainly visible and clickable links.

Note that clickable is important here because I understand there may be URL addresses present in the plain text, but I do not care about those, only the actual true-to-life hyperlinks.

Code

I found a similar SO question (http://stackoverflow.com/questions/6959076/reading-hyperlinks-from-pdf-file), but the answer provided is almost exactly the code I'm already using. The key difference is this:

// My code
var pdfAnnotations = (PdfArray)PdfReader.GetPdfObject(pageDict.Get(PdfName.ANNOTS));
foreach (var annotation in pdfAnnotations.ArrayList) { }

// Chris' code
var annotsArray = pageDict.GetAsArray(PdfName.ANNOTS);
foreach (var annotation in annotsArray.ArrayList) { }

// My pageDict.Get() and Chris's pageDict.GetAsArray() calls both
// return null because there is no ANNOTS key present in pageDict.

Question

Why the null value? How can a PDF document with plainly visible/clickable links have no annotations collection? Are there other PdfObject sub-types within the file structure that represent hyperlinks/URI?

Thanks

Can you provide an example of a 1-page PDF with a clickable link where there is no `/ANNOTS` key in the PDF source code? — Kurt Pfeifle, Jul 09 '12 at 17:39
Unfortunately no. These are work files that I can't share with the public, and even if I could, nearly all file-sharing websites are blocked. Makes this situation hard, I know. — one.beat.consumer, Jul 09 '12 at 19:35

Kurt Pfeifle · Answer 1 · 2012-07-09T22:03:36.100

Let me try with a guess then. (With no sample to analyze, there is no way to do anything else.)

BTW, inside PDF code it's never /ANNOTS -- PDF keys are case sensitive! -- it's always /Annots.

In PDF source code, an ASCII string like /Annots as a name object may be represented in any of the following alternative ways. These are all 'legal' according to the PDF spec (see Paragraph 7.3.5, Name Objects, of the PDF-1.7 specification):

 /Annots
 /#41nnots      # '#41' is the hex representation of ASCII 'A' in PDF
 /A#6Enots      # '#6E' is the hex representation of ASCII 'n' in PDF
 /An#6Eots      # '#6E' is the hex representation of ASCII 'n' in PDF
 /A#6E#6Eots    # '#6E' is the hex representation of ASCII 'n' in PDF
 ...
 /Annot#73      # '#73' is the hex representation of ASCII 's' in PDF

You get the idea... (Each of the six characters can be written either literally or as a hex escape, so there are 2^6 = 64 spellings of this one name, the plain one included.)

This, BTW, is one of the simplest means that blackhat hackers use to obfuscate a /JavaScript key (written as /#4Aava#53cript, for example) in their malware PDFs! (For a more complete list of their potential methods, see the 'Corkami Project'.)
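To make the escaping rule concrete, here is a minimal, library-independent sketch in Python. It is not what iTextSharp does internally; the function names `decode_pdf_name` and `escaped_variants` are made up for illustration, and accepting lowercase hex digits is an assumption about how tolerant a parser should be.

```python
import re
from itertools import product

# '#' followed by two hex digits stands for the byte with that value.
# Accepting lowercase hex digits here is an assumption (a tolerant reading).
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_pdf_name(raw: bytes) -> bytes:
    """Resolve #xx escapes in a raw name token such as b'/#41nnots'."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def escaped_variants(name: str):
    """Yield every spelling of /name in which each character is written
    either literally or as an uppercase #xx escape."""
    choices = [(ch, "#%02X" % ord(ch)) for ch in name]
    for combo in product(*choices):
        yield "/" + "".join(combo)


if __name__ == "__main__":
    for spelling in ("/Annots", "/#41nnots", "/A#6E#6Eots", "/Annot#73"):
        decoded = decode_pdf_name(spelling.encode("ascii")).decode("ascii")
        print(spelling, "->", decoded)
```

Every spelling yielded by `escaped_variants("Annots")` decodes back to /Annots, which is why a lookup has to compare decoded names rather than the raw bytes found in the file.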

Maybe your version of iTextSharp (which you didn't state) doesn't correctly handle your search for all representations of the /Annots name key?

If so, then my suggestion is that you normalize a copy of each PDF before you look for your /Annots. You can achieve this with the command-line tool qpdf (or with its API):

 qpdf --qdf helloworld.pdf qdf---helloworld.pdf

Let's see:

 kp@mbp:~$  grep nnots helloworld.pdf
      /#41nnots 57 0 R

 kp@mbp:~$  qpdf --qdf helloworld.pdf qdf---helloworld.pdf

 kp@mbp:~$  grep nnots qdf---helloworld.pdf
 qdf---helloworld.pdf:     /Annots 57 0 R
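If you have many files to check, the same qpdf call can be scripted. The following is only a sketch, assuming qpdf is installed and on the PATH; the helper names `qdf_command` and `normalize` and the output directory are hypothetical, and the only qpdf option used is the `--qdf` invocation shown above.

```python
import shutil
import subprocess
from pathlib import Path


def qdf_command(src: Path, dst: Path) -> list:
    """Build the same command line as above: qpdf --qdf <in> <out>."""
    return ["qpdf", "--qdf", str(src), str(dst)]


def normalize(src: Path, out_dir: Path) -> Path:
    """Write a QDF-normalized copy of src into out_dir and return its path."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    # Same naming scheme as in the shell session: qdf---<original name>
    dst = out_dir / ("qdf---" + src.name)
    # check=True raises CalledProcessError on any non-zero exit status
    subprocess.run(qdf_command(src, dst), check=True)
    return dst
```

Calling `normalize(Path("helloworld.pdf"), Path("normalized"))` would leave the original untouched and give you a copy to feed to iTextSharp.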

score 0 · Answer 2 · answered Jul 09 '12 at 20:28

I'm pretty sure there are no other Link-like PDF objects (aside from Outline/Bookmark elements and embedded JavaScript-related stuff) that you need to worry about. But some readers find URL patterns in the text and go ahead and make these clickable, even though they are not encoded as Link annotations. Without a PDF to look at, the best guess is that this is what is happening in your case. (You can test this by creating a PDF with a simple URL in the text, but no Link annotation, and seeing whether your reader makes it clickable.)
