Scenario
I have an application that uses iTextSharp to scour PDF files for hyperlinks.
Hyperlinks in PDFs are a sub-type of an "annotation object" in the file structure, so my code essentially (1) reads a file, (2) loops through pages, (3) gets the annotations collection for the page, and (4) extracts the hyperlink annotations for the page.
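The four steps above can be sketched roughly as follows. This is a minimal illustration assuming iTextSharp 5.x; the class and method names (LinkExtractor, ExtractLinks) are hypothetical, and it only collects link annotations that carry a URI action. Pages without an annotations array are skipped rather than dereferenced.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper; the names are illustrative only.
public static class LinkExtractor
{
    public static List<string> ExtractLinks(string path)
    {
        var links = new List<string>();
        var reader = new PdfReader(path);                     // (1) read the file
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)   // (2) loop through pages
            {
                PdfDictionary pageDict = reader.GetPageN(page);

                // (3) get the annotations collection; null when the key is absent
                PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);
                if (annots == null) continue;

                for (int i = 0; i < annots.Size; i++)          // (4) extract hyperlink annotations
                {
                    PdfDictionary annot = annots.GetAsDict(i);
                    if (annot == null) continue;
                    if (!PdfName.LINK.Equals(annot.GetAsName(PdfName.SUBTYPE))) continue;

                    // Only URI actions are collected in this sketch.
                    PdfDictionary action = annot.GetAsDict(PdfName.A);
                    if (action == null) continue;
                    if (!PdfName.URI.Equals(action.GetAsName(PdfName.S))) continue;

                    PdfString uri = action.GetAsString(PdfName.URI);
                    if (uri != null) links.Add(uri.ToString());
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return links;
    }
}
```

With this shape, a page whose dictionary has no annotations array simply contributes no links instead of throwing a NullReferenceException.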
Issue
Sometimes the "pdf dictionary" object representing a given page does not have a collection of annotations (no /Annots key, which iTextSharp exposes as PdfName.ANNOTS). Attempts to get such a collection therefore return null. This is a problem because it happens now and then on pages that have plainly visible and clickable links.
Note that "clickable" is important here: I understand there may be URLs present in the plain text, but I do not care about those, only the actual clickable hyperlinks.
Code
I found a similar SO question (http://stackoverflow.com/questions/6959076/reading-hyperlinks-from-pdf-file), but the answer provided is almost exactly the code I'm already using. The key difference is this:
// My code
var pdfAnnotations = (PdfArray)PdfReader.GetPdfObject(pageDict.Get(PdfName.ANNOTS));
foreach (var annotation in pdfAnnotations.ArrayList) { }

// Chris's code
var annotsArray = pageDict.GetAsArray(PdfName.ANNOTS);
foreach (var annotation in annotsArray.ArrayList) { }

// My pageDict.Get() and Chris's pageDict.GetAsArray() calls both
// return null because there is no ANNOTS key present in pageDict.
Question
Why the null value? How can a PDF document with plainly visible/clickable links have no annotations collection? Are there other PdfObject sub-types within the file structure that represent hyperlinks/URIs?
Thanks