I'm trying to read a PDF file and get all of the hyperlinks from it. I'm using iTextSharp for C# on .NET.

    PdfReader reader = new PdfReader("test.pdf");
    List<PdfAnnotation.PdfImportedLink> list = reader.GetLinks(36);

The "GetLinks" method returns a list with a lot of information about the links, but not the value that I want: the hyperlink string. I know for certain that there are hyperlinks on page 36.

levi

2 Answers

PdfReader.GetLinks() is only meant to be used with links internal to the document, not external hyperlinks. Why? I don't know.

The code below is based on code I wrote earlier, but I've limited it to links stored in the PDF as a PdfName.URI. It's possible to store the link as JavaScript that ultimately does the same thing, and there are probably other types, but you'll need to detect those yourself. I don't believe there's anything in the spec that says that a link actually needs to be a URI, it's just implied, so the code below returns a string that you can (probably) convert to a URI on your own.

    private static List<string> GetPdfLinks(string file, int page)
    {
        //Open our reader
        PdfReader R = new PdfReader(file);

        //Get the current page
        PdfDictionary PageDictionary = R.GetPageN(page);

        //Get all of the annotations for the current page
        PdfArray Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);

        //Make sure we have something (Size is the element count; return an empty list so callers need no null check)
        if ((Annots == null) || (Annots.Size == 0))
            return new List<string>();

        List<string> Ret = new List<string>();

        //Loop through each annotation
        foreach (PdfObject A in Annots.ArrayList)
        {
            //Resolve the array entry (usually an indirect reference) to the actual annotation dictionary
            PdfDictionary AnnotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(A);

            //Make sure this annotation has a link
            if (!AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.LINK))
                continue;

            //Get the ACTION for the current annotation (GetAsDict also resolves indirect references)
            PdfDictionary AnnotationAction = AnnotationDictionary.GetAsDict(PdfName.A);

            //Skip annotations that have no ACTION
            if (AnnotationAction == null)
                continue;

            //Test if it is a URI action (there are many other action types, some of which, such as JavaScript, can mimic a URI action, but those need to be handled separately)
            if (PdfName.URI.Equals(AnnotationAction.Get(PdfName.S)))
            {
                PdfString Destination = AnnotationAction.GetAsString(PdfName.URI);
                if (Destination != null)
                    Ret.Add(Destination.ToString());
            }
        }

        return Ret;

    }

And call it:

        string myfile = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Output.pdf");
        List<string> Links = GetPdfLinks(myfile, 1);
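
Since the method returns plain strings, a caller may want to check which of them actually parse as URIs. The following is only a sketch of that step using the standard `Uri.TryCreate` method; the class and method names (`LinkStrings`, `ToAbsoluteUris`) are hypothetical and not part of iTextSharp or of the answer's code.

```csharp
using System;
using System.Collections.Generic;

public static class LinkStrings
{
    // Hypothetical helper: keeps only the strings that parse as absolute URIs.
    // Strings that do not parse are silently skipped.
    public static List<Uri> ToAbsoluteUris(IEnumerable<string> links)
    {
        List<Uri> uris = new List<Uri>();
        if (links == null)
            return uris;

        foreach (string link in links)
        {
            Uri parsed;
            // TryCreate returns false instead of throwing on malformed input
            if (Uri.TryCreate(link, UriKind.Absolute, out parsed))
                uris.Add(parsed);
        }
        return uris;
    }
}
```

It could be called as `List<Uri> uris = LinkStrings.ToAbsoluteUris(Links);`. Skipping unparseable strings is a design choice for this sketch; a stricter caller might log or collect them instead.
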
Chris Haas
  • Chris: your code above is almost exactly like mine, and seems to be working properly most of the time. The problem I am running into is when trying to get `PdfName.ANNOTS` sometimes I get a `null` value when I can plainly see there are hyperlinks in the document. Any thoughts? Thanks. – one.beat.consumer Jul 09 '12 at 16:57
  • The first thing I'd do is open the PDF in Acrobat Pro (if you have it), run Preflight on it, go to Options and then Browse Internal PDF Structure, and see whether any Annots are listed there. Also make sure that you're counting page numbers starting at one and not zero; I've made that mistake many times. If that doesn't help and the file isn't confidential, you can email it to me; my address is in my profile. – Chris Haas Jul 10 '12 at 14:32
  • I would like to attach a JavaScript action depending on the URI I got. I guess you have to attach it to the found PdfObject, but how? – Florian Leitgeb Jun 12 '14 at 09:27

I have noticed that any text in a PDF that looks like a URL can be presented as an annotation link by the PDF viewer. In Adobe Acrobat there is a page display preference under the General tab called "Create links from URLs" that controls this. I was writing code to remove URL link annotations, only to find that there were none; Acrobat was automatically turning text that looked like a URL into what appeared to be an annotation link.
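
If the links in a document are of this viewer-generated kind, there is no annotation to read, so one option is to scan the page text itself. The sketch below assumes the page text has already been extracted into a string (in iTextSharp 5 this could be done with `PdfTextExtractor.GetTextFromPage(reader, page)` from the `iTextSharp.text.pdf.parser` namespace). The class name `UrlTextScanner` and the regular expression are assumptions made for illustration; the pattern is a loose approximation and is not the rule Acrobat uses.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlTextScanner
{
    // Assumed, deliberately loose pattern for "text that looks like a URL":
    // something starting with http://, https:// or www. up to the next
    // whitespace, angle bracket or double quote.
    private static readonly Regex UrlLike = new Regex(
        @"\b(?:https?://|www\.)[^\s<>""]+",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every URL-looking run found in the text.
    public static List<string> FindUrlLikeText(string pageText)
    {
        List<string> found = new List<string>();
        if (string.IsNullOrEmpty(pageText))
            return found;

        foreach (Match m in UrlLike.Matches(pageText))
        {
            // Drop trailing punctuation that usually belongs to the sentence, not the URL
            string candidate = m.Value.TrimEnd('.', ',', ';', ':', ')');
            if (candidate.Length > 0)
                found.Add(candidate);
        }
        return found;
    }
}
```

Text extraction can split a URL across lines or insert stray spaces, so results from a scan like this should be treated as candidates rather than as confirmed links.
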

C. Payton