using ITextSharp to extract and update links in an existing PDF

Question

I need to post several (read: a lot) PDF files to the web but many of them have hard coded file:// links and links to non-public locations. I need to read through these PDFs and update the links to the proper locations. I've started writing an app using itextsharp to read through the directories and files, find the PDFs and iterate through each page. What I need to do next is find the links and then update the incorrect ones.

string path = "c:\\html";
DirectoryInfo rootFolder = new DirectoryInfo(path);

foreach (DirectoryInfo di in rootFolder.GetDirectories())
{
    // get pdf
    foreach (FileInfo pdf in di.GetFiles("*.pdf"))
    {
        string contents = string.Empty;
        Document doc = new Document();
        PdfReader reader = new PdfReader(pdf.FullName);

        using (MemoryStream ms = new MemoryStream())
        {
            PdfWriter writer = PdfWriter.GetInstance(doc, ms);
            doc.Open();

            for (int p = 1; p <= reader.NumberOfPages; p++)
            {
                byte[] bt = reader.GetPageContent(p);

            }
        }
    }
}

Quite frankly, once I get the page content I'm rather lost on this when it comes to iTextSharp. I've read through the itextsharp examples on sourceforge, but really didn't find what I was looking for.

Any help would be greatly appreciated.

Thanks.

score 33 · Accepted Answer · edited May 23 '17 at 12:33

This one is a little complicated if you don't know the internals of the PDF format and iText/iTextSharp's abstraction/implementation of it. You need to understand how to use PdfDictionary objects and look things up by their PdfName key. Once you get that you can read through the official PDF spec and poke around a document pretty easily. If you do care I've included the relevant parts of the PDF spec in parenthesis where applicable.

Anyways, a link within a PDF is stored as an annotation (PDF Ref 12.5). Annotations are page-based so you need to first get each page's annotation array individually. There's a bunch of different possible types of annotations so you need to check each one's SUBTYPE and see if its set to LINK (12.5.6.5). Every link should have an ACTION dictionary associated with it (12.6.2) and you want to check the action's S key to see what type of action it is. There's a bunch of possible ones for this, link's specifically could be internal links or open file links or play sound links or something else (12.6.4.1). You are looking only for links that are of type URI (note the letter I and not the letter L). URI Actions (12.6.4.7) have a URI key that holds the actual address to navigate to. (There's also an IsMap property for image maps that I can't actually imagine anyone using.)

Whew. Still reading? Below is a full working VS 2010 C# WinForms app based on my post here targeting iTextSharp 5.1.1.0. This code does two main things: 1) Create a sample PDF with a link in it pointing to Google.com and 2) replaces that link with a link to bing.com. The code should be pretty well commented but feel free to ask any questions that you might have.

using System;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text;
using iTextSharp.text.pdf;
using System.IO;

namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {

        //Folder that we are working in
        private static readonly string WorkingFolder = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Hyperlinked PDFs");
        //Sample PDF
        private static readonly string BaseFile = Path.Combine(WorkingFolder, "OldFile.pdf");
        //Final file
        private static readonly string OutputFile = Path.Combine(WorkingFolder, "NewFile.pdf");

        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            CreateSamplePdf();
            UpdatePdfLinks();
            this.Close();
        }

        private static void CreateSamplePdf()
        {
            //Create our output directory if it does not exist
            Directory.CreateDirectory(WorkingFolder);

            //Create our sample PDF
            using (iTextSharp.text.Document Doc = new iTextSharp.text.Document(PageSize.LETTER))
            {
                using (FileStream FS = new FileStream(BaseFile, FileMode.Create, FileAccess.Write, FileShare.Read))
                {
                    using (PdfWriter writer = PdfWriter.GetInstance(Doc, FS))
                    {
                        Doc.Open();

                        //Turn our hyperlink blue
                        iTextSharp.text.Font BlueFont = FontFactory.GetFont("Arial", 12, iTextSharp.text.Font.NORMAL, iTextSharp.text.BaseColor.BLUE);

                        Doc.Add(new Paragraph(new Chunk("Go to URL", BlueFont).SetAction(new PdfAction("http://www.google.com/", false))));

                        Doc.Close();
                    }
                }
            }
        }

        private static void UpdatePdfLinks()
        {
            //Setup some variables to be used later
            PdfReader R = default(PdfReader);
            int PageCount = 0;
            PdfDictionary PageDictionary = default(PdfDictionary);
            PdfArray Annots = default(PdfArray);

            //Open our reader
            R = new PdfReader(BaseFile);
            //Get the page cont
            PageCount = R.NumberOfPages;

            //Loop through each page
            for (int i = 1; i <= PageCount; i++)
            {
                //Get the current page
                PageDictionary = R.GetPageN(i);

                //Get all of the annotations for the current page
                Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);

                //Make sure we have something
                if ((Annots == null) || (Annots.Length == 0))
                    continue;

                //Loop through each annotation

                foreach (PdfObject A in Annots.ArrayList)
                {
                    //Convert the itext-specific object as a generic PDF object
                    PdfDictionary AnnotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(A);

                    //Make sure this annotation has a link
                    if (!AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.LINK))
                        continue;

                    //Make sure this annotation has an ACTION
                    if (AnnotationDictionary.Get(PdfName.A) == null)
                        continue;

                    //Get the ACTION for the current annotation
                    PdfDictionary AnnotationAction = (PdfDictionary)AnnotationDictionary.Get(PdfName.A);

                    //Test if it is a URI action
                    if (AnnotationAction.Get(PdfName.S).Equals(PdfName.URI))
                    {
                        //Change the URI to something else
                        AnnotationAction.Put(PdfName.URI, new PdfString("http://www.bing.com/"));
                    }
                }
            }

            //Next we create a new document add import each page from the reader above
            using (FileStream FS = new FileStream(OutputFile, FileMode.Create, FileAccess.Write, FileShare.None))
            {
                using (Document Doc = new Document())
                {
                    using (PdfCopy writer = new PdfCopy(Doc, FS))
                    {
                        Doc.Open();
                        for (int i = 1; i <= R.NumberOfPages; i++)
                        {
                            writer.AddPage(writer.GetImportedPage(R, i));
                        }
                        Doc.Close();
                    }
                }
            }
        }
    }
}

EDIT

I should note, this only changes the actual link. Any text within the document won't get updated. Annotations are drawn on top of text but aren't really tied to the text underneath in anyway. That's another topic completely.

Wow, upvote for detailed references to PDF spec and a very detailed answer with sample code. I am going to look through it and see if we can use the concepts (older version of Java iText). — Philip Tenn, Oct 18 '13 at 19:28
Is it possible to execute a javascript instead of doing a hyperlink action without losing the formation of the text within the document? — Florian Leitgeb, Jun 12 '14 at 11:36
@Floeee, you can definitely do this. You'll need to change the value of `/S` to `/JAVASCRIPT` and use a `/JS` entry for your actual JavaScript. If you need more help most a new question here and I can answer better. — Chris Haas, Jun 12 '14 at 14:19

tofo · Answer 2 · 2014-02-23T13:47:34.167

Noted if the Action is indirect it will not return a dictionary and you will have an error:

PdfDictionary AnnotationAction = (PdfDictionary)AnnotationDictionary.Get(PdfName.A);

In cases of possible indirect dictionaries:

PdfDictionary Action = null;

//Get action directly or by indirect reference
PdfObject obj = Annotation.Get(PdfName.A);
if (obj.IsIndirect) {
    Action = PdfReader.GetPdfObject(obj);
} else {
    Action = (PdfDictionary)obj;
}

In that case you have to investigate the returned dictionary to figure out where the URI is found. As with an indirect /Launch dictionary the URI is located in the /F item being of type PRIndirectReference with the /Type being a /FileSpec and the URI located in the value of /F

score 2 · Answer 3 · answered Dec 09 '14 at 17:51

Added code for dealing with indirect and launch actions and null annotation-dictionary:

PdfReader r = new PdfReader(@"d:\kb2\" + f);
for (int i = 1; i <= r.NumberOfPages; i++) {
    //Get the current page
    var PageDictionary = r.GetPageN(i);

    //Get all of the annotations for the current page
    var Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);

    //Make sure we have something
    if ((Annots == null) || (Annots.Length == 0))
        continue;
    foreach (var A in Annots.ArrayList) {
        var AnnotationDictionary = PdfReader.GetPdfObject(A) as PdfDictionary;
        if (AnnotationDictionary == null)
            continue;
        //Make sure this annotation has a link
        if (!AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.LINK))
            continue;

        //Make sure this annotation has an ACTION
        if (AnnotationDictionary.Get(PdfName.A) == null)
            continue;

        var annotActionObject = AnnotationDictionary.Get(PdfName.A);
        var AnnotationAction = (PdfDictionary)(annotActionObject.IsIndirect() ? PdfReader.GetPdfObject(annotActionObject) : annotActionObject); 

        var type = AnnotationAction.Get(PdfName.S);
        //Test if it is a URI action
        if (type.Equals(PdfName.URI)) {
            //Change the URI to something else
            string relativeRef = AnnotationAction.GetAsString(PdfName.URI).ToString();
            AnnotationAction.Put(PdfName.URI, new PdfString(url));
        } else if (type.Equals(PdfName.LAUNCH)) {
            //Change the URI to something else
            var filespec = AnnotationAction.GetAsDict(PdfName.F);
            string url = filespec.GetAsString(PdfName.F).ToString();
            AnnotationAction.Put(PdfName.F, new PdfString(url));
        }
    }
}
//Next we create a new document add import each page from the reader above
using (var output = File.OpenWrite(outputFile.FullName)) {
    using (Document Doc = new Document()) {
        using (PdfCopy writer = new PdfCopy(Doc, output)) {
            Doc.Open();
            for (int i = 1; i <= r.NumberOfPages; i++) {
                writer.AddPage(writer.GetImportedPage(r, i));
            }
            Doc.Close();
        }
    }
}
r.Close();

using ITextSharp to extract and update links in an existing PDF

3 Answers3

Linked