How to read hyperlinks with anchor text from a PDF file in C#

Question

I have extracted the link targets from a PDF file, for example http://google.com, but I also need the anchor text, for example "click here". How can I get the anchor text of a link?

I extracted the URLs using the approach described in "Reading hyperlinks from pdf file". For example, given a link created like this:

Anchor a = new Anchor("Test Anchor");
a.Reference = "http://www.google.com";
myParagraph.Add(a);

Here I get http://www.google.com, but I need the anchor text, i.e. Test Anchor.

Need your suggestions.

score 5 · Answer 1 · edited May 17 '13 at 07:05

You need to find the rectangle that each link annotation occupies on the page and then use iTextSharp to extract the text lying inside that rectangle.

This extracts the text underneath the link. The limitation is that if the link rectangle is larger than the anchor text, the extraction returns all of the text that falls inside the rectangle, not just the anchor text.

using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

private List<KeyValuePair<string, string>> GetAllHyperlinksFromPDFDocument(string pdfFilePath)
{
    //Each entry pairs a link URL (Key) with the text found under the link rectangle (Value)
    List<KeyValuePair<string, string>> ret = new List<KeyValuePair<string, string>>();

    PdfReader R = new PdfReader(pdfFilePath);
    try
    {
        //Loop through each page
        for (int i = 1; i <= R.NumberOfPages; i++)
        {
            //Get the current page
            PdfDictionary PageDictionary = R.GetPageN(i);

            //Get all of the annotations for the current page
            PdfArray Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);

            //Make sure we have something
            if ((Annots == null) || (Annots.Size == 0))
                continue;

            //Loop through each annotation
            foreach (PdfObject A in Annots.ArrayList)
            {
                //Resolve the (possibly indirect) reference to the annotation dictionary
                PdfDictionary AnnotationDictionary = PdfReader.GetPdfObject(A) as PdfDictionary;
                if (AnnotationDictionary == null)
                    continue;

                //Make sure this annotation is a link
                if (!PdfName.LINK.Equals(AnnotationDictionary.Get(PdfName.SUBTYPE)))
                    continue;

                //Make sure this annotation has a URI ACTION
                PdfDictionary AnnotationAction = AnnotationDictionary.GetAsDict(PdfName.A);
                if (AnnotationAction == null || !PdfName.URI.Equals(AnnotationAction.Get(PdfName.S)))
                    continue;

                //Get action link URL
                PdfString Link = AnnotationAction.GetAsString(PdfName.URI);
                if (Link == null)
                    continue;
                string linkReference = Link.ToString();

                //Get action link text: extract only the text inside the annotation rectangle
                PdfArray LinkLocation = AnnotationDictionary.GetAsArray(PdfName.RECT);
                if (LinkLocation == null || LinkLocation.Size < 4)
                    continue;
                iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(
                    ((PdfNumber)LinkLocation[0]).FloatValue,
                    ((PdfNumber)LinkLocation[1]).FloatValue,
                    ((PdfNumber)LinkLocation[2]).FloatValue,
                    ((PdfNumber)LinkLocation[3]).FloatValue);
                RenderFilter[] renderFilter = new RenderFilter[] { new RegionTextRenderFilter(rect) };
                ITextExtractionStrategy textExtractionStrategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
                string linkText = PdfTextExtractor.GetTextFromPage(R, i, textExtractionStrategy).Trim();

                ret.Add(new KeyValuePair<string, string>(linkReference, linkText));
            }
        }
    }
    finally
    {
        //Release the file handle even if parsing fails
        R.Close();
    }

    return ret;
}
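One detail the method above takes on trust is the corner order of the /Rect array. The PDF specification allows a rectangle to be given by any two diagonally opposite corners, so the four numbers are not guaranteed to arrive as lower-left followed by upper-right. The following is a small library-free sketch of reordering the values before building the region filter; `RectHelper` and `Normalize` are hypothetical names for illustration, not part of iTextSharp.

```csharp
using System;

public static class RectHelper
{
    // Hypothetical helper: reorders two opposite corners (x1, y1) and (x2, y2)
    // into [left, bottom, right, top], whatever order they were supplied in.
    public static float[] Normalize(float x1, float y1, float x2, float y2)
    {
        return new float[]
        {
            Math.Min(x1, x2), // left
            Math.Min(y1, y2), // bottom
            Math.Max(x1, x2), // right
            Math.Max(y1, y2)  // top
        };
    }
}
```

The four returned values could then be passed, in order, to the iTextSharp.text.Rectangle constructor used above.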

score 2 · Answer 2 · answered Apr 06 '12 at 15:54

Unfortunately I don't think you're going to be able to do this, at least not without a lot of guess-work. In HTML this would be easy because a hyperlink and its text are stored together as:

<a href="http://www.example.com/">Click here</a>

However, in a PDF these two entities are not stored with any form of relationship. What we think of as a "hyperlink" within a PDF is technically a PDF Annotation that just happens to be sitting on top of text. You can see this by opening a PDF in an editing program such as Adobe Acrobat Pro. You can change the text but the "clickable" area doesn't change. You can also move and resize the "clickable" area and put it anywhere in the document.

When creating PDFs, iText/iTextSharp abstract this away so you don't have to think about this. You can create a "hyperlink" with clickable text but when it generates a PDF it ultimately will create the text as normal text, calculate the rectangle coordinates and then put an annotation at that rectangle.

I did say that you could try to guess at this, and it might or might not work for you. To do this you'd need to get the rectangle of the annotation and then find the text that sits at those same coordinates. It won't be an exact match, however, because of padding issues. If you absolutely have to get the text under a hyperlink then this is the only way that I know of for doing this. Good luck!
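The coordinate matching described above can be sketched without any PDF library. This is only an illustration, not iTextSharp code: the `Box` struct, the `TextChunk` class, the `TextUnder` helper and the 2-point `padding` default are all hypothetical names and values chosen for the sketch. It assumes you have already obtained a bounding box for each piece of page text by some other means, and it keeps every piece whose box overlaps the annotation rectangle once that rectangle has been grown by the padding.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical axis-aligned box in PDF user-space units (origin at bottom-left).
public struct Box
{
    public float Left, Bottom, Right, Top;

    public Box(float left, float bottom, float right, float top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    // Returns a copy grown by pad on every side.
    public Box Inflate(float pad)
    {
        return new Box(Left - pad, Bottom - pad, Right + pad, Top + pad);
    }

    // True when the two boxes share any area.
    public bool Intersects(Box other)
    {
        return Left < other.Right && other.Left < Right
            && Bottom < other.Top && other.Bottom < Top;
    }
}

// Hypothetical pairing of a piece of page text with its bounding box.
public class TextChunk
{
    public string Text;
    public Box Bounds;

    public TextChunk(string text, Box bounds)
    {
        Text = text;
        Bounds = bounds;
    }
}

public static class LinkTextGuesser
{
    // Joins the text of every chunk that overlaps the padded link rectangle,
    // in the order the chunks are supplied.
    public static string TextUnder(Box linkRect, IEnumerable<TextChunk> chunks, float padding = 2f)
    {
        Box padded = linkRect.Inflate(padding);
        List<string> hits = new List<string>();
        foreach (TextChunk chunk in chunks)
        {
            if (padded.Intersects(chunk.Bounds))
                hits.Add(chunk.Text);
        }
        return string.Join(" ", hits);
    }
}
```

A larger padding tolerates annotations that sit slightly off the text, but it also raises the chance of picking up neighbouring words, so this remains a guess in the same sense as the approach itself.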
