Extract hyperlinks from PDF in Python

Question

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the PDF. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract the text. However, it does not extract the hyperlinks.

For example, I have text that says Check this link out, with a link attached to it. I am able to extract the words Check this link out, but what I really need is the hyperlink itself, not the words.

How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.

I have looked at itextsharp, but haven't used it. I'm running on Ubuntu, and would appreciate any help.

score 13 · Answer 1 · answered May 24 '19 at 21:23

A slightly modified version of Ashwin's answer:

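# Uses the legacy PyPDF2 API (PdfFileReader was removed in PyPDF2 3.0)
import PyPDF2
PDFFile = open("file.pdf",'rb')

PDF = PyPDF2.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):
    print("Current Page: {}".format(page))
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    if key in pageObject.keys():
        ann = pageObject[key]
        for a in ann:
            # Annotations are usually indirect references; resolve them first
            u = a.getObject()
            # Not every annotation has an action (/A), so check before indexing
            if ank in u and uri in u[ank].keys():
                print(u[ank][uri])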

The PdfFileReader constructor accepts the file name as a parameter, so the PDFFile object is not required! – shantanuo, Jun 30 '19 at 07:29

score 11 · Answer 2 · answered Apr 02 '18 at 16:13

This is an old question, but it seems a lot of people look at it (including me while trying to answer this question), so I am sharing the answer I came up with. As a side note, it helps a lot to learn how to use the Python debugger (pdb) so you can inspect these objects on-the-fly.

It is possible to get the hyperlinks using PDFMiner. The complication, as with so much about PDFs, is that there is no real relationship between the link annotations and the text of the link, except that both occupy the same region of the page.

Here is the code I used to get the links on a PDFPage:

annotationList = []
if page.annots:
    for annotation in page.annots.resolve():
        annotationDict = annotation.resolve()
        if str(annotationDict["Subtype"]) != "/Link":
            # Skip over any annotations that are not links
            continue
        position = annotationDict["Rect"]
        uriDict = annotationDict["A"].resolve()
        # This has always been true so far.
        assert str(uriDict["S"]) == "/URI"
        # Some of my URI's have spaces.
        uri = uriDict["URI"].replace(" ", "%20")
        annotationList.append((position, uri))

Then I defined a function like:

def getOverlappingLink(annotationList, element):
    for (x0, y0, x1, y1), url in annotationList:
        if x0 > element.x1 or element.x0 > x1:
            continue
        if y0 > element.y1 or element.y0 > y1:
            continue
        return url
    else:
        return None

which I used to search the annotationList I previously found on the page to see if any hyperlink occupies the same region as a LTTextBoxHorizontal that I was inspecting on the page.

In my case, since PDFMiner was consolidating too much text together in the text box, I walked through the _objs attribute of each text box and looked through all of the LTTextLineHorizontal instances to see if they overlapped any of the annotation positions.
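The line-by-line matching described above can be sketched without PDFMiner installed. This is only an illustration, not the answerer's actual code: `TextLine` is a hypothetical stand-in for LTTextLineHorizontal (which exposes the same x0, y0, x1, y1 attributes), `link_lines` is a made-up helper name, and the coordinates and URL are invented sample data. It also assumes each annotation rectangle is already ordered as (x0, y0, x1, y1).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


@dataclass
class TextLine:
    """Hypothetical stand-in for a PDFMiner LTTextLineHorizontal."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(rect: Rect, line: TextLine) -> bool:
    """Return True if the annotation rectangle and the line's box intersect."""
    x0, y0, x1, y1 = rect
    return not (x0 > line.x1 or line.x0 > x1 or y0 > line.y1 or line.y0 > y1)


def link_lines(
    annotations: List[Tuple[Rect, str]], lines: List[TextLine]
) -> List[Tuple[str, Optional[str]]]:
    """Pair each text line with the first overlapping link URI, or None."""
    paired = []
    for line in lines:
        url = next((u for rect, u in annotations if overlaps(rect, line)), None)
        paired.append((line.text, url))
    return paired


# Invented sample data: one link rectangle and two text lines
annotations = [((72.0, 700.0, 180.0, 712.0), "http://example.com/")]
lines = [
    TextLine("Check this link out", 72.0, 701.0, 178.0, 711.0),
    TextLine("Plain paragraph text", 72.0, 650.0, 300.0, 660.0),
]
for text, url in link_lines(annotations, lines):
    print(text, "->", url)
```

With real PDFMiner layout objects, the same comparison would run over the lines found inside each text box; working at line level rather than box level keeps a link from being attributed to a whole paragraph.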

score 6 · Answer 3 · edited Feb 10 '15 at 06:42

I think you could do that with pyPdf, if you want to extract the links from the PDF. I am not sure where I got this from, but it resides in my code as part of something else. Hope this helps:

import pyPdf  # legacy library; PyPDF2 is a later fork of it

PDFFile = open('File Location','rb')

PDF = pyPdf.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):

    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()

    if key in pageObject:
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            # Skip annotations that have no action or no URI
            if ank in u and uri in u[ank]:
                print(u[ank][uri])

This should, I hope, give you the links in your PDF. P.S. I haven't tested this extensively.

This seems to work fine, but is there any way I could extract the text that encloses the hyperlink and modify it? – Sundeep Pidugu, Apr 22 '19 at 07:07

score 1 · Answer 4 · answered Jan 31 '21 at 08:48

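import pikepdf

pdf_file = pikepdf.Pdf.open("pdf.pdf")
urls = []
for page in pdf_file.pages:
    annots = page.get("/Annots")
    if annots is None:
        continue  # this page has no annotations at all
    for annot in annots:
        action = annot.get("/A")
        # Annotations without an action (or without a URI) are skipped
        url = action.get("/URI") if action is not None else None
        if url is not None:
            urls.append(str(url))
print(" ; ".join(urls))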

You will get a semicolon-separated list of the links in the given PDF.

score 0 · Answer 5 · answered Jan 02 '15 at 15:16

The hyperlink will actually be an annotation, so you need to process the annotation rather than 'extract the text'. I suspect that you are going to need to use a library such as itextsharp, or MuPDF, or Ghostscript if you are really desperate (and comfortable programming in PostScript).

I'd have thought it relatively easy to process the annotations, looking for those of subtype Link.

KenS

I needed both the text as well as the hyperlink, and so I extracted the text. And I'm not exactly sure what you mean by process the annotation... Could you explain that? I'm a bit of an amateur. – Randomly Named User Jan 02 '15 at 15:22

You need to use a library which will locate and return all the annotations on a given page (or in the Outlines tree) and return the dictionary describing them. This should contain both the text to be drawn, and the URL. I'm sorry but I can't tell you which library to use or how to use it, I don't know of any that will do this. – KenS Jan 02 '15 at 18:55
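To make "processing the annotation" a little more concrete, here is a rough sketch of what a link annotation dictionary tends to look like once a library has resolved it. Plain Python dicts stand in for the library's objects, the sample entries are hand-written rather than taken from a real file, and `link_uris` is a hypothetical helper name. The keys (/Subtype, /Rect, /A, /S, /URI) are the ones the other answers rely on. Note that, in this shape, the dictionary holds the clickable rectangle and the target but not the visible link text.

```python
def link_uris(annotations):
    """Collect the URI of every link annotation that carries a URI action."""
    uris = []
    for annot in annotations:
        if annot.get("/Subtype") != "/Link":
            continue  # not a link (for example a sticky note or highlight)
        action = annot.get("/A", {})
        if action.get("/S") == "/URI" and "/URI" in action:
            uris.append(action["/URI"])
    return uris


# Hand-written sample data mimicking a page's /Annots array
page_annots = [
    {"/Subtype": "/Link", "/Rect": [72, 700, 180, 712],
     "/A": {"/S": "/URI", "/URI": "http://example.com/"}},
    {"/Subtype": "/Link", "/Rect": [72, 650, 180, 662],
     "/A": {"/S": "/GoTo", "/D": "chapter2"}},  # internal jump, no URI
    {"/Subtype": "/Text", "/Rect": [10, 10, 30, 30]},  # sticky note, not a link
]
print(link_uris(page_annots))
```

Only the first entry yields a URI; the second is a link to a place inside the document, and the third is not a link at all.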

score 0 · Answer 6 · answered Sep 27 '19 at 18:04

Here's a version that creates a list of URLs in the simplest way I could find:

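import PyPDF2

pdf = PyPDF2.PdfFileReader('filename.pdf')

urls = []
for page in range(pdf.numPages):
    pdfPage = pdf.getPage(page)
    try:
        for item in pdfPage['/Annots']:
            # Resolve indirect references; subscripting one directly raises TypeError
            annotation = item.getObject()
            # Skip annotations that have no action or no URI
            if '/A' in annotation and '/URI' in annotation['/A']:
                urls.append(annotation['/A']['/URI'])
    except KeyError:
        # The page has no /Annots entry
        pass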

weebsnore

Fails with "TypeError: 'IndirectObject' object is not subscriptable" on the item lookup. – gasstationwithoutpumps Sep 05 '20 at 18:10
