Extract URLs from PDF - text doesn't match URL

Question

I'm using the following code to extract URLs from a PDF. It extracts the anchor text fine, but it does not work when the anchor text differs from the URL behind it. For example, 'www.page.com/A' appears in the text as a short URL, but the actual URL behind it is a longer (full) version.

The code I'm using is:

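import urllib.request
import PyPDF2

# Placeholder address (not in the original snippet); replace with the real PDF URL.
url = "https://example.com/file.pdf"

# Download the PDF to a local file and open it (legacy PyPDF2 1.x/2.x API).
urllib.request.urlretrieve(url, "remoteFile")
pdfFile = PyPDF2.PdfFileReader("remoteFile", strict=False)

key = "/Annots"
uri = "/URI"
ank = "/A"
mylist = []

for page_no in range(pdfFile.numPages):
    page = pdfFile.getPage(page_no)
    text = page.extractText()  # plain page text; not used below
    pageObject = page.getObject()
    if key in pageObject:
        # Iterate over the page's annotations, not over the page dictionary's keys.
        ann = pageObject[key]
        for a in ann:
            try:
                u = a.getObject()
                # Link annotations keep their target under /A -> /URI.
                if uri in u[ank]:
                    mylist.append(u[ank][uri])
                    print(u[ank][uri])
            except KeyError:
                # Annotation has no /A action (not a URI link); skip it.
                pass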

As I said, it works fine when the anchor text and the link are the same. When they differ, it saves the anchor text. Ideally I would save both (or just the link).

Could this be something helpful? https://stackoverflow.com/a/49614726/3390788 — Frank Alvaro, Mar 21 '22 at 12:44
@FrankAlvaro the first solution seems to pull both the anchor and the URL. The only issue is that it doesn't match them up, so I have no way of knowing which anchor belongs to which URL. I can't match them by page number, as some pages have multiple URLs. — jkierzyk, Mar 21 '22 at 12:54
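
One possible way to pair each anchor with its URL, sketched here under assumptions rather than as a tested solution for this PDF: a link annotation normally carries a /Rect (its clickable area) next to /A, so the words whose boxes fall inside that rectangle can be treated as the anchor text. The sketch below is plain Python and only shows the matching step. The `Link` and `Word` types and the `pair_links_with_text` helper are hypothetical names, the sample coordinates are made up, and the word boxes are assumed to come from some layout-aware extractor (extractText() returns text without positions) that reports them in the same coordinate system as the link rectangles.

```python
from typing import List, NamedTuple, Tuple

# (x0, y0, x1, y1); links and words must share one coordinate system.
Rect = Tuple[float, float, float, float]


class Link(NamedTuple):
    rect: Rect  # clickable area, e.g. the annotation's /Rect
    uri: str    # link target, e.g. annotation["/A"]["/URI"]


class Word(NamedTuple):
    rect: Rect  # bounding box of the word on the page
    text: str


def center_inside(inner: Rect, outer: Rect) -> bool:
    """Return True if the center point of `inner` lies within `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    # Sort the corners so the check works whichever corner is listed first.
    x0, x1 = sorted((outer[0], outer[2]))
    y0, y1 = sorted((outer[1], outer[3]))
    return x0 <= cx <= x1 and y0 <= cy <= y1


def pair_links_with_text(links: List[Link], words: List[Word]) -> List[Tuple[str, str]]:
    """Return one (anchor_text, uri) pair per link on a single page.

    Words are assumed to be in reading order; the anchor text is the
    space-joined run of words whose centers fall inside the link rectangle.
    """
    pairs = []
    for link in links:
        covered = [w.text for w in words if center_inside(w.rect, link.rect)]
        pairs.append((" ".join(covered), link.uri))
    return pairs


# Made-up data for one page that contains two links.
words = [
    Word((10, 700, 40, 712), "See"),
    Word((45, 700, 130, 712), "www.page.com/A"),
    Word((10, 680, 40, 692), "and"),
    Word((45, 680, 130, 692), "www.page.com/B"),
]
links = [
    Link((44, 699, 131, 713), "https://www.page.com/full/path/A"),
    Link((44, 679, 131, 693), "https://www.page.com/full/path/B"),
]

pairs = pair_links_with_text(links, words)
for anchor, uri in pairs:
    print(anchor, "->", uri)
```

Matching on rectangles rather than page numbers keeps several links on the same page apart. If the extractor measures y from the top of the page while /Rect measures it from the bottom, one of the two would need to be flipped using the page height before matching.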

0 Answers