I'm using following code to extract URLs from PDF and it works fine to extract the anchor but does not work when anchor text is different than the URL behind it. For example: 'www.page.com/A' is used as a short url in the text but the actual URL behind it is a longer (full) version.
The code I'm using is:
import urllib.request
import PyPDF2
urllib.request.urlretrieve(url, "remoteFile")
pdfFile = PyPDF2.PdfFileReader("remoteFile", strict=False)
key = "/Annots"
uri = "/URI"
ank = "/A"
mylist = []
for page_no in range(pdfFile.numPages):
page = pdfFile.getPage(page_no)
text = page.extractText()
pageObject = page.getObject()
if key in pageObject.keys():
ann = pageObject.keys()
for a in ann:
try:
u = a.getObject()
if uri in u[ank].keys():
mylist.append(u[ank][uri])
print(u[ank][uri])
except KeyError:
pass
As I said, it works ok if the anchor and the link are the same. If the link is different, it saves the anchor. Ideally I would save both (or just link).