I am using altered code from this post (my code below):
Extract hyperlinks from PDF in Python
I am trying to extract hyperlinks (URLs) from a PDF. The code from the link above works. However, I am using Jupyter QtConsole, and its output buffer does not hold enough rows to show everything the print function writes. I am therefore trying to write each URL as a new row in a pandas DataFrame so I can export it to a CSV and see everything.
When I run the code below without the lines marked with comments toward the bottom, it runs fine, printing each unique URL to the console. When I add the DataFrame lines, the code prints each URL roughly 10 times in QtConsole. The resulting DataFrame stops after the first URL (even though the program keeps running and printing URLs), and it contains that first URL multiple times. I have added comments where I think the problems lie. I'm clearly a bit out of my depth in understanding how to create a new DataFrame row for each URL (which I believe is a dictionary key). I also suspect that using "pages" as the length of my for loop is a problem, but I'm not sure what the loop length should reference instead.
Please help.
import pandas as pd
import PyPDF2

PDFFile = open(r'file\path.pdf', 'rb')
PDF = PyPDF2.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):
    print("Current Page: {}".format(page))
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    if key in pageObject.keys():
        ann = pageObject[key]
        for a in ann:
            try:
                u = a.getObject()
                if uri in u[ank].keys():
                    df = pd.DataFrame(columns=['URL'])  # POSSIBLE PROBLEM AREA
                    for i in range(pages):  # LIKELY PROBLEM AREA
                        df.loc[i] = (u[ank][uri])  # LIKELY PROBLEM AREA
                        print(u[ank][uri])
            except KeyError:
                pass
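
For reference, a minimal sketch of one possible restructuring, not a definitive fix. In the code above the DataFrame is created anew for every matching annotation, and the inner loop writes the same URL into one row per page, which would explain both the repeated printing and the repeated first URL. The sketch instead appends each URL to a plain list and builds the DataFrame once at the end. The helper names collect_uris and urls_to_frame, the sample_pages stand-in data, and the output name urls.csv are all hypothetical; the getObject() call assumes the same older PyPDF2 API used in the post.

```python
def collect_uris(pages, annots_key="/Annots", action_key="/A", uri_key="/URI"):
    """Return every link URI found in the given page objects, in order.

    `pages` is any iterable of dict-like page objects, for example the
    results of PDF.getPage(n) in the PyPDF2 version used in the post.
    """
    urls = []
    for page in pages:
        if annots_key not in page:
            continue  # this page has no annotations at all
        for annot in page[annots_key]:
            # PyPDF2 stores annotations as indirect references; resolve them
            # when the object offers getObject(), otherwise use it as-is.
            obj = annot.getObject() if hasattr(annot, "getObject") else annot
            if action_key in obj and uri_key in obj[action_key]:
                urls.append(str(obj[action_key][uri_key]))
    return urls


def urls_to_frame(urls):
    """Build a one-column DataFrame from the collected URLs in a single step."""
    # Imported here so the collection step itself does not depend on pandas.
    import pandas as pd
    return pd.DataFrame({"URL": urls})


# Stand-in data (plain dicts) that mimics the annotation layout, so the
# collection logic can be tried without a PDF file.
sample_pages = [
    {"/Annots": [{"/A": {"/URI": "https://a.example"}}, {"/A": {"/S": "/GoTo"}}]},
    {},  # a page without annotations
    {"/Annots": [{"/Subtype": "/Text"}, {"/A": {"/URI": "https://b.example"}}]},
]
sample_urls = collect_uris(sample_pages)

# Hypothetical usage with the reader from the post (not executed here):
# PDF = PyPDF2.PdfFileReader(open(r'file\path.pdf', 'rb'))
# page_objects = (PDF.getPage(n) for n in range(PDF.getNumPages()))
# urls_to_frame(collect_uris(page_objects)).to_csv('urls.csv', index=False)
```

Collecting into a list first avoids needing a row index at all, so the question of what the inner loop length should reference goes away: the number of rows is simply the number of URLs found.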