PDF hyperlink extraction and writing to a pandas dataframe

Question

I am using altered code from this post (my code below):

I am trying to extract hyperlinks (URLs) from a PDF. The code from the link above works. However, I am using Jupyter QtConsole, and it does not keep enough rows to show everything the print function outputs. I am therefore trying to write each URL as a new row in a pandas dataframe so I can export it to a CSV and see everything.

When I run the code below without the lines marked with comments toward the bottom, it runs fine and prints each unique URL to the console. When I add the dataframe lines, the code prints each URL about 10 times in QtConsole. The resulting dataframe stops after the first URL (even though the program is still running and printing URLs) and shows that first URL multiple times. I have added comments where I think the problems lie. I'm clearly a bit out of my depth in understanding how to create a new dataframe row for each URL (which I believe is a dictionary key). I also suspect that using "pages" as the length of my for loop is a problem, but I'm not sure what the loop length should reference instead.

Please help.

import pandas as pd 
import PyPDF2
PDFFile = open(r'file\path.pdf','rb')

PDF = PyPDF2.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):
    print("Current Page: {}".format(page))
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    if key in pageObject.keys():
        ann = pageObject[key]
        for a in ann:
            try:
                u = a.getObject()
                if uri in u[ank].keys():
                        df = pd.DataFrame(columns=['URL']) #POSSIBLE PROBLEM AREA
                        for i in range(pages): #LIKELY PROBLEM AREA
                            df.loc[i] = (u[ank][uri]) #LIKELY PROBLEM AREA
                            print(u[ank][uri])
            except KeyError:
                pass

score 0 · Answer 1 · answered Jul 20 '20 at 03:09

I figured out how to fix my own problem. This seems to happen whenever I post to a forum. Perhaps a forum posting is a prerequisite to discovery. Regardless...

All code leading up to what follows is the same.

I created a list (aptly named "mylist") outside the initial for loop. I then append the current "u[ank][uri]" (aka the URL in English) to my list in the nested if where I am printing the URL. At the end, I convert the list into a pandas dataframe. This gives me the results I was hoping for, and I can then write the dataframe to a CSV.

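# Create the list once, before any loop, so it is never reset.
mylist = []

for page in range(pages):
    print("Current Page: {}".format(page))
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    if key in pageObject.keys():
        ann = pageObject[key]
        for a in ann:
            try:
                u = a.getObject()
                if uri in u[ank].keys():
                    # One append per link annotation that carries a URI.
                    mylist.append(u[ank][uri])
                    print(u[ank][uri])
            except KeyError:
                # Annotations without an '/A' entry are skipped.
                pass

# Build the dataframe once, after all pages have been scanned.
df = pd.DataFrame(mylist)
df.to_csv('fileoutput.csv')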
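
For anyone wondering why the original attempt misbehaved: df was re-created inside the innermost if, so every URL started a fresh dataframe, and the inner "for i in range(pages)" loop wrote that one URL into rows 0 to pages-1 and printed it once per pass. Below is a minimal sketch of the collect-then-build pattern on its own. The function name collect_uris and the fake_pages data are hypothetical stand-ins (plain dicts shaped like the PDF objects), not part of PyPDF2, so the sketch runs without a PDF.

```python
def collect_uris(page_objects, annots_key="/Annots", action_key="/A", uri_key="/URI"):
    """Return every link target found in the given page dictionaries, in order."""
    urls = []  # created once, before any loop, so nothing is overwritten
    for page in page_objects:
        # Pages without annotations simply contribute nothing.
        for annot in page.get(annots_key, []):
            action = annot.get(action_key, {})
            if uri_key in action:
                urls.append(action[uri_key])  # exactly one entry per link
    return urls


# Hypothetical stand-in data: plain dicts mimicking resolved PDF objects.
fake_pages = [
    {"/Annots": [{"/A": {"/URI": "https://example.com/a"}}]},
    {},  # a page with no annotations
    {"/Annots": [{"/A": {"/URI": "https://example.com/b"}}, {"/Subtype": "/Text"}]},
]

urls = collect_uris(fake_pages)
print(urls)
```

With the real reader, each annotation would still need a.getObject() first, as in the loop above. The finished list can then be passed to pd.DataFrame(urls, columns=['URL']) if a named column is wanted.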
