I have created a DataFrame of 80,000 links to PDFs, and I have also written code that converts a PDF link into text. What I can't figure out is how to add another column to my DataFrame that holds the text for each link. For example, if a row contains a link like anc.pdf, the cell next to it in the same row should contain the text of that PDF. Here is my code:
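import urllib.request
import PyPDF2
import pandas as pd

def urltopdf(link):
    # Download the PDF at `link`, overwriting the same local file each time.
    req = urllib.request.urlopen(link)
    file = open(r"C:\Shodh by Arthavruksha\CorporateAnnouncements\DailyCA.pdf", 'wb')
    file.write(req.read())
    file.close()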

def PDFtoText(filepath):
    # Read every page of the PDF and join the extracted text into one string.
    myfile = open(filepath, 'rb')
    text = []
    pdf = PyPDF2.PdfFileReader(myfile)
    for p in range(pdf.numPages):
        page = pdf.getPage(p)
        text.append(page.extractText())
    text = ''.join(map(str, text))
    myfile.close()
    return text

allcapex = pd.DataFrame(columns=['Text'])
sourcecol = result['Source']
allcapex.insert(0, "Source", sourcecol)
for i in allcapex['Source']:
    urltopdf(i)
    pdftext = PDFtoText(r"C:\Shodh by Arthavruksha\CorporateAnnouncements\DailyCA.pdf")
    allcapex.loc[len(allcapex.index)] = ['', pdftext]
print(allcapex)
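
To make the goal concrete, here is a minimal sketch of the layout I am after. It is only an illustration under assumptions: `link_to_text` is a hypothetical stand-in for calling `urltopdf` and then `PDFtoText`, and the three links are made-up sample data, so it runs without downloading anything. It fills the `Text` column with `Series.apply`, which is one possible approach and may not be the best one for 80,000 links.

```python
import pandas as pd

def link_to_text(link):
    # Hypothetical stand-in for urltopdf(link) followed by PDFtoText(path).
    # It returns a placeholder string so the sketch needs no network access.
    return "text of " + link

# Assumed shape of `result`: a 'Source' column holding the PDF links.
result = pd.DataFrame({"Source": ["a.pdf", "b.pdf", "c.pdf"]})

allcapex = pd.DataFrame({"Source": result["Source"]})
# apply() calls the function once per link and stores each return value
# in the same row as the link it came from.
allcapex["Text"] = allcapex["Source"].apply(link_to_text)
print(allcapex)
```

With this layout each link and its text share a row, instead of the text being appended as extra rows with an empty `Source`.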