
I have created a DataFrame of 80,000 links to PDFs, and I have also written code that converts a PDF link into text. What I can't work out is how to add another column to my DataFrame that holds the text for each link. For example, if a row has a PDF link such as anc.pdf, the column next to it should hold the text that URL contains. Here is my code:

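import urllib.request
import pandas as pd
import PyPDF2  # the PdfFileReader/numPages/getPage/extractText names below are the pre-3.0 PyPDF2 API


def urltopdf(link):
    # Download the PDF at `link` into one fixed local file (overwritten on every call).
    req = urllib.request.urlopen(link)
    # Raw string so the backslashes in the Windows path are not treated as escapes.
    with open(r"C:\Shodh by Arthavruksha\CorporateAnnouncements\DailyCA.pdf", 'wb') as file:
        file.write(req.read())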


def PDFtoText(filepath):
    myfile = open(filepath, 'rb')
    text = []
    pdf = PyPDF2.PdfFileReader(myfile)
    for p in range(pdf.numPages):
        page = pdf.getPage(p)
        text.append(page.extractText())
    text = ''.join(map(str, text))
    myfile.close()
    return text


allcapex = pd.DataFrame(columns=['Text'])

sourcecol = result['Source']
allcapex.insert(0, "Source", sourcecol)

for i in allcapex['Source']:
    urltopdf(i)
    pdftext = PDFtoText(r"C:\Shodh by Arthavruksha\CorporateAnnouncements\DailyCA.pdf")
    allcapex.loc[len(allcapex.index)] = ['', pdftext]

print(allcapex)
Park

1 Answer


You can build a list of dictionaries first and convert it into a DataFrame at the end:

allcapex_list = []

for i in result['Source']:
    try:
        urltopdf(i)
    except Exception as e:
        print('An error with urltopdf')
        print('Link:', i)
        print('Error:', e)
        continue  # skip an invalid link and move on to the next one

    link = i

    try:
        pdftext = PDFtoText(r"C:\Shodh by Arthavruksha\CorporateAnnouncements\DailyCA.pdf")
    except Exception as e:
        print('An error with PDFtoText')
        print('Link:', link)
        print('Error:', e)
        continue
    allcapex_list.append({'link': link, 'text': pdftext})
allcapex = pd.DataFrame(allcapex_list)

The outcome will be a DataFrame with two columns: link and text.
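The same idea can be sketched in a form that runs offline. This is only an illustration, not the code from the question: `collect_texts` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the `urltopdf` + `PDFtoText` pair so that nothing is downloaded. Failed links are kept in a second list instead of only being printed.

```python
from typing import Callable, Dict, List, Tuple


def collect_texts(
    links: List[str],
    fetch_text: Callable[[str], str],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Pair each link with its extracted text; record failures separately."""
    rows = []      # one {'link', 'text'} dict per link that worked
    failures = []  # one {'link', 'error'} dict per link that raised
    for link in links:
        try:
            text = fetch_text(link)
        except Exception as exc:  # broad on purpose: note the error and move on
            failures.append({'link': link, 'error': str(exc)})
            continue
        rows.append({'link': link, 'text': text})
    return rows, failures


def fake_fetch(link: str) -> str:
    """Hypothetical stand-in for urltopdf + PDFtoText, so the sketch runs offline."""
    if not link.startswith('http'):
        raise ValueError(f'unknown url type: {link!r}')
    return f'text of {link}'


rows, failures = collect_texts(['http://example.com/a.pdf', '-'], fake_fetch)
print(rows)
print(failures)
```

With pandas available, `pd.DataFrame(rows)` should give the two-column frame, and `pd.DataFrame(failures)` gives a table of the links that need a second look. In real use you would pass a function that calls `urltopdf` and then `PDFtoText` in place of `fake_fetch`.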

Raymond Kwok
  • Sir, the code shows an error when I run it. `result['Source']` has all 80,000 links, so I passed `i` to the `urltopdf` function. It's still not working for me. – Jay shankarpure Feb 19 '22 at 13:31
  • Updated the code to include urltopdf(i). My answer is to show you an idea on how to have a column for link in your dataframe. – Raymond Kwok Feb 19 '22 at 13:35
  • Yes Sir, I got it, thanks for your answer. The code still isn't working, though, and it gives me many errors, like: line 70, in <module> urltopdf(i); File "c:\Shodh by Arthavruksha\CorporateAnnouncements\CustomCA.py", line 15, in urltopdf: req = urllib.request.urlopen(link) – Jay shankarpure Feb 19 '22 at 13:39
  • What is the error message? – Raymond Kwok Feb 19 '22 at 13:42
  • Not one, it's giving me many errors, like these: line 70, in <module> urltopdf(i); File "c:\Shodh by Arthavruksha\CorporateAnnouncements\CustomCA.py", line 15, in urltopdf: req = urllib.request.urlopen(link); line 216, in urlopen: return opener.open(url, data, timeout); line 503, in open: req = Request(fullurl, data); line 322, in __init__: self.full_url = url; in full_url: self._parse(); line 377, in _parse: raise ValueError("unknown url type: %r" % self.full_url); ValueError: unknown url type: '-' – Jay shankarpure Feb 19 '22 at 13:45
  • It seems to suggest that at least one link you pass into `urllib.request.urlopen` is **NOT** a valid link. I updated the code to use `try` and print the link whenever an error is raised. You can take a look at that link (or those links) and see whether they are valid. – Raymond Kwok Feb 19 '22 at 13:58
  • Hi Sir, I ran the code and it printed out the links. I checked them myself and they were in fact working fine. What should I do now? It gave me these messages: unknown url type: '-' and PdfReadWarning: Superfluous whitespace found in object header b'86' b'0' [pdf.py:1665] – Jay shankarpure Feb 19 '22 at 14:15
  • It runs without stopping **only because** the `try` block catches the error and lets your code continue by skipping that link. The links that are printed out are the ones you actually missed. I suppose this should bother you? If so, please share one or two of those links here and let's see if we can find any problem with them. – Raymond Kwok Feb 19 '22 at 14:26
  • I updated the code so that the `try` block catches errors only for `urltopdf(...)`. Please try this. – Raymond Kwok Feb 19 '22 at 14:29
  • This is the URL, sir: https://archives.nseindia.com/corporate/KHAICHEM_31122021184441_NSECR.pdf. The error is PyPDF2.utils.PdfReadError: Expected object ID (4 0) does not match actual (3 0); xref table not zero-indexed. – Jay shankarpure Feb 19 '22 at 15:01
  • It's strange that the link you shared would cause this error: `ValueError: unknown url type: '-'` – Raymond Kwok Feb 19 '22 at 15:08
  • Correct, that's what I don't understand. I looked into this ValueError too, but didn't find anything. – Jay shankarpure Feb 19 '22 at 15:09
  • Sorry I made a mistake in my code, please try the updated version. – Raymond Kwok Feb 19 '22 at 15:11
  • Hi Sir, it's still giving me the same errors: PyPDF2.utils.PdfReadError: Expected object ID (4 0) does not match actual (3 0); xref table not zero-indexed. and Link: - Error: unknown url type: '-' – Jay shankarpure Feb 19 '22 at 15:29
  • Very well. From this `Link: - Error: unknown url type: '-'` we can tell that the error appears because the link is just a hyphen, which is not a valid URL. – Raymond Kwok Feb 19 '22 at 15:32
  • Okay sir, how should I delete those from my dataframe? – Jay shankarpure Feb 19 '22 at 15:34
  • Updated my code again. Please note what I have added. `continue` ends the current step of the loop early, so if we have an invalid link, the loop moves to the next link immediately without going on to `PDFtoText`. It prints the invalid link so you can check it manually. – Raymond Kwok Feb 19 '22 at 15:37
  • I also added a `try` around `PDFtoText` so that it will `continue` if an error occurs. As for `PdfReadError: Expected object ID (4 0) does ...` that you mentioned, please check out this answer https://stackoverflow.com/a/59987006/11065465 – Raymond Kwok Feb 19 '22 at 15:39
  • The code is very slow, sir. It has been running for the last 10 minutes, so I can't say whether it works or not. – Jay shankarpure Feb 19 '22 at 15:52
  • It's slow because you have as many as 80000 links. However, if an error occurs during the process, you should see new messages printed out. Then you know which link does not work and the corresponding error message. You can then collect the links that *you think should have worked* and try **only** those after all 80000 links are processed. If it's just slow with no new error messages, that is a good thing! – Raymond Kwok Feb 19 '22 at 15:58
  • Downloading files from the internet can be slow. Downloading 80000 files from the internet can be very slow. – Raymond Kwok Feb 19 '22 at 16:00
  • Link: https://archives.nseindia.com/corporate/ASTERDM_27122021194152_InvtmeetDec272021.pdf Error: File has not been decrypted; An error with PDFtoText, Link: https://archives.nseindia.com/corporate/NBIFIN_27122021191923_BriefProfile_27122021191904.zip Error: EOF marker not found; An error with PDFtoText, Link: https://archives.nseindia.com/corporate/GOKEX_27122021191837_LD271221I.pdf Error: Multiple definitions in dictionary at byte 0x773d1 for key /Info; An error with PDFtoText, Link: https://archives.nseindia.com/corporate/GOKEX_27122021191141_LD271221C.pdf – Jay shankarpure Feb 19 '22 at 16:09
  • Exactly. Now you know which links have errors and why. From your messages here, there are at least 3 different errors. The next step would be to google each error message and see if a quick fix is available. I am sorry that I am not an expert in PDF, so I cannot help you here. I also note that one of the four links points to a `zip` file instead of a `pdf`. If you find that there are too many links with errors, you may want to save them programmatically instead of copying and pasting. – Raymond Kwok Feb 19 '22 at 16:13
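Two of the problems that came up in the comments, the `'-'` placeholder and the link to a `.zip` file, could be filtered out before any download is attempted. The sketch below is one possible way to do that, not something from the thread: `looks_like_pdf_url` is a hypothetical helper, and checking for a `.pdf` extension is only a heuristic, since a server may serve PDFs from other paths.

```python
from urllib.parse import urlparse


def looks_like_pdf_url(link) -> bool:
    """Return True for http(s) links whose path ends in .pdf."""
    if not isinstance(link, str):
        return False  # e.g. NaN from an empty cell
    parts = urlparse(link.strip())
    if parts.scheme not in ('http', 'https') or not parts.netloc:
        return False  # catches placeholders such as '-'
    return parts.path.lower().endswith('.pdf')


links = [
    'https://archives.nseindia.com/corporate/KHAICHEM_31122021184441_NSECR.pdf',
    'https://archives.nseindia.com/corporate/NBIFIN_27122021191923_BriefProfile_27122021191904.zip',
    '-',
]
usable = [link for link in links if looks_like_pdf_url(link)]
print(usable)
```

On the DataFrame from the question, something like `result[result['Source'].map(looks_like_pdf_url)]` should keep only the rows that pass this check. It does not help with PDFs that are encrypted or malformed; those still need the `try` around `PDFtoText`.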