Download thousands of PDFs from links listed in a CSV with the Python Requests module

WarLock

1 Answer

I would suggest using Requests; you can then do something along these lines:

import os
import csv
import requests

write_path = 'folder_name'  # ASSUMING THAT FOLDER EXISTS!

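with open('x.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile)
    for link in spamreader:
        if not link:
            continue  # Skip empty CSV rows instead of failing on link[0]
        print('-'*72)
        # Use the last URL segment as the local file name
        pdf_file = link[0].split('/')[-1]
        try:
            # Try to request PDF from URL
            print('TRYING {}...'.format(link[0]))
            a = requests.get(link[0], stream=True)
            a.raise_for_status()  # Treat HTTP error statuses (404, 500, ...) as failures
            # Open the output file only after the request succeeded,
            # so a failed request does not leave an empty file behind
            with open(os.path.join(write_path, pdf_file), 'wb') as pdf:
                for block in a.iter_content(512):
                    if block:
                        pdf.write(block)
            print('OK.')
        except requests.exceptions.RequestException as e:  # This will catch ONLY Requests exceptions
            print('REQUESTS ERROR:')
            print(e)  # This should tell you more details about the error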
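
The snippet above assumes that the output folder already exists and takes the file name from the last URL segment. A minimal sketch of how both steps could be made a little more robust on Python 3 follows. The helper names `ensure_folder` and `filename_from_url`, and the `download.pdf` fallback name, are hypothetical and not part of the original answer:

```python
import os
from urllib.parse import unquote, urlsplit


def ensure_folder(path):
    """Create the output folder if it is missing (hypothetical helper)."""
    os.makedirs(path, exist_ok=True)
    return path


def filename_from_url(url, default='download.pdf'):
    """Derive a local file name from the URL path, ignoring any query string.

    Falls back to `default` (an assumed placeholder name) when the URL
    path ends with a slash and therefore has no file name.
    """
    last_segment = urlsplit(url).path.rsplit('/', 1)[-1]
    # Decode %20 and similar escapes, then drop any directory part
    name = os.path.basename(unquote(last_segment))
    return name or default
```

With these, `write_path = ensure_folder('folder_name')` would replace the manual folder setup, and `filename_from_url(link[0])` would replace the `split('/')` expression.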

The test content of x.csv is:

https://www.pabanker.com/media/3228/qtr1pabanker_final-web.pdf
http://www.pdf995.com/samples/pdf.pdf
https://tcd.blackboard.com/webapps/dur-browserCheck-BBLEARN/samples/sample.pdf
http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf

Sample output:

$ python test.py
------------------------------------------------------------------------
TRYING https://www.pabanker.com/media/3228/qtr1pabanker_final-web.pdf...
REQUESTS ERROR:
("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer'))
------------------------------------------------------------------------
TRYING http://www.pdf995.com/samples/pdf.pdf...
OK.
------------------------------------------------------------------------
TRYING https://tcd.blackboard.com/webapps/dur-browserCheck-BBLEARN/samples/sample.pdf...
OK.
------------------------------------------------------------------------
TRYING http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf...
OK.
errata
  • I tried to read the CSV using urllib, but I am unable to read the file. It says PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736] and the import requests line is shown in black. Why? – WarLock May 12 '17 at 13:10
  • But this is a `PdfRead` library error... Can you post the code snippet which results in this error? – errata May 12 '17 at 13:14
  • here's the screen shot of code and error http://imgur.com/a/kQaXZ – WarLock May 12 '17 at 13:26
  • I just updated my answer, check if that's what you need... – errata May 12 '17 at 13:27
  • No, same error: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736] Yes, I have specific PDF links in the CSV file. – WarLock May 12 '17 at 13:35
  • Requests is greyed out because you are importing it but not using it anywhere in your code. Seeing it underlined in red, I would say that maybe you haven't even installed it? If you are using Requests, you don't need urllib. As for the error you are getting, that is definitely the result of some other library (PdfRead or PyPDF or similar)... Do you need those libraries at all? – errata May 12 '17 at 13:42
  • Not right now, my main aim is only to download the PDFs. – WarLock May 12 '17 at 13:47
  • I tried the code I provided with a few links to PDF files in a CSV and it works fine (the files are downloaded). Make sure that you have installed Requests, and uninstall PdfReader for now... – errata May 12 '17 at 13:49
  • Is there any issue with 2.7 or 3? – WarLock May 12 '17 at 13:51
  • I'm on Mac, and my CSV file has just a few lines with links to random PDF files on the web... Updated my post with more info... – errata May 12 '17 at 13:53
  • My code snippet should work for both Python versions, I tried it out with 2.7.13 and 3.4.5. – errata May 12 '17 at 13:58
  • It works, thank you so much. – WarLock May 12 '17 at 14:01
  • How do I change the directory where the files are saved? – WarLock May 12 '17 at 14:02
  • Don't contact me, post a question here and someone will answer pretty fast ;) For changing a folder, check my updated answer... Also, please accept the answer if it is correct for your use case! – errata May 12 '17 at 14:11
  • http://imgur.com/a/0CvHj Here's the error. I don't know how to fix it. – WarLock May 15 '17 at 07:21
  • @WarLock The image you posted seems like a new error? Please accept the answer if it solves your original problem. If you have further issues, ask a new question rather than updating this one! – errata May 15 '17 at 10:30
  • When one link is not working, it stops the program. Please update it so that it skips that link and moves on to the next one. Here's a screenshot of it: imgur.com/a/S9Xnu @errata – WarLock May 16 '17 at 09:17
  • @WarLock You can use a `try` statement to "catch" an error in Python, check out the [relevant documentation](https://docs.python.org/3.4/tutorial/errors.html). I updated my answer, now the program will continue in case of a problem with Requests. In case of any other error, you should catch it with another `except` block. I would suggest reading [this post](http://stackoverflow.com/questions/16511337/correct-way-to-try-except-using-python-requests-module) to get a better idea about Requests exceptions. – errata May 16 '17 at 10:29
  • @WarLock Please, if this answer solves your problem, accept it by clicking the checkmark next to it. If you have other problems, **ask a new question**, don't update the current one with new issues! Stack Overflow is a simple question-and-answer website, not a forum or a place for chatting :) Feel free to read a little bit about [Asking on Stack Overflow](https://stackoverflow.com/help/asking), it will improve your skills in searching for answers dramatically! ;) – errata May 16 '17 at 10:36
  • I was going to ask, but somehow I can't post another question; it's a temporary ban. I understand that, but I have no other option @errata. – WarLock May 16 '17 at 12:10
  • http://stackoverflow.com/questions/44082209/how-can-i-change-prefix-of-downloading-file-which-is-define-before-every-link-i. @errata – WarLock May 22 '17 at 11:18
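
The skip-and-continue behaviour discussed in the comments above can be sketched as a small loop that records failures instead of stopping. This is only an illustration under stated assumptions: `download_all` and its `fetch` and `catch` parameters are hypothetical names, and `fetch` stands in for whatever function actually downloads one URL. The default of catching `OSError` relies on `requests.exceptions.RequestException` being a subclass of `IOError`, which is the same class as `OSError` on Python 3:

```python
def download_all(rows, fetch, catch=(OSError,)):
    """Call fetch(url) for every non-empty CSV row and keep going on errors.

    rows  -- iterable of CSV rows, with the URL in the first column
    fetch -- hypothetical callable that downloads a single URL
    catch -- exception types treated as "skip this link"; OSError covers
             Requests exceptions and file-write errors on Python 3
    Returns (ok, failed): downloaded URLs and (url, error message) pairs.
    """
    ok, failed = [], []
    for row in rows:
        # Skip blank lines and rows whose first column is empty
        if not row or not row[0].strip():
            continue
        url = row[0].strip()
        try:
            fetch(url)
            ok.append(url)
        except catch as exc:
            # Remember the failure and move on to the next link
            failed.append((url, str(exc)))
    return ok, failed
```

Passing `csv.reader(csvfile)` as `rows` would let the `failed` list be printed or written out at the end, so broken links can be retried later without rerunning the whole CSV.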