Download PDFs from links listed in a CSV file with the Python Requests module

Question

Download thousands of PDFs from links listed in a CSV file with the Python Requests module.

Are you able to add an external package to your project, or do you have to use `urllib`? – errata, May 12 '17 at 11:08

errata · Accepted Answer · 2017-05-16T10:29:17.340

I would suggest using Requests; then you can do something along the lines of:

import os
import csv
import requests

write_path = 'folder_name'  # ASSUMING THAT FOLDER EXISTS!

with open('x.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile)
    for link in spamreader:
        if not link:
            continue  # Skip blank lines in the CSV
        print('-'*72)
        url = link[0]
        pdf_file = url.split('/')[-1]  # File name = text after the last '/'
        try:
            # Try to request PDF from URL
            print('TRYING {}...'.format(url))
            a = requests.get(url, stream=True, timeout=30)  # Don't hang forever on a dead server
            a.raise_for_status()  # Treat HTTP errors (404, 500, ...) as failures
            # Open the output file only after the request succeeded,
            # so a failed link does not leave an empty file behind
            with open(os.path.join(write_path, pdf_file), 'wb') as pdf:
                for block in a.iter_content(512):
                    if not block:
                        break

                    pdf.write(block)
            print('OK.')
        except requests.exceptions.RequestException as e:  # This will catch ONLY Requests exceptions
            print('REQUESTS ERROR:')
            print(e)  # This should tell you more details about the error
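
The snippet takes the file name from whatever follows the last '/' in the link and assumes the output folder already exists. Below is a minimal sketch, assuming Python 3, of one way to make both steps a little more forgiving: it ignores a query string when picking the file name and creates the folder if it is missing. The helper names `filename_from_url` and `ensure_folder` and the `download.pdf` fallback are hypothetical, not part of Requests or of the answer's code.

```python
import os
from urllib.parse import urlsplit, unquote


def filename_from_url(url, default='download.pdf'):
    """Return the last path segment of a URL, ignoring any query string.

    Falls back to `default` (an arbitrary placeholder name) when the URL
    path does not end in a file name.
    """
    name = os.path.basename(unquote(urlsplit(url).path))
    return name or default


def ensure_folder(path):
    """Create the output folder if it does not exist yet and return it."""
    os.makedirs(path, exist_ok=True)
    return path
```

In the loop above, `pdf_file = filename_from_url(link[0])` and `write_path = ensure_folder('folder_name')` would be drop-in replacements for the corresponding lines.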

The content of x.csv used for testing is:

https://www.pabanker.com/media/3228/qtr1pabanker_final-web.pdf
http://www.pdf995.com/samples/pdf.pdf
https://tcd.blackboard.com/webapps/dur-browserCheck-BBLEARN/samples/sample.pdf
http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf

Sample output:

$ python test.py
------------------------------------------------------------------------
TRYING https://www.pabanker.com/media/3228/qtr1pabanker_final-web.pdf...
REQUESTS ERROR:
("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer'))
------------------------------------------------------------------------
TRYING http://www.pdf995.com/samples/pdf.pdf...
OK.
------------------------------------------------------------------------
TRYING https://tcd.blackboard.com/webapps/dur-browserCheck-BBLEARN/samples/sample.pdf...
OK.
------------------------------------------------------------------------
TRYING http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf...
OK.
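
As the first link in the sample output shows, a connection can break in the middle of a download, which may leave a truncated file on disk. Here is a small sketch, assuming Python 3, of one way to avoid that: write to a temporary name and rename it only after every block has arrived. The function name `save_chunks` and the `.part` suffix are assumptions made for illustration.

```python
import os


def save_chunks(chunks, dest_path):
    """Write an iterable of byte blocks to dest_path.

    The data goes to a temporary '.part' file first; that file is renamed
    to dest_path only when all blocks were written, and it is deleted if
    anything goes wrong, so no partial PDF is left behind.
    """
    tmp_path = dest_path + '.part'
    try:
        with open(tmp_path, 'wb') as out:
            for chunk in chunks:
                if chunk:  # Skip empty blocks
                    out.write(chunk)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise  # Let the caller's except block report the error
    os.replace(tmp_path, dest_path)
```

Inside the `try` block of the answer's loop, this could be called as `save_chunks(a.iter_content(512), os.path.join(write_path, pdf_file))`; an error raised while streaming is re-raised, so the existing `except` clause still sees it.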

edited May 16 '17 at 10:29

answered May 12 '17 at 11:19

errata

I tried to read the CSV using urllib, but I was unable to read the file. It says PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736], and `import request` is black. Why? – WarLock May 12 '17 at 13:10
But this is a `PdfRead` library error... Can you post the code snippet that results in this error? – errata May 12 '17 at 13:14
Here's a screenshot of the code and the error: http://imgur.com/a/kQaXZ – WarLock May 12 '17 at 13:26
I just updated my answer, check if that's what you need... – errata May 12 '17 at 13:27
No, same error: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]. Yes, I have specific PDF links in the CSV file. – WarLock May 12 '17 at 13:35
Requests is greyed out because you are importing it but not using it anywhere in your code. Seeing it underlined in red, I would say that maybe you didn't even install it? If you are using Requests, you don't need urllib. And about the error you are getting: that is definitely the result of some other library (PdfRead or PyPDF or similar)... Do you need those libraries at all? – errata May 12 '17 at 13:42
Not right now; my main aim is only to download the PDFs. – WarLock May 12 '17 at 13:47
I tried the code I provided with a few links to PDF files in a CSV and it works fine (the files are downloaded). Make sure that you have installed Requests, and uninstall PdfReader for now... – errata May 12 '17 at 13:49
Is there any issue with Python 2.7 or 3? – WarLock May 12 '17 at 13:51
I'm on a Mac, and my CSV file has just a few lines linking to random PDF files on the web... Updated my post with more info... – errata May 12 '17 at 13:53
My code snippet should work for both Python versions, I tried it out with 2.7.13 and 3.4.5. – errata May 12 '17 at 13:58
It works, thank you so much. – WarLock May 12 '17 at 14:01
How do I change the directory where the files are saved? – WarLock May 12 '17 at 14:02
Don't contact me, post a question here and someone will answer pretty fast ;) For changing a folder, check my updated answer... Also, please accept the answer if it is correct for your use case! – errata May 12 '17 at 14:11
Here's the error, I don't know how to fix it: http://imgur.com/a/0CvHj – WarLock May 15 '17 at 07:21
@WarLock The image you posted seems to show a new error? Please accept the answer if it solves your original problem. If you have further issues, ask a new question rather than updating this one! – errata May 15 '17 at 10:30
When one link is not working, it stops the program. Please update it so that it skips that link and moves on to the next one. Here's a screenshot of it: imgur.com/a/S9Xnu @errata – WarLock May 16 '17 at 09:17
@WarLock You can use a `try` statement to "catch" an error in Python; check out the [relevant documentation](https://docs.python.org/3.4/tutorial/errors.html). I updated my answer, so now the program will continue in case of a problem with Requests. Any other kind of error should be caught with another `except` block. I would suggest reading [this post](http://stackoverflow.com/questions/16511337/correct-way-to-try-except-using-python-requests-module) to get a better idea about Requests exceptions. – errata May 16 '17 at 10:29
@WarLock Please, if this answer solves your problem, accept it by clicking the checkmark next to it. If you have other problems, **ask a new question**, don't update the current one with new issues! Stack Overflow is a simple question-answer website, not a forum or a place for chatting :) Feel free to read a little bit about [Asking on Stack Overflow](https://stackoverflow.com/help/asking), it will improve your skills in searching for answers dramatically! ;) – errata May 16 '17 at 10:36
I was going to ask, but somehow I can't post another question; it's a temporary ban. I understand that, but I have no other option @errata. – WarLock May 16 '17 at 12:10
http://stackoverflow.com/questions/44082209/how-can-i-change-prefix-of-downloading-file-which-is-define-before-every-link-i. @errata – WarLock May 22 '17 at 11:18
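
The advice in the comments about catching other errors with another `except` block could look roughly like the sketch below. It assumes Python 3 and a hypothetical `fetch` callable that downloads a single URL; `download_all` and the 'io'/'other' labels are made up for illustration. The first clause relies on the fact that `requests.exceptions.RequestException` derives from `IOError` (an alias of `OSError` in Python 3), so it covers Requests failures as well as file-system problems.

```python
def download_all(urls, fetch):
    """Call fetch(url) for every URL and keep going when one of them fails.

    Returns a list of (url, kind, error) tuples for the links that failed.
    """
    failed = []
    for url in urls:
        try:
            fetch(url)
        except OSError as e:
            # Network and file-system problems (connection reset, missing
            # folder, disk full, ...). Requests exceptions end up here too.
            print('I/O ERROR for {}: {}'.format(url, e))
            failed.append((url, 'io', e))
        except Exception as e:
            # Anything unexpected, e.g. a malformed CSV row or URL.
            print('UNEXPECTED ERROR for {}: {}'.format(url, e))
            failed.append((url, 'other', e))
    return failed
```

Returning the failures instead of only printing them makes it easy to retry just those links later.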
