Download thousands of PDFs from links listed in a CSV with the Python Requests module
-2
-
Are you able to add external package to your project or you have to use `urllib`? – errata May 12 '17 at 11:08
1 Answer
3
I would suggest using Requests; then you can do something along these lines:
import os
import csv
import requests

write_path = 'folder_name'  # ASSUMING THAT FOLDER EXISTS!

with open('x.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile)
    for link in spamreader:
        print('-'*72)
        pdf_file = link[0].split('/')[-1]
        with open(os.path.join(write_path, pdf_file), 'wb') as pdf:
            try:
                # Try to request PDF from URL
                print('TRYING {}...'.format(link[0]))
                a = requests.get(link[0], stream=True)
                for block in a.iter_content(512):
                    if not block:
                        break
                    pdf.write(block)
                print('OK.')
            except requests.exceptions.RequestException as e:  # This will catch ONLY Requests exceptions
                print('REQUESTS ERROR:')
                print(e)  # This should tell you more details about the error
Test content of x.csv:
https://www.pabanker.com/media/3228/qtr1pabanker_final-web.pdf
http://www.pdf995.com/samples/pdf.pdf
https://tcd.blackboard.com/webapps/dur-browserCheck-BBLEARN/samples/sample.pdf
http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf
Sample output:
$ python test.py
------------------------------------------------------------------------
TRYING https://www.pabanker.com/media/3228/qtr1pabanker_final-web.pdf...
REQUESTS ERROR:
("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer'))
------------------------------------------------------------------------
TRYING http://www.pdf995.com/samples/pdf.pdf...
OK.
------------------------------------------------------------------------
TRYING https://tcd.blackboard.com/webapps/dur-browserCheck-BBLEARN/samples/sample.pdf...
OK.
------------------------------------------------------------------------
TRYING http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf...
OK.
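
One thing the sample output above hints at: the first link failed, yet the script had already opened its output file, so an empty or partial PDF is left behind. A possible variation, as a sketch only, sends the request first and moves the file into place only after a complete download. The function name `download_pdfs`, the `get` parameter, the `.part` suffix and the 30-second timeout are all assumptions made for illustration, not part of the original answer. It assumes Python 3 and a Requests version recent enough that the response can be used in a `with` statement.

```python
import csv
import os

import requests


def download_pdfs(csv_path, write_path, get=requests.get, timeout=30):
    """Download every URL in the first CSV column; return the URLs that failed."""
    failed = []
    with open(csv_path, newline='') as csvfile:
        for row in csv.reader(csvfile):
            if not row or not row[0].strip():
                continue  # skip blank lines instead of raising IndexError
            url = row[0].strip()
            target = os.path.join(write_path, url.split('/')[-1])
            partial = target + '.part'  # hypothetical suffix for unfinished files
            try:
                # Request first, so a dead link never creates an empty file.
                with get(url, stream=True, timeout=timeout) as response:
                    response.raise_for_status()  # treat 4xx/5xx as failures
                    with open(partial, 'wb') as pdf:
                        for block in response.iter_content(8192):
                            pdf.write(block)
                os.replace(partial, target)  # publish only complete downloads
            except requests.exceptions.RequestException as e:
                print('REQUESTS ERROR for {}: {}'.format(url, e))
                failed.append(url)
                if os.path.exists(partial):
                    os.remove(partial)  # drop the half-written file
    return failed
```

Calling `download_pdfs('x.csv', 'folder_name')` would then return the list of links that could not be fetched, which is handy for a retry pass when the CSV holds thousands of rows.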

errata
I tried to read the CSV using urllib, but I am unable to read the file. It says `PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]`, and `import requests` is greyed out. Why? – WarLock May 12 '17 at 13:10
-
But this is a `PdfRead` library error... Can you post the code snippet that produces this error? – errata May 12 '17 at 13:14
-
No, same error: `Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]`. Yes, I have specific PDF links in the CSV file. – WarLock May 12 '17 at 13:35
-
Requests is greyed out because you are importing it but not using it anywhere in your code. Seeing it underlined in red, I would say that you may not even have installed it. If you are using Requests, you don't need urllib. As for the error you are getting, that is definitely the result of some other library (PdfRead or PyPDF or similar)... Do you need those libraries at all? – errata May 12 '17 at 13:42
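
A quick way to confirm whether Requests is actually installed for the interpreter running the script is sketched below. The helper name `is_installed` is made up for illustration, and the snippet assumes Python 3.4 or newer, where `importlib.util.find_spec` is available.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed('requests'):
    print('Requests is missing; install it with: python -m pip install requests')
```

If the message is printed even though the package was installed, the IDE is probably using a different interpreter than the one `pip` installed into.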
-
I tried the code I provided with a few links to PDF files in the CSV and it works fine (the files are downloaded). Make sure you have installed Requests, and uninstall PdfReader for now... – errata May 12 '17 at 13:49
-
I'm on a Mac, and my CSV file has just a few lines with links to random PDF files on the web... Updated my post with more info... – errata May 12 '17 at 13:53
-
My code snippet should work with both Python 2 and Python 3; I tried it out with 2.7.13 and 3.4.5. – errata May 12 '17 at 13:58
-
Don't contact me; post a question here and someone will answer pretty fast ;) For changing the folder, check my updated answer... Also, please accept the answer if it is correct for your use case! – errata May 12 '17 at 14:11
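
On the folder question: the answer's code assumes that `write_path` already exists. A minimal sketch of one way to lift that assumption, and to derive the file name from the URL path so that a trailing `?query` does not end up in the name, could look like this. The helper `target_path` and the fallback name `download.pdf` are hypothetical, and `exist_ok=True` requires Python 3.

```python
import os
from urllib.parse import unquote, urlsplit


def target_path(url, write_path, default_name='download.pdf'):
    """Build the local path for a URL, creating write_path if it is missing."""
    os.makedirs(write_path, exist_ok=True)  # no error if the folder already exists
    # Use only the URL's path component, so '?query' and '#fragment' are ignored.
    name = os.path.basename(unquote(urlsplit(url).path))
    return os.path.join(write_path, name or default_name)
```

In the answer's loop this would replace the `pdf_file = ...` line and the `os.path.join(write_path, pdf_file)` call with a single `target_path(link[0], write_path)`.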
-
@WarLock The image you posted looks like a new error? Please accept the answer if it solves your original problem. If you have further issues, ask a new question rather than updating this one! – errata May 15 '17 at 10:30
-
When one link is not working, it stops the program. Please update the code so that it skips that link and moves on to the next one. Here's a screenshot: imgur.com/a/S9Xnu @errata – WarLock May 16 '17 at 09:17
-
@WarLock You can use a `try` statement to "catch" an error in Python; check out the [relevant documentation](https://docs.python.org/3.4/tutorial/errors.html). I updated my answer, so the program will now continue if there is a problem with Requests. Any other kind of error should be caught with another `except` block. I would suggest reading [this post](http://stackoverflow.com/questions/16511337/correct-way-to-try-except-using-python-requests-module) to get a better idea about Requests exceptions. – errata May 16 '17 at 10:29
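
To illustrate the point about a second `except` block, here is a hedged sketch. The function `try_each` and its `download` argument are invented for this example; `download` stands for whatever code fetches and saves one link. Since `requests.exceptions.RequestException` derives from `IOError` (an alias of `OSError` in Python 3), the Requests branch has to come first, otherwise the `OSError` branch would swallow network errors too.

```python
import requests


def try_each(links, download):
    """Call download(link) for every link; return a list of (link, error) pairs."""
    problems = []
    for link in links:
        try:
            download(link)
        except requests.exceptions.RequestException as e:
            # Network-level trouble: timeouts, resets, bad HTTP status, ...
            problems.append((link, 'request failed: {}'.format(e)))
        except OSError as e:
            # Local trouble: missing folder, no permission, disk full, ...
            problems.append((link, 'could not save: {}'.format(e)))
    return problems
```

Either way the loop moves on to the next link, and the returned list records which links need another look.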
-
@WarLock Please, if this answer solves your problem, accept it by clicking the checkmark next to it. If you have other problems, **ask a new question**; don't update the current one with new issues! Stack Overflow is a simple question-and-answer website, not a forum or a place for chatting :) Feel free to read a little bit about [Asking on Stack Overflow](https://stackoverflow.com/help/asking); it will improve your skills in searching for answers dramatically! ;) – errata May 16 '17 at 10:36
-
I was going to ask, but somehow I can't post another question; it's a temporary ban. I understand that, but I have no other option, @errata. – WarLock May 16 '17 at 12:10
-
http://stackoverflow.com/questions/44082209/how-can-i-change-prefix-of-downloading-file-which-is-define-before-every-link-i. @errata – WarLock May 22 '17 at 11:18