How to download pdf files using Python?

Question

I was looking for a way to download PDF files in Python, and I saw answers on other questions recommending the urllib module. I tried to download a PDF file with it, but when I try to open the downloaded file, a message says that the file cannot be opened.

error message

This is the code I used:

import urllib
urllib.urlretrieve("http://papers.gceguide.com/A%20Levels/Mathematics%20(9709)/9709_s11_qp_42.pdf", "9709_s11_qp_42.pdf")

What am I doing wrong? Also, the file automatically saves to the directory my python file is in. How do I change the location to which it gets saved?

Edit: I tried again with the link to a sample PDF, http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf

The code is working with this link, so why won't it work for the other one?

You can use `requests` for this task: http://stackoverflow.com/questions/34503412/download-and-save-pdf-file-with-python-requests-module — Kshitij Saraogi, May 10 '17 at 12:10
@DavidZemens I won't call it a duplicate. The OP is concerned about his solution not working rather than finding a different one. — Kshitij Saraogi, May 10 '17 at 12:13
When I go to that url I first get a captcha (by cloudflare) to prove that I'm not a robot and only then can access the pdf. Also cloudflare sites often restrict access based on user agent. If you open the file in a text editor you'll probably find html there instead of a pdf. — mata, May 10 '17 at 12:17
You didn't actually download a PDF from that URL - you downloaded the CAPTCHA form needed to access the PDF. — jasonharper, May 10 '17 at 12:18
You'd probably need to complete the captcha in a browser, take the cookies that were set and user agent from the browser and use those in your request. That may work for a while, but you may be presented with a new captcha after some time. — mata, May 10 '17 at 12:45
If you use the above mentioned [`requests`](http://python-requests.org) module, sending [cookies](http://docs.python-requests.org/en/master/user/quickstart/#cookies) and a custom [user agent](http://docs.python-requests.org/en/master/user/quickstart/#custom-headers) should be easy. Where to find them depends on your browser. — mata, May 10 '17 at 12:55
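
As a rough sketch of the suggestion in the comments above, the snippet below attaches a browser-like User-Agent and cookies to a request. It uses the standard library's `urllib.request` (where `urlretrieve` and `urlopen` live in Python 3) rather than `requests`. The names `build_request` and `download` are hypothetical helpers, and the user agent string and cookies are assumptions you would copy from your own browser after completing the CAPTCHA. There is no guarantee the site accepts such a request, and the CAPTCHA may come back later.

```python
import urllib.request


def build_request(url, user_agent, cookies):
    """Build a GET request carrying a User-Agent header and cookies.

    `cookies` is a dict of cookie name -> value copied from the browser.
    """
    headers = {"User-Agent": user_agent}
    # Join the cookies into a single Cookie header: "name1=value1; name2=value2"
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    if cookie_header:
        headers["Cookie"] = cookie_header
    return urllib.request.Request(url, headers=headers)


def download(url, destination, user_agent, cookies):
    """Send the request and write the response body to `destination`."""
    request = build_request(url, user_agent, cookies)
    with urllib.request.urlopen(request, timeout=30) as response:
        data = response.read()
    with open(destination, "wb") as f:
        f.write(data)
```

A call would look like `download(url, "9709_s11_qp_42.pdf", user_agent, cookies)`, with `user_agent` and `cookies` taken from the browser session in which the CAPTCHA was solved.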

score 13 · Answer 1 · edited Aug 14 '17 at 10:26

Try this. It works.

import requests

url = 'https://pdfs.semanticscholar.org/c029/baf196f33050ceea9ecbf90f054fd5654277.pdf'
r = requests.get(url, stream=True)

# Write the response body to the chosen path in binary mode.
with open('C:/Users/MICRO HARD/myfile.pdf', 'wb') as f:
    f.write(r.content)

answered Aug 14 '17 at 08:40 by Fensa Saj, edited Aug 14 '17 at 10:26 by waka
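
On the question of where the file gets saved: the path passed to `open` (or as the second argument to `urlretrieve`) decides the location, and a bare file name is resolved against the current working directory. Below is a minimal sketch, assuming you want to keep the file name from the URL and pick the directory yourself. `destination_for` and `save_bytes` are hypothetical helper names, not part of requests or urllib.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def destination_for(url, directory):
    """Return directory/<file name taken from the last segment of the URL path>."""
    # unquote turns escapes such as %20 back into the characters they stand for
    name = Path(unquote(urlsplit(url).path)).name
    if not name:
        raise ValueError("URL path does not end in a file name")
    return Path(directory) / name


def save_bytes(data, path):
    """Write `data` to `path`, creating missing parent directories first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

With a response `r` from `requests.get(url)`, this could be used as `save_bytes(r.content, destination_for(url, some_directory))`, where `some_directory` is any folder you choose.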

When I attempt to open the saved file, I get: "Adobe Acrobat Reader could not open 'D:/myfile.pdf' because it is either not a supported file type or because the file has been damaged..." – gotube Mar 22 '20 at 05:03

Turns out this code does work. The PDF at the url in the code above happens to be corrupt. Pointing it to the PDF I wanted worked fine – gotube Apr 18 '20 at 16:20

score 2 · Answer 2 · answered Dec 24 '20 at 09:21

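You can also use the wget package to download PDFs from a link:

import wget  # third-party package (pip install wget)

# link is the URL of the PDF; the sample link from the question is used here for illustration
link = "http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf"
wget.download(link)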

Here's a guide about how to search & download all pdf files from a webpage in one go: https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48

score 0 · Answer 3 · answered May 10 '17 at 13:52

You can't download the PDF content from the given URL using requests or urllib, because the URL first points to another web page and only loads the PDF after that.
If in doubt, save the response as HTML instead of PDF and look at what you received.
You need to use a headless browser such as PhantomJS to download files from this kind of web page.

answered May 10 '17 at 13:52 by Karthikeyan KR
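
To check what was actually saved without opening it in a text editor, one option is to look at the first bytes of the file: PDF files begin with the signature `%PDF-`, while a CAPTCHA or redirect page will normally begin with HTML. This is a minimal sketch; `looks_like_pdf` and `file_looks_like_pdf` are hypothetical helper names. The check only rules out responses that are not PDFs at all and does not validate the rest of the file.

```python
PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(data):
    """Return True if the given bytes start with the PDF signature."""
    return data.startswith(PDF_SIGNATURE)


def file_looks_like_pdf(path):
    """Read only the first few bytes of the file and test them."""
    with open(path, "rb") as f:
        return looks_like_pdf(f.read(len(PDF_SIGNATURE)))
```

If `file_looks_like_pdf("9709_s11_qp_42.pdf")` returns False, the download most likely contains the intermediate web page rather than the document.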

How would a headless browser be of any use in this case? You still need to complete the captcha, which you can't do in a headless browser. – mata May 10 '17 at 15:06
