8

I was looking for a way to download PDF files in Python, and I saw answers to other questions recommending the urllib module. I tried to download a PDF file with it, but when I try to open the downloaded file, a message says that the file cannot be opened.

This is the code I used:

import urllib
urllib.urlretrieve("http://papers.gceguide.com/A%20Levels/Mathematics%20(9709)/9709_s11_qp_42.pdf", "9709_s11_qp_42.pdf")

What am I doing wrong? Also, the file automatically saves to the directory my python file is in. How do I change the location to which it gets saved?
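On the save location: the second argument to `urlretrieve` is the path to write to, so passing a full path instead of a bare filename changes where the file lands. Below is a minimal sketch, assuming Python 3 (where `urlretrieve` lives in `urllib.request`). The helper names `destination_path` and `download` and the `downloads` folder are made up for illustration.

```python
import os
import urllib.request
from urllib.parse import unquote, urlsplit


def destination_path(url, directory):
    """Return <directory>/<last segment of the URL path>, with %XX escapes decoded."""
    filename = os.path.basename(unquote(urlsplit(url).path))
    return os.path.join(directory, filename)


def download(url, directory):
    """Download url into directory (created if missing) and return the saved path."""
    os.makedirs(directory, exist_ok=True)
    dest = destination_path(url, directory)
    urllib.request.urlretrieve(url, dest)
    return dest


# Example call (hypothetical target folder); this performs a real network request:
# download("http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf", "downloads")
```

This only controls where the bytes are written. It does not change what the server sends back.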

Edit- I tried again with the link to a sample pdf, http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf

The code is working with this link, so why won't it work for the other one?

tiredandsarcastic
  • 2
    You can use `requests` for this task: http://stackoverflow.com/questions/34503412/download-and-save-pdf-file-with-python-requests-module – Kshitij Saraogi May 10 '17 at 12:10
  • @DavidZemens I won't call it a duplicate. The OP is concerned about his solution not working rather than finding a different one. – Kshitij Saraogi May 10 '17 at 12:13
  • 1
    When I go to that URL I first get a captcha (from Cloudflare) to prove that I'm not a robot, and only then can I access the PDF. Also, Cloudflare sites often restrict access based on user agent. If you open the file in a text editor you'll probably find HTML there instead of a PDF. – mata May 10 '17 at 12:17
  • 1
    You didn't actually download a PDF from that URL - you downloaded the CAPTCHA form needed to access the PDF. – jasonharper May 10 '17 at 12:18
  • So is there any way I can download files like that?? – tiredandsarcastic May 10 '17 at 12:30
  • You'd probably need to complete the captcha in a browser, take the cookies that were set and user agent from the browser and use those in your request. That may work for a while, but you may be presented with a new captcha after some time. – mata May 10 '17 at 12:45
  • @mata uhh how would you do that lmao – tiredandsarcastic May 10 '17 at 12:50
  • If you use the above mentioned [`requests`](http://python-requests.org) module, sending [cookies](http://docs.python-requests.org/en/master/user/quickstart/#cookies) and a custom [user agent](http://docs.python-requests.org/en/master/user/quickstart/#custom-headers) should be easy. Where to find them depends on your browser. – mata May 10 '17 at 12:55
  • Try a crawler; you will need to start a session on the website – tumbleweed May 10 '17 at 14:34
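
The idea from the comments of reusing a browser's cookies and user agent can be sketched roughly as follows. This is only an illustration: it uses the standard library's `urllib.request` rather than the `requests` module mentioned above, the function names are made up, and the cookie name and user agent string in the example call are placeholders that would have to be copied from a real browser session. Nothing here guarantees the site will accept the request, and the captcha may come back after a while.

```python
import urllib.request


def build_request(url, user_agent, cookies):
    """Return a Request carrying a browser-like User-Agent and a Cookie header."""
    headers = {"User-Agent": user_agent}
    # Cookies are sent as a single header: "name1=value1; name2=value2"
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    if cookie_header:
        headers["Cookie"] = cookie_header
    return urllib.request.Request(url, headers=headers)


def download(url, dest, user_agent, cookies):
    """Fetch url with the given headers and write the raw bytes to dest."""
    req = build_request(url, user_agent, cookies)
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as f:
        f.write(resp.read())


# Example call with placeholder values (performs a real network request):
# download(url, "paper.pdf", "<user agent copied from the browser>",
#          {"<cookie name>": "<cookie value>"})
```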

3 Answers

13

Try this. It works.

import requests

url = 'https://pdfs.semanticscholar.org/c029/baf196f33050ceea9ecbf90f054fd5654277.pdf'
r = requests.get(url, stream=True)

# Open the target in binary mode so the PDF bytes are written unchanged.
with open('C:/Users/MICRO HARD/myfile.pdf', 'wb') as f:
    f.write(r.content)

waka
Fensa Saj
  • When I attempt to open the saved file, I get: "Adobe Acrobat Reader could not open 'D:/myfile.pdf' because it is either not a supported file type or because the file has been damaged..." – gotube Mar 22 '20 at 05:03
  • 1
    Turns out this code does work. The PDF at the url in the code above happens to be corrupt. Pointing it to the PDF I wanted worked fine – gotube Apr 18 '20 at 16:20
2

You can also use the `wget` package to download PDFs from a link:

import wget

wget.download(link)

Here's a guide about how to search & download all pdf files from a webpage in one go: https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48

x89
0
  • You can't download the PDF content from the given URL using requests or urllib.
  • This is because the given URL first points to another web page, and only after that does it load the PDF.
  • If in doubt, save the response as HTML instead of PDF.
  • You need to use a headless browser such as PhantomJS to download files from this kind of web page.
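
One quick way to check what was actually saved, without opening it in a viewer, is to look at the first few bytes: a PDF file normally begins with the signature `%PDF-`, while an HTML page will not. A small sketch, with function names invented for illustration:

```python
PDF_SIGNATURE = b"%PDF-"


def is_pdf_bytes(head):
    """Return True if the given leading bytes start with the PDF signature."""
    return head.startswith(PDF_SIGNATURE)


def is_pdf_file(path):
    """Read just enough of the file at path to check for the PDF signature."""
    with open(path, "rb") as f:
        return is_pdf_bytes(f.read(len(PDF_SIGNATURE)))


# Example: is_pdf_file("9709_s11_qp_42.pdf") returning False suggests that
# something other than a PDF (for instance an HTML page) was saved.
```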
Karthikeyan KR
  • How would a headless browser be of any use in this case? You still need to complete the captcha, which you can't do in a headless browser. – mata May 10 '17 at 15:06