
I am helping somebody pull a large number (tens of thousands) of PDF files from a website. We know the pattern for the file names, but not all of the files will exist. I assume it is rude to request a file that does not exist, particularly at this scale. I am using Python, and in my tests with urllib I found that this snippet gets me the file if it exists:

s=urllib.urlretrieve('http://website/directory/filename.pdf','c:\\destination.pdf')

If the file does not exist, I get a file with the name I assigned, but it contains the text of their 404 page. I can handle this after I am done (read the files and delete all of the 404 pages), but that does not seem very nice to their server, nor is it very Pythonic.

I looked through the various functions in urllib, including urlretrieve, and do not see anything that tells me whether the file exists.

PyNEwbie
  • What's rude is pulling tens of thousands of PDF files. That little extra rudeness of some of the files not existing...eh. Doesn't even matter, next to that. – cHao Apr 03 '12 at 18:59
  • Well, we are going to do it when their traffic is down (weekends), and they don't have a restriction; the files are there to read, but for his research we need to collect a large number. – PyNEwbie Apr 03 '12 at 19:01
  • It's actually very Pythonic: _ask for forgiveness, not permission_. The Pythonic (and only, given the way the web works) thing to do is catch the 404s. I'd also like to note that they are not filenames, they are URLs. There is a difference: a URL does not mean there is an actual file on the server; the responses could be generated from a database or whatever. – Gareth Latty Apr 03 '12 at 19:01
  • @Lattyware: In which case you're not just taking up bandwidth; you're also making the server generate this PDF. Even ruder. – cHao Apr 03 '12 at 19:02
  • I can't seem to catch them before they are written to disk. I don't mind, but if there was a better way I wanted to learn it. – PyNEwbie Apr 03 '12 at 19:03
  • @PyNEwbie They are going to send the full response no matter what, so whether or not you write it to disk is really more of an issue for your end than for the server. – Gareth Latty Apr 03 '12 at 19:05
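
The "catch the 404s" approach from the comments could be sketched with the standard library alone. This is only an illustration and assumes Python 3, where `urllib.request.urlopen` raises `urllib.error.HTTPError` for a 404 response; the function name `fetch_if_exists` and its parameters are hypothetical, not part of the original question.

```python
import shutil
import urllib.error
import urllib.request


def fetch_if_exists(url, dest_path, timeout=30):
    """Save url to dest_path; return False (and write nothing) on a 404.

    Hypothetical helper for illustration only.
    """
    try:
        # urlopen raises HTTPError for 4xx/5xx responses, so the destination
        # file is only opened after the server has answered successfully.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            with open(dest_path, "wb") as out:
                shutil.copyfileobj(response, out)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # other HTTP errors are not treated as "file missing"
    return True
```

Because the output file is opened only after a successful response, a missing URL never leaves a 404 page on disk, which is the cleanup step the question wanted to avoid.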

1 Answer


You can check the status code of the response. It will be 200 for existing PDFs and 404 for nonexistent ones. The requests library makes this a lot easier:

>>> import requests
>>> r = requests.get('http://cdn.sstatic.net/stackoverflow/img/sprites.png')
>>> r.status_code
200
>>> r = requests.get('http://cdn.sstatic.net/stackoverflow/img/sprites.xxx')
>>> r.status_code
404
jterrace