
I am downloading multiple CSV files from a website using Python. I would like to be able to check the response code on each request.

I know how to download the file using wget, but not how to check the response code:

import os

os.system('wget http://example.com/test.csv')

I've seen a lot of people suggesting using requests, but I'm not sure that's quite right for my use case of saving CSV files.

import requests

r = requests.get('http://example.com/test.csv')
r.status_code # 200
# Pipe response into a CSV file... hm, seems messy?

What's the neatest way to do this?

Richard
  • Check http://stackoverflow.com/questions/2467609/using-wget-via-python and perhaps https://docs.python.org/2/library/urllib.html#urllib.FancyURLopener – Tim Mar 09 '15 at 23:30
  • I don't see anything particularly wrong with the requests approach: you could alternatively use [urllib.urlretrieve](https://docs.python.org/2/library/urllib.html#urllib.urlretrieve) and check the header returned afterwards (see the first sketch after these comments) – Jon Clements Mar 09 '15 at 23:31
  • The natural question that arises from your posting: what do you want to do if the status_code is not 200? Do you want to throw the (partial/corrupt) data away? Move the suspect files into a different directory, write the URLs for those into some sort of error log? What you do with the status is a policy decision but guides the structure of the code around it. – Jim Dennis Mar 09 '15 at 23:44
  • @JimDennis thanks for this. I'm writing a script that will let people download a lot of data, and I need it to warn them if any of the data is in any way corrupt or incomplete. So I guess the answer is "print a warning and move the file". – Richard Mar 17 '15 at 10:04
  • I would recommend that you open the file via a temporary name (use the `tempfile` module's NamedTemporaryFile() function), then rename it only if the transfer is successful. If there's an older version of the file present I'd use a "link dance" to hard link it to a ".old" or ".$(date ...)" name, then hard link the old name to the temporary file (then unlinking the temp. file, leaving only the good file). Using this process will provide the best data integrity guarantees (see the second sketch below). – Jim Dennis Mar 17 '15 at 19:20
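
A minimal sketch of the `urlretrieve` route Jon Clements mentions, using the Python 3 spelling (`urllib.request`) rather than the linked Python 2 docs; here a bad status surfaces as an exception rather than a header check, and the URL and filename are placeholders:

from urllib.request import urlretrieve
from urllib.error import HTTPError

try:
    # urlretrieve() raises HTTPError for non-2xx responses,
    # so no manual status check is needed
    path, headers = urlretrieve('http://example.com/test.csv', 'test.csv')
except HTTPError as e:
    print('Download failed with status', e.code)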

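A minimal sketch of Jim Dennis's temporary-file-plus-rename pattern (without the hard-link dance; the `download` helper, URL handling, and chunk size are assumptions, not from the thread):

import os
import requests
from tempfile import NamedTemporaryFile

def download(url, dest):
    # Write into a temp file in the destination directory so the final
    # rename stays on one filesystem and is atomic on POSIX systems
    r = requests.get(url, stream=True)
    tmp = NamedTemporaryFile(dir=os.path.dirname(dest) or '.', delete=False)
    try:
        r.raise_for_status()  # bail out on any non-2xx status
        for chunk in r.iter_content(chunk_size=8192):
            tmp.write(chunk)
        tmp.close()
        os.rename(tmp.name, dest)  # expose the file only once it is complete
    except Exception:
        tmp.close()
        os.unlink(tmp.name)  # discard the partial download
        raise
    finally:
        r.close()
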
1 Answer


You can use the `stream=True` argument - along with `iter_content()` it's possible to stream the response contents right into a file (see the requests docs):

import requests

r = None
try:
    r = requests.get('http://example.com/test.csv', stream=True)
    # iter_content() yields bytes, so open the output file in binary mode
    with open('test.csv', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
finally:
    if r is not None:
        r.close()  # release the connection
Maciej Gol
  • I think this is basically what OP means by *Pipe response into a CSV file* – Tim Mar 09 '15 at 23:32
  • @TimCastelijns, yeah, the status code part has already been covered by the OP – Maciej Gol Mar 09 '15 at 23:33
  • I mean I'm pretty sure he's looking for a way that doesn't involve manually storing the result in a CSV with python code – Tim Mar 09 '15 at 23:34
  • @TimCastelijns, if by manually you mean there is no simple one-liner for this, just create a utils function that does exactly that. Other than that, I think it's perfectly fine to download the file from inside Python. – Maciej Gol Mar 09 '15 at 23:42
  • I also think that's fine - don't get me wrong. I just think that OP knows he can do it like this, but doesn't want to because it seems messy to him – Tim Mar 09 '15 at 23:47
  • @TimCastelijns, guess we will need to wait for the OP to give his judgement ;-) – Maciej Gol Mar 09 '15 at 23:48
  • Thanks for this, everyone. If there is no way that doesn't involve piping the file, then that is what I will do! Basically I want my script to exit if there is *any* problem with the file: a bad request code, or a problem during streaming. So it would be good to know if there is any way of making the code above more robust, too. – Richard Mar 17 '15 at 10:03
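
To address the robustness question in the last comment: a minimal sketch (an addition, not part of the original answer; the URL and chunk size are placeholders) that exits on a bad status code or on any failure during streaming, since `raise_for_status()` and all of requests' network errors raise subclasses of `requests.RequestException`:

import sys
import requests

url = 'http://example.com/test.csv'
try:
    r = requests.get(url, stream=True)
    r.raise_for_status()  # a non-2xx status raises requests.HTTPError
    with open('test.csv', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)  # interrupted transfers also raise RequestException subclasses
    r.close()
except requests.RequestException as e:
    # Covers bad status codes, connection errors, and mid-stream failures
    sys.exit('Download of %s failed: %s' % (url, e))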