Why can't I download from S3 using wget?

Question

When I open https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv in a browser, the file downloads without a problem. But when I run

import wget
wget.download('https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv', out='data/')

I get a 404 error. Is there something wrong with the format of that URL?

This is not a duplicate of HTTP Error 404: Not Found when using wget to download a link. wget works fine with other files. The problem appears to be specific to S3, as explained in the answer below.

How do you know the issue is because of the url and not, let's say, the headers? — DeepSpace, Dec 18 '17 at 12:36
I wouldn't even know how to check that. What would I even be looking for? — Bob Wakefield, Dec 18 '17 at 12:36
You could just use the requests package. requests.get(url) should do it. — Prateek Dewan, Dec 18 '17 at 12:38
Possible duplicate of [HTTP Error 404: Not Found when using wget to download a link](https://stackoverflow.com/questions/44828446/http-error-404-not-found-when-using-wget-to-download-a-link) — p-a-o-l-o, Dec 18 '17 at 12:44

score 2 · Accepted Answer · answered Dec 18 '17 at 12:52

The root cause is a bug in S3, as described here: https://stackoverflow.com/a/38285197/4323

One workaround is to use the requests library instead:

import requests
r = requests.get('https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv')

This works fine. You can inspect r.text or write the response to a file. Note that r.text holds the whole body in memory, so for an efficient way to save a large download to disk, see https://stackoverflow.com/a/39217788/4323
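As a rough illustration of writing the response to a file, here is a minimal sketch that streams the body to disk in chunks instead of loading it all into memory. The helper name download_file, the chunk size, and the timeout are my own assumptions, not something from the question or the linked answers, and this is just one common streaming pattern with requests.

```python
import os
from urllib.parse import urlsplit

import requests


def download_file(url, out_dir, chunk_size=1 << 20):
    """Stream `url` into `out_dir` and return the path of the saved file.

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    os.makedirs(out_dir, exist_ok=True)

    # Use the last path segment of the URL as the local file name.
    filename = os.path.basename(urlsplit(url).path)
    dest = os.path.join(out_dir, filename)

    # stream=True defers downloading the body until we iterate over it.
    with requests.get(url, stream=True, timeout=60) as response:
        # Raise an exception on 4xx/5xx instead of saving an error page.
        response.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in response.iter_content(chunk_size=chunk_size):
                fh.write(chunk)

    return dest
```

With the URL from the question, the call would look like download_file('https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv', 'data/'). The raise_for_status() call means a 404 surfaces as an exception rather than as a silently saved error body.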
