I am trying to download a file with urllib. I am using a direct link to a .rar file (opening this link in Chrome immediately starts the download), but when I run the following code:

import urllib

file_name = url.split('/')[-1]  # `url` holds the direct link to the .rar file
u = urllib.urlretrieve(url, file_name)

... all I get back is a 22 KB .rar file, which is obviously wrong. What is going on here? I'm on OS X Mavericks with Python 2.7.5, and here is the URL.

(Disclaimer: this is a free download, as seen on the band's website.)

sbmsr
  • Have you tried looking at the zip file or calling `file` on it? – Nick Beeuwsaert Jan 09 '14 at 21:23
  • Would be helpful to see the URL that you are using for this in order to troubleshoot. – Chris Simpkins Jan 09 '14 at 21:25
  • To get a filename from a URL, the `urlparse` and `posixpath` modules might help. See [`url2filename()` function](http://stackoverflow.com/a/20478401/4279). – jfs Jan 09 '14 at 21:29
  • The site might return different content to a script than to a web browser (no JavaScript, no cookies). Check the downloaded file. It might be an HTML page with an error message. – jfs Jan 09 '14 at 21:31
  • @ChrisSimpkins I just added the URL to my question – sbmsr Jan 09 '14 at 22:34
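
As a rough illustration of the `urlparse`/`posixpath` suggestion in the comments, here is a minimal sketch. The helper name `filename_from_url` is made up for this example and is not the linked `url2filename()` function; the import fallback is only there so the same code runs on Python 2.7 and Python 3.

```python
import posixpath

try:                                  # Python 3
    from urllib.parse import urlsplit, unquote
except ImportError:                   # Python 2.7, as used in the question
    from urlparse import urlsplit
    from urllib import unquote


def filename_from_url(url):
    """Return the decoded last path segment of `url` (hypothetical helper)."""
    # urlsplit() separates the path from any query string or fragment,
    # and unquote() turns escapes such as %20 back into characters.
    path = unquote(urlsplit(url).path)
    name = posixpath.basename(path)
    if not name or name in ('.', '..'):
        raise ValueError('no usable file name in URL: %r' % (url,))
    return name
```

Unlike `url.split('/')[-1]`, this ignores a trailing query string and decodes percent-escapes, so the zippyshare link quoted in this thread would be saved as `Del Paxton-Worst. Summer. Ever EP (2013).rar`.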

2 Answers

Got it. The request headers were missing a lot of information. I resorted to using Requests, and with each GET request I added the following to the headers:

'Connection': 'keep-alive'
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'
'Cookie': 'JSESSIONID=36DAD704C8E6A4EF4B13BCAA56217961; ziplocale=en; zippop=2;'

I later noticed that not all of this is necessary (the Cookie header alone is enough), but it did the trick - I was able to download the entire file. If you are using urllib2, I am sure that doing the same (sending requests with the appropriate header content) would work too. Thank you all for the good tips, and for pointing me in the right direction. I used Fiddler to see what my Requests GET headers were missing in comparison to Chrome's GET headers. If you have a similar issue, I suggest you check it out.
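
For anyone who wants to try the same idea without Requests, here is a minimal standard-library sketch. It is not the code I actually ran: `build_request` and `download` are names made up for this example, and the cookie value you pass must be a current one copied from your own browser session (the one quoted above will have expired).

```python
try:                                  # Python 3
    from urllib.request import Request, urlopen
except ImportError:                   # Python 2.7 (urllib2)
    from urllib2 import Request, urlopen


def build_request(url, cookie, user_agent=None):
    """Build a GET request carrying the session cookie (hypothetical helper)."""
    headers = {'Cookie': cookie}
    if user_agent:
        # Optional: the Cookie header alone was enough in my case.
        headers['User-Agent'] = user_agent
    return Request(url, headers=headers)


def download(request, file_name, chunk_size=64 * 1024):
    """Stream the response body to disk in chunks instead of one big read()."""
    response = urlopen(request)
    try:
        with open(file_name, 'wb') as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
    finally:
        response.close()
```

Usage would look like `download(build_request(url, 'JSESSIONID=...; ziplocale=en; zippop=2;'), file_name)`, with the cookie string replaced by whatever Fiddler or the browser's developer tools show for your session.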

sbmsr

I tried this in Python using the following code, which replaces urllib with urllib2:

import urllib2

url = "http://www29.zippyshare.com/d/12069311/2695/Del%20Paxton-Worst.%20Summer.%20Ever%20EP%20%282013%29.rar"

file_name = url.split('/')[-1]
response = urllib2.urlopen(url)
data = response.read()
with open(file_name, 'wb') as bin_writer:
    bin_writer.write(data)

and I get the same 22 KB file. Fetching that URL with wget yields the same file; however, I was able to start the download of the full file (around 35 MB, as I recall) by pasting the URL into the Chrome address bar. Perhaps the server returns different content depending on the headers sent with the request? The User-Agent request header sent by Python or wget does not look like a browser's, so the server may treat those requests differently from the one your browser sends when you click the button.

I did not open the .rar archives to inspect the two files.
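
A quick way to inspect the small file without opening it in an archive tool is to look at its first bytes. The sketch below is only an illustration: `classify_head` and `sniff_download` are hypothetical helper names, and the check relies on the fact that RAR archives begin with the signature bytes `Rar!\x1a\x07`, whereas an error or landing page would normally begin with HTML markup.

```python
RAR_MAGIC = b'Rar!\x1a\x07'   # leading bytes shared by RAR 4 and RAR 5 archives


def classify_head(head):
    """Classify the first bytes of a download as 'rar', 'html' or 'unknown'."""
    if head.startswith(RAR_MAGIC):
        return 'rar'
    # Ignore leading whitespace and letter case when looking for HTML markup.
    lowered = head.lstrip().lower()
    if lowered.startswith(b'<!doctype html') or lowered.startswith(b'<html'):
        return 'html'
    return 'unknown'


def sniff_download(path, peek=512):
    """Read the first `peek` bytes of the file at `path` and classify them."""
    with open(path, 'rb') as f:
        return classify_head(f.read(peek))
```

If `sniff_download(file_name)` reports `'html'` for the 22 KB file, the server sent a web page rather than the archive.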

This thread discusses setting headers with urllib2, and this is the Python documentation on how to read the response status codes from your urllib2 request, which could be helpful as well.
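
To make the status-code suggestion concrete, here is a small sketch of what one could check on the response before saving anything. `check_response` is a hypothetical helper; it only relies on the `geturl()`, `getcode()` and `info()` methods that urllib2 responses provide (Python 3's `urllib.request` responses still offer them as well, as far as I know, although newer attributes are preferred there).

```python
def check_response(requested_url, response):
    """Return a list of reasons why `response` may not be the file itself."""
    problems = []
    status = response.getcode()
    if status != 200:
        problems.append('status %s' % status)
    # After redirects are followed, geturl() reports the URL actually served.
    final_url = response.geturl()
    if final_url != requested_url:
        problems.append('redirected to %s' % final_url)
    # A browser-oriented landing page is normally served as text/html.
    content_type = response.info().get('Content-Type', '')
    if content_type.lower().startswith('text/html'):
        problems.append('content type is %s' % content_type)
    return problems
```

Calling `check_response(url, urllib2.urlopen(url))` and getting a non-empty list back would suggest that the server answered with something other than the archive, for example a redirect to an HTML page.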

Chris Simpkins
  • Thanks Chris, I realize that I am being redirected to this [link](http://www29.zippyshare.com/v/12069311/file.html). I copied and pasted my Chrome User-Agent info into my request header, but I keep getting redirected. I'll keep trying. Thank you thus far. – sbmsr Jan 10 '14 at 00:37