Downloading links with Python urllib2

Question

I want to download an mp3 off of a page, but what I'm getting is just the html, and not the mp3 itself. The code I'm using is from this link here: https://stackoverflow.com/a/16518224/2137668

Why am I not able to get the mp3? Here's an example for testing that shows it gets downloaded as html: http://www5.zippyshare.com/d/77609120/61098/Cleavage%20-%20Prove%20%28Original%20Mix%29%20%5bquality-dance-music.com%5d.mp3

Many websites give you "container pages" when you browse to things like images, songs, and videos. This may be to improve your user experience, to make it harder for other sites to "deep-link" their content, or to make it harder for you to "steal" their content. — abarnert, Nov 18 '14 at 19:57
If you click the link it immediately tries to download the mp3, in chrome. Just checked Safari, and now see the redirect you were talking about. — yoyodunno, Nov 18 '14 at 20:00
If I click the link, it takes me to a web page where I can listen to the song on the page, click another link to download it, like it on Facebook, etc. — abarnert, Nov 18 '14 at 20:02
Yeah I see the redirect now when I used Safari. But when using chrome, the link just pops up the download popup. — yoyodunno, Nov 18 '14 at 20:05
I doubt the difference is Chrome vs. Safari. More likely it's a difference in how you followed the link. It may be whether you clicked the link from the HTML page vs. from somewhere else (based on the `Referer` header), or whether you clicked the link after getting a cookie by doing something on the HTML page, or who knows what. See my answer for more details. — abarnert, Nov 18 '14 at 20:07
I tried the exact same steps for fun, clicking the link from the original referer, and still get the same difference in behavior between the browsers. Thanks for the info, I am pretty decent with javascript so maybe I can come up with something. :) — yoyodunno, Nov 18 '14 at 20:13
Well, I was testing with Chrome when I got the HTML page, so I'm pretty sure it's not just the browser. But it's possible that one of the other things it checks is your User-Agent, refusing to let you download without a "real browser", and it's buggy enough not to recognize Safari. Anyway, you can read through the JS code on the page, and use the nifty web debugging tools that both Safari and Chrome include to see exactly what's being requested (including headers and form data, if relevant) when you get the download. — abarnert, Nov 18 '14 at 20:20
But again, remember that there's a good chance you're violating their ToS by trying to get around their protections, which could even be illegal or actionable where you live, so… check into that and make sure either it's OK, or you don't mind not being OK, before putting too much work into the technical side. — abarnert, Nov 18 '14 at 20:21
Hmm yeah that is weird. Looking at the code I wouldn't be surprised if it's buggy between browsers, it looks a bit sloppy haha. — yoyodunno, Nov 18 '14 at 20:26
My bank didn't let me use their site with Chrome for a couple months because their Chrome-detecting regex expected a single-digit major version number, and anything WebKit that didn't match Safari, Mobile Safari, or Chrome was Palm WebOS and therefore not supported… — abarnert, Nov 18 '14 at 20:41

abarnert · Accepted Answer · 2014-11-18T20:08:28.463

When I try to open that URL in a web browser, or with wget, I get a 302 redirection to http://www5.zippyshare.com/v/77609120/file.html, which is of course an HTML page.
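To see the same thing from Python, you can check where the request actually ended up and what content type came back. This is only a rough sketch: `looks_like_html` and `inspect_download` are names made up for illustration, and the import fallback is there so the same code runs where `urllib2` has become `urllib.request`.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent


def looks_like_html(content_type):
    """Return True if a Content-Type header value describes an HTML page."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    mime = (content_type or "").split(";")[0].strip().lower()
    return mime in ("text/html", "application/xhtml+xml")


def inspect_download(url):
    """Open url (hypothetical helper), following redirects, and report the result."""
    response = urllib2.urlopen(url)
    try:
        final_url = response.geturl()  # URL after any redirects
        content_type = response.info().get("Content-Type", "")
    finally:
        response.close()
    return final_url, content_type, looks_like_html(content_type)
```

If the final URL differs from the one you asked for and the content type is `text/html`, you were redirected to a page rather than handed the file.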

Many websites redirect you to such "container pages" (or just return them directly) when you browse to things like images, songs, and videos. This may be to improve your user experience, to make it harder for other sites to "deep-link" their content, or to make it harder for you to "steal" their content.

If it's one of the first two, often the answer is trivial: add a Referer header that points to the download page you got the link from (or, sometimes, to anything on the same site—even the same URL you're downloading).
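A minimal sketch of that Referer approach, assuming the site checks nothing else (which may well not be true for this particular site). `build_request` and `download` are hypothetical helper names, and the referer value you pass is whatever page you got the link from.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent


def build_request(url, referer):
    """Build a request that claims to come from the given referer page."""
    return urllib2.Request(url, headers={"Referer": referer})


def download(url, referer, filename, chunk_size=64 * 1024):
    """Stream the response body to filename in chunks."""
    response = urllib2.urlopen(build_request(url, referer))
    try:
        with open(filename, "wb") as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
    finally:
        response.close()
```

You would call `download(mp3_url, page_url, "song.mp3")`, where `mp3_url` and `page_url` are placeholders for the direct link and the page it appeared on. If you still get HTML back, the site is checking more than the Referer.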

If it's the third, they usually add a lot more protection than that. For just one example, they may require a cookie that you can only get by sitting on the download page and waiting out a 30-second timer, and that is only valid for 30 minutes.

If you understand HTTP and JavaScript well enough, and don't care about violating their terms of service, you can usually reverse-engineer each of their protections and write yourself a download script that'll work until they change things up next month, but that's usually not worth doing.

Anyway, given that this site is named zippyshare, I'm guessing it's the third of these. These kinds of sites make their money by showing you ads every time you download a file, and by prompting you to pay a monthly fee to get direct/accelerated/whatever downloads, and so on, so they will put all kinds of hurdles in the way of you downloading files directly without seeing those ads or paying that fee.
