I have my own Python crawler (based on CS101 from Udacity.com), and I am trying to download files (installers) from download.cnet.com. While the crawler is crawling, I want it to work like this:
First, check whether the link is a download link:
import urllib2

# Request the URL and read the Content-Type header of the response
response = urllib2.urlopen('http://example.com/')
content_type = response.info().get('Content-Type')
print content_type
If the crawler gets:
application/octet-stream
then the crawler should download the installer from that link.
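To make the intended behaviour concrete, here is a rough sketch of the check I have in mind. The names `is_download_type`, `is_download_link` and `DOWNLOAD_TYPES` are placeholders of mine, and treating only application/octet-stream as a download is my own assumption. The import is written so that it runs on either Python 2 or Python 3.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

# Assumption: only this media type counts as an installer download.
DOWNLOAD_TYPES = ('application/octet-stream',)


def is_download_type(content_type):
    """Return True if a Content-Type header value marks a binary download."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=binary" and normalise the case.
    media_type = content_type.split(';')[0].strip().lower()
    return media_type in DOWNLOAD_TYPES


def is_download_link(url):
    """Fetch url and check the Content-Type of the final response.

    urlopen follows HTTP redirects by default, so the header checked here
    belongs to whatever the link finally resolves to.
    """
    response = urlopen(url)
    try:
        return is_download_type(response.info().get('Content-Type'))
    finally:
        response.close()
```

The crawler would call `is_download_link(url)` for each link it finds and save the response body only when the result is True.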
The problem is that download.com doesn't seem to provide the real download link, and my crawler can't find it among their dynamic links. For example, when I tried to download Opera from download.com, the page showed a message like this: "Your download will begin in a moment. If it doesn't, restart the download." When I inspected the "restart the download" link, I expected to get the real download link (e.g. download.com/blah/Opera.exe), but instead I got some weird address that my crawler couldn't understand.
I have confirmed from http://googlewebmastercentral.blogspot.no/2008/09/dynamic-urls-vs-static-urls.html that download.com is using dynamic links. What should I do so that my crawler can find this link and download the installer from download.com?