
I have my own Python crawler (based on CS101 from Udacity.com) and I'm trying to use it to download files (installers) from download.cnet.com. While the crawler is crawling, I want it to work like this:

  1. Tell if the link is a download link:

    import urllib2

    response = urllib2.urlopen('http://example.com/')
    # The Content-Type header tells us what kind of resource the URL serves.
    content_type = response.info().get('Content-Type')
    print content_type

  2. If the crawler gets:

    application/octet-stream
    
  3. The crawler downloads the installer from that link.
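The three steps above can be sketched roughly as follows. This is only a minimal illustration, written for Python 3 with `urllib.request` (the successor to `urllib2`). The helper names `is_download` and `fetch_if_download` and the `DOWNLOAD_TYPES` set are made up for this example, and treating `application/octet-stream` as the only download type is an assumption taken from step 2.

```python
import shutil
import urllib.request

# Assumption: only this MIME type marks an installer; extend the set as needed.
DOWNLOAD_TYPES = {"application/octet-stream"}


def is_download(content_type):
    """Return True if a Content-Type header value names a binary download."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=binary" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in DOWNLOAD_TYPES


def fetch_if_download(url, dest_path):
    """Open url and save the response body to dest_path if it is a download."""
    # urlopen follows HTTP redirects, so the header checked here belongs to
    # the final response rather than to the original link.
    with urllib.request.urlopen(url) as response:
        if not is_download(response.headers.get("Content-Type")):
            return False
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(response, out)
        return True
```

Because the check looks at the response header rather than the shape of the URL, it does not matter whether the link ends in `.exe`. It will not help, though, if the page only starts the download from JavaScript.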

The problem is that download.com doesn't seem to expose the real download link, and my crawler can't find it among their dynamic links. For example, when I tried to download Opera from download.com, the page showed this message: "Your download will begin in a moment. If it doesn't, restart the download." When I inspected the "restart the download" link, I expected a real download link (e.g. download.com/blah/Opera.exe), but instead I got a weird address that my crawler couldn't understand.

From http://googlewebmastercentral.blogspot.no/2008/09/dynamic-urls-vs-static-urls.html I have concluded that download.com is using dynamic URLs. What should I do so that my crawler can find this link and download the installer from download.com?

Deming

1 Answer


As you've said, it is likely that you're getting JavaScript or AJAX in the page which activates the download in a "real" browser while stymying your efforts to simply automate it.

Here's another discussion of the same issue: StackOverflow: Mechanize and JavaScript. As noted there, one option is to move beyond plain HTTP requests from Python and use a headless browser such as PhantomJS, or a browser automation framework (with optional "remote control") such as Selenium.

Jim Dennis
  • I agree. I would use a headless browser such as PhantomJS or HtmlUnitDriver, driven through Selenium. – djangofan Apr 06 '13 at 21:30