
I am trying to write a Python script to crawl a rock climbing rankings website, but the site uses a combination of redirects and frames that defeats every attempt I've made at accessing the data at the URL. I've tried a few different crawler scripts, as well as curl on the command line, and none of them gets anything more than an empty document.

For reference, an example of the type of URL I am attempting to access is this:

http://www.8a.nu/Scorecard/AscentList.aspx?UserId=1476&AscentType=0&AscentClass=0&AscentListTimeInterval=1&AscentListViewType=0&GID=ea0fb3b90e4b0b655580384e07974b38

Which redirects to this URL:

http://www.8a.nu/?IncPage=http%3A//www.8a.nu/Scorecard/AscentList.aspx%3FUserId%3D1476%26AscentType%3D0%26AscentClass%3D0%26AscentListTimeInterval%3D1%26AscentListViewType%3D0%26GID%3Dea0fb3b90e4b0b655580384e07974b38

Which is, itself, a page containing several frames. Extra-confusingly, the author uses JavaScript to redirect to the main frame again if you try to view the frame by itself.

It seems as if the web server is refusing to serve any data for the contents of the frame unless it is actually enclosed in that frame. This is making it extremely difficult to programmatically access the contents of the frame. Any advice on how I can get at the contents of this frame would be hugely appreciated. At a deeper, more conceptual level, how the heck does the website know to refuse to serve the document when it's not in a frame?

James

3 Answers


The reason you are getting no response from curl (i.e. a zero Content-Length) is most likely that they have a mechanism to reject requests from anything that looks like a bot/spider.

They would do this based on the User-Agent header and possibly the Referer header.

You can easily circumvent this by specifying these along with the request like so:

wget --header="User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" --referer="http://www.8a.nu/?IncPage=http%3A//www.8a.nu/Scorecard/AscentList.aspx%3FUserId%3D1476%26AscentType%3D0%26AscentClass%3D0%26AscentListTimeInterval%3D1%26AscentListViewType%3D0%26GID%3Dea0fb3b90e4b0b655580384e07974b38" "http://www.8a.nu/Scorecard/AscentList.aspx?UserId=1476&AscentType=0&AscentClass=0&AscentListTimeInterval=1&AscentListViewType=0&GID=ea0fb3b90e4b0b655580384e07974b38"

Note the --header option to add the User-Agent header and the --referer option to add the Referer. This example uses wget, but you can easily use other approaches and simply set those two headers.
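For example, here is a minimal Python sketch using the requests library (my choice, not something the question mandates; any HTTP client that lets you set request headers works the same way), with the header values copied from the wget command above:

import requests

ASCENT_URL = ("http://www.8a.nu/Scorecard/AscentList.aspx"
              "?UserId=1476&AscentType=0&AscentClass=0"
              "&AscentListTimeInterval=1&AscentListViewType=0"
              "&GID=ea0fb3b90e4b0b655580384e07974b38")

headers = {
    # Pretend to be a regular desktop browser rather than a script.
    "User-Agent": ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) "
                   "Gecko/20100101 Firefox/25.0"),
    # Claim the request comes from the enclosing frameset page.
    "Referer": ("http://www.8a.nu/?IncPage=http%3A//www.8a.nu/Scorecard/"
                "AscentList.aspx%3FUserId%3D1476%26AscentType%3D0"
                "%26AscentClass%3D0%26AscentListTimeInterval%3D1"
                "%26AscentListViewType%3D0"
                "%26GID%3Dea0fb3b90e4b0b655580384e07974b38"),
}

response = requests.get(ASCENT_URL, headers=headers)
print(response.status_code, len(response.text))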

Since the site uses JavaScript to do the redirects (rather than a 301 or 302 response code), you don't need to worry about your HTTP client automatically following redirects, and you should be able to get the content programmatically.

You might need to load the outer page programmatically first and store a session-based ID for future requests, in case the site refuses requests that carry a stale ID.
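A rough sketch of that approach using requests.Session (how a fresh GID shows up in the outer page's HTML is an assumption, so the extraction regex below is hypothetical):

import re
import requests

OUTER_URL = ("http://www.8a.nu/?IncPage=http%3A//www.8a.nu/Scorecard/"
             "AscentList.aspx%3FUserId%3D1476%26AscentType%3D0"
             "%26AscentClass%3D0%26AscentListTimeInterval%3D1"
             "%26AscentListViewType%3D0"
             "%26GID%3Dea0fb3b90e4b0b655580384e07974b38")

session = requests.Session()
session.headers["User-Agent"] = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) "
                                 "Gecko/20100101 Firefox/25.0")

# Load the outer frameset page first so any cookies it sets stick to the session.
outer = session.get(OUTER_URL)

# Hypothetical extraction: pull the current AscentList frame URL (with whatever
# GID the server embedded) out of the outer page's HTML; the markup is an assumption.
match = re.search(r'(/Scorecard/AscentList\.aspx\?[^"\']+)', outer.text)
if match:
    frame_url = "http://www.8a.nu" + match.group(1).replace("&amp;", "&")
    listing = session.get(frame_url, headers={"Referer": OUTER_URL})
    print(listing.status_code, len(listing.text))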

Edit: I just had a look at http://www.8a.nu/js/commonScripts.js, and that JavaScript file is where they detect whether the page is being loaded outside its iframe.

cosjav

Maybe you need to use a headless browser to solve this problem.

A brief explanation of headless browsers is here.

As far as I know, there are three common ways to get a headless browser, namely:

  • QtWebKit: Here's an example
  • Selenium: Here's an example
  • PhantomJS: Here's the site. Its documentation is very clear and easy to use.

All three can render web pages just like a real browser, and they save you a lot of time and code when dealing with the measures a site takes to prevent spiders from crawling it.

I recommend PhantomJS. It's fast and easy to use. Here is a question about integrating Python and PhantomJS.
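For example, one common way to drive PhantomJS from Python is through Selenium's PhantomJS driver (deprecated in newer Selenium releases, but current at the time of writing). A minimal sketch, assuming the PhantomJS binary is on the PATH and that the ascent list sits in an iframe whose src contains AscentList.aspx (an assumption; inspect the page to be sure):

from selenium import webdriver

OUTER_URL = ("http://www.8a.nu/?IncPage=http%3A//www.8a.nu/Scorecard/"
             "AscentList.aspx%3FUserId%3D1476%26AscentType%3D0"
             "%26AscentClass%3D0%26AscentListTimeInterval%3D1"
             "%26AscentListViewType%3D0"
             "%26GID%3Dea0fb3b90e4b0b655580384e07974b38")

driver = webdriver.PhantomJS()  # requires the PhantomJS binary on the PATH
driver.get(OUTER_URL)

# The ascent list lives inside a frame, so switch into it before reading the HTML.
for frame in driver.find_elements_by_tag_name("iframe"):
    if "AscentList.aspx" in (frame.get_attribute("src") or ""):
        driver.switch_to.frame(frame)
        break

html = driver.page_source  # rendered HTML of the current frame
driver.quit()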

A tip: when using PhantomJS, it can help to put Squid in front of it as a caching proxy. That avoids repeating some HTTP requests and speeds up the process.
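For example, assuming Squid is running locally on its default port 3128, you can point the Selenium PhantomJS driver at it via PhantomJS's standard proxy switches:

from selenium import webdriver

# Assumes a local Squid instance on its default port 3128; --proxy and
# --proxy-type are PhantomJS command-line switches passed through service_args.
driver = webdriver.PhantomJS(service_args=["--proxy=127.0.0.1:3128",
                                           "--proxy-type=http"])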

flyer

I was doing exactly this yesterday. I had 4,000 websites to capture. Without Squid, I could get a screenshot of each one in 2.5 seconds (6 threads). With Squid, it took 7.8 seconds. Without Squid, only 3 of the 4,000 websites failed. With Squid, 179 failed.

Conclusion: Squid is useful when you access many similar pages, but not when you have 4,000 completely different websites. Please do not use it in that case.

Chen