2

Just curious: What do you find to be your best tools for creating automated screen scrapes these days? is the .Net Agility pack a good option? What do you do about scraping sites that use a lot of AJAX?

casperOne
JMarsch

4 Answers

6

I find that if the page has a fairly static layout, then the HTML Agility Pack is perfect for getting all the data I need. I have yet to run into a page that it couldn't handle or that didn't give me the results I wanted.
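As a rough, language-neutral illustration of the static case (this is not the HTML Agility Pack API; it swaps in Python's standard-library `html.parser`), the sketch below parses an already-downloaded page and pulls out every link. The `LinkCollector` class, the `extract_links` helper, and `SAMPLE_PAGE` are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> in a static page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return all (href, text) pairs found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup standing in for a downloaded static page.
SAMPLE_PAGE = """
<html><body>
  <ul>
    <li><a href="/item/1">First item</a></li>
    <li><a href="/item/2">Second <b>item</b></a></li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_PAGE):
        print(href, text)
```

This only works because everything of interest is already present in the downloaded markup; nothing here executes scripts.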

If you find that the page is rendered with a great deal of dynamic code, you're going to have to do more than just download the page; you'll have to actually execute it.

To do that, you'll need something like the WebKit .NET library (a .NET wrapper around the WebKit rendering engine), which will allow you to download the page and actually execute its JavaScript as well. Then, once you are sure the document has been rendered completely, you can extract the page details.
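The "once you are sure the document has been rendered completely" step usually comes down to polling some readiness check with a timeout. The following is a minimal sketch of that idea only, assuming a hypothetical `is_ready` callable; it does not use WebKit .NET or any real browser API, and `FakePage` is a made-up stand-in for a rendering engine.

```python
import time


def wait_until(is_ready, timeout=10.0, interval=0.25):
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page that finishes rendering after a few polls."""

    def __init__(self, polls_needed):
        self.polls_left = polls_needed

    def is_rendered(self):
        self.polls_left -= 1
        return self.polls_left <= 0


if __name__ == "__main__":
    page = FakePage(polls_needed=3)
    if wait_until(page.is_rendered, timeout=2.0, interval=0.05):
        print("page rendered; safe to read its contents")
    else:
        print("gave up waiting for the page to render")
```

With a real engine, the readiness check would be whatever signal that engine exposes (for example, a load-complete event or the presence of an expected element); which signal is reliable depends on the page.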

carla
casperOne
  • What about Internet Explorer? – Ivan G. Aug 12 '14 at 09:08
  • @aloneguid It's pretty much a bad idea because you'd have to use MSHTML, which means you'd then need a single-threaded COM apartment, and on the server side that becomes a huge issue. – casperOne Aug 12 '14 at 12:58
4

For the very basics I use:

I don't support JavaScript execution yet, but I'm planning on using Google's V8 JavaScript engine. This requires making calls to unmanaged code, but the performance of V8 justifies it.

Kiril
  • Not sure why you would use the Asynchronous HTTP Client when there is built-in support for async operations on the HttpWebRequest/HttpWebResponse instances. These operations drop down to the methods on the network level that use IO completion ports, not just using blocking sockets on another thread. – casperOne Sep 22 '11 at 14:22
  • @casperOne my implementation of the asynchronous HTTP client uses completion ports too. I made 8 runs with HttpWeb* vs Async Http Client on an Amazon EC2 machine (High CPU 8 virtual cores/20 computation units and 7GB RAM) and the HttpWeb* based client was about 25% slower. HttpWeb* got an average of about 90 web pages per second, whereas the Async Http Client got about 120 pages per second. – Kiril Sep 22 '11 at 14:54
  • Fair enough, but "much" is subjective; I don't qualify 25% as "much". You should put more quantitative values in your answer, IMO. – casperOne Sep 22 '11 at 15:02
  • @casperOne, I don't recall what the HttpWeb* results were in our data center, but the Async socket was cranking out more than 350 pages per second. Perhaps "much faster" is a bit strong, I'll change it to "notably faster." – Kiril Sep 22 '11 at 15:10
0

The best tool "these days" is one that not only gives you the desired features (JavaScript, automation) but is also one that you don't have to run yourself... I am, of course, alluding to using a cloud service. This approach will save you network bandwidth, deliver results faster (because it can scale better than a custom solution you'd likely end up developing) and, most importantly, save you the IT and maintenance headache.

On that note, check out a scraping solution called Bobik (http://usebobik.com). I've written an article about it at http://zscraper.wordpress.com/2012/07/03/a-comparison-shopping-android-app-without-backend/.

Hope this helps.

Yevgeniy
0

For automating screen scraping, Selenium is a good tool. There are two things to install: 1) Selenium IDE (works only in Firefox), and 2) Selenium RC Server.

After starting Selenium IDE, go to the site that you are trying to automate and start recording the actions you perform on the site. Think of it as recording a macro in the browser. Afterwards, you get the code output in the language you want.

Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.

I've uploaded a PowerPoint presentation that I made a while back. This should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html

At the above link, select the regular download option.

I spent a good amount of time figuring this out, so I thought it might save somebody else some time.

jeff musk