How would I go about downloading and executing (i.e. evaluating JavaScript, building the DOM) in excess of 1000 XHTML documents per minute?

Some constraints:

  • URLs to be downloaded are on different servers.
  • I need to traverse - and ideally modify - the resulting DOM.
  • No interest in rendering the graphics.
  • Bandwidth is not an issue.
  • Overly massive hardware parallelization would be a problem rather than a solution.
  • The production environment is .NET.

I am not so concerned about downloading the pages; I expect the actual execution of each page to be the bottleneck. .NET has a built-in WebBrowser control, but I have no idea whether it would scale on a single machine. Also, .NET is not an absolute requirement, but it would make integration around here easier.
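
For reference, here is roughly what driving the built-in control would look like (a minimal sketch; the URL and the event wiring are purely illustrative). It needs an STA thread and a message pump, which is part of why I doubt it scales:

    using System;
    using System.Windows.Forms;

    class WebBrowserProbe
    {
        [STAThread]
        static void Main()
        {
            var browser = new WebBrowser { ScriptErrorsSuppressed = true };
            browser.DocumentCompleted += (s, e) =>
            {
                // By now the DOM is built and the page's scripts have run,
                // so the document could be traversed (and modified) here.
                Console.WriteLine(browser.Document.Body.InnerText);
                Application.ExitThread();
            };
            browser.Navigate("http://example.com/");
            Application.Run(); // pump messages so navigation can complete
        }
    }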

I'd be grateful for any comments/pointers regarding:

  • Which browser API is most suited to do this?
  • Is a browser the right way to go - or is there a more lightweight way to execute the JavaScript, which is the most important part (... but which does not provide a DOM)?
  • What existing products/services - be they open source or commercial - may accomplish the task?
  • Roughly how many pages per minute can I expect to handle on a single machine (anyone remember the "3 ms" Chrome rendering commercial)?
  • Any pitfalls one is likely to encounter...

Thank you in advance,

/David

OG Dude
  • Start by purchasing a really, really big computer :-) If you don't do it in a browser, it's going to be really hard to ensure that the pages work properly; any JavaScript code is *very* likely to assume it can do normal DOM manipulations. – Pointy Feb 01 '11 at 15:11
  • Oh, and the throughput is definitely going to depend on the metrics for these "pages" and the nature of the JavaScript code on them. – Pointy Feb 01 '11 at 15:14
  • Is this something you'd be running occasionally, like a load tester, or will it be running every day? – mbeckish Feb 01 '11 at 15:15
  • What are you trying to accomplish here? – epascarello Feb 01 '11 at 15:36
  • It would run continuously. Final goal: Extract text content of some nodes. For sites with AJAX and company I need to make sure that all the content is there, hence the requirement to actually "execute" the page. – OG Dude Feb 01 '11 at 17:56

3 Answers


Look at one of the headless browsers for .NET; they will be faster than the WebBrowser control, as they don't need to render a graphical view.

I don't know whether this will get you to 1000 pages per minute, but it should be much faster than the control.

Here is one.

Here is a blog post about using HtmlUnit as a headless browser.

And an SO question about headless browsers.
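
For illustration, here is a rough sketch of driving HtmlUnit from .NET after converting it to a .NET assembly with IKVM (an approach a commenter below uses in production). The Java-style names (getPage, waitForBackgroundJavaScript, asText) come from HtmlUnit's own API and survive the IKVM conversion, but treat the details as assumptions and check them against the version you build:

    using com.gargoylesoftware.htmlunit;
    using com.gargoylesoftware.htmlunit.html;

    class HeadlessFetch
    {
        static void Main()
        {
            // HtmlUnit emulates a browser in memory; nothing is ever rendered.
            var webClient = new WebClient(BrowserVersion.FIREFOX_3);
            webClient.setJavaScriptEnabled(true);

            // getPage downloads the document, builds the DOM and runs its scripts.
            var page = (HtmlPage)webClient.getPage("http://example.com/");

            // Give AJAX-triggered background scripts a chance to finish.
            webClient.waitForBackgroundJavaScript(2000);

            // The DOM can now be traversed or modified; here we just dump the text.
            System.Console.WriteLine(page.asText());

            webClient.closeAllWindows();
        }
    }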

Oded
  • Well, at that point 1000 pages / minute is just an issue of computing power. It should be able to scale to multiple servers. – Zachary K Feb 01 '11 at 16:55
  • 1
    +1 I currently use HtmlUnit converted to a .Net Assembly using IKVM to page scrap hundreds of queries off of a javascript based web query interface. To maintain 1000 pages an hour will be difficult, and I am unsure how much control it has over modifying the DOM, but otherwise it is the only reliable solution for mimicking javascript on that scale. Tools like WATIN and Selenium will either be too slow or inaccurate. – keithwill Feb 01 '11 at 20:11

I have an application, implemented in WinForms, that processes ~7,800 URLs in approximately 5 minutes (it downloads each URL, parses the content, looks for specific pieces of data, and, if it finds what it's looking for, does some additional processing on that page).

This specific application used to take between 26 and 30 minutes to run, but moving the code to the TPL (Task Parallel Library in .NET 4.0) brought that down to just 5. The machine is a Dell T7500 workstation with dual quad-core Xeon processors (3 GHz), 24 GB of RAM, and Windows 7 Ultimate 64-bit.

I simply use WebClient, Stream, and StreamReader objects within a Parallel.ForEach() loop, and it's extremely fast.
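
The core of it is no more complicated than the sketch below (trimmed down; LoadUrls and ProcessPage are placeholders for my own sourcing and parsing code, DownloadString stands in for the Stream/StreamReader handling, and the degree of parallelism is something you tune per machine):

    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Threading.Tasks;

    class Crawler
    {
        static void Main()
        {
            // WebClient opens only 2 connections per host by default; raise the
            // limit or parallel downloads will queue up behind each other.
            ServicePointManager.DefaultConnectionLimit = 100;

            IEnumerable<string> urls = LoadUrls(); // however you source the ~7,800 URLs

            Parallel.ForEach(
                urls,
                new ParallelOptions { MaxDegreeOfParallelism = 32 }, // tune per machine
                url =>
                {
                    try
                    {
                        using (var client = new WebClient())
                        {
                            string content = client.DownloadString(url);
                            ProcessPage(url, content); // parse, extract, follow-up work
                        }
                    }
                    catch (WebException ex)
                    {
                        Console.Error.WriteLine(url + ": " + ex.Message);
                    }
                });
        }

        static IEnumerable<string> LoadUrls() { return new string[0]; }
        static void ProcessPage(string url, string content) { /* ... */ }
    }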

Probably not the exact solution you're looking for, but unlike most of the other postings here, this actually does "process 1,000 pages / minute" [and more].

Food for thought ...

BonanzaDriver

I think Node.js can do a lot of what you want, and do it fast, if you are not married to a .NET solution. It definitely has a DOM implementation.

Zachary K