How would I go about downloading and executing (i.e. evaluating the JavaScript and building the DOM) in excess of 1000 XHTML documents per minute?
Some outlines/constraints:
- URLs to be downloaded are on different servers.
- I need to traverse - and ideally modify - the resulting DOM.
- No interest in rendering the graphics.
- Bandwidth is not an issue.
- Overly massive hardware parallelization would be more of a problem.
- Production environment is .NET.
I am not so concerned about downloading the pages; I expect that actually executing each page (running its JavaScript and building the DOM) will be the bottleneck. .NET has a built-in WebBrowser control, but I have no idea whether it scales up on a single machine. Also, .NET is not an absolute requirement, but it would make integration around here easier.
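To illustrate what I have in mind with the built-in control, here is a rough sketch (untested at any scale; the URL is just a placeholder, and each worker needs its own STA thread and message pump):

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

class HeadlessFetch
{
    static void Main()
    {
        // Placeholder URL - in practice these would come from the download queue.
        var url = "http://example.com/page.xhtml";

        var worker = new Thread(() =>
        {
            var browser = new WebBrowser { ScriptErrorsSuppressed = true };
            browser.DocumentCompleted += (sender, args) =>
            {
                // By this point the page's scripts have (mostly) run and the DOM is reachable.
                if (browser.Document != null)
                    Console.WriteLine(browser.Document.Title);
                Application.ExitThread();   // stop this worker's message loop
            };
            browser.Navigate(url);
            Application.Run();              // WebBrowser needs a message pump to fire events
        });

        worker.SetApartmentState(ApartmentState.STA); // WebBrowser is an STA-only COM component
        worker.Start();
        worker.Join();
    }
}
```

Whether dozens of these can run concurrently in one process without falling over is exactly what I don't know.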
I'd be grateful for any comments/pointers regarding:
- Which browser API is most suited to do this?
- Is a browser the right way to go? Maybe there's a more lightweight way to execute the JavaScript, which is the most important part, even if it does not provide a DOM (see the sketch after this list for the kind of thing I mean).
- What existing products/services - be they open source or commercial - may accomplish the task?
- Roughly how many pages per minute can I expect to handle on a single machine (the 3 ms rendering from the Chrome commercial, anyone)?
- Any pitfalls one is likely to encounter...
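By "lightweight" I mean something like an embeddable JavaScript engine driven directly from .NET. A purely illustrative sketch, using Jint as an example engine (I haven't verified how much real-world page script would actually run this way without a DOM):

```csharp
using System;
using Jint;

class ScriptOnly
{
    static void Main()
    {
        // Stand-alone JavaScript engine: no browser, no rendering, no DOM.
        var engine = new Engine();

        // Expose a .NET callback to the script, e.g. for collecting results.
        engine.SetValue("log", new Action<object>(o => Console.WriteLine(o)));

        // Run script that does not need to touch the DOM.
        engine.Execute(@"
            var total = 0;
            for (var i = 1; i <= 10; i++) total += i;
            log('total = ' + total);
        ");

        // Values can also be read back out of the script context.
        Console.WriteLine(engine.GetValue("total"));
    }
}
```

The open question for me is whether an approach like this can cope with pages whose scripts expect document/window to exist.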
Thank you in advance,
/David