I have been doing web scraping for a few months now and always get stuck on pages that load their data using JavaScript.
I have had some success on such pages with HtmlUnit, but sometimes it throws unusual exceptions and ends up not loading the page at all. Overall, HtmlUnit has been hit and miss for me.
Is there a reliable way to scrape such pages?
Admittedly, I haven't dug deep into HtmlUnit yet. What would you suggest? Should I stick with HtmlUnit, or are there other good libraries that handle JavaScript processing?

Just for the record I am using Java as my primary language.

haedes
  • You could use something like [phantom.js](http://phantomjs.org/) to reconstruct the actual page and then use this for your scraping. – Sirko Jun 06 '13 at 08:52
  • Hope the following links help.. http://stackoverflow.com/questions/5561950/how-to-scrape-https-javascript-web-pages http://stackoverflow.com/questions/260540/how-do-you-screen-scrape-ajax-pages http://stackoverflow.com/questions/16762127/scraping-data-from-website-that-uses-javascript – Anand Shah Jun 06 '13 at 08:56
  • You have to use Selenium and a browser driver. The browser would run in headless mode and render the page as if it were a real browser – ACV Feb 02 '21 at 22:29
  • I guess this question was asked before Selenium even existed :) – ACV Feb 02 '21 at 22:30

1 Answer

I've been web scraping with HtmlUnit for 2-3 years now. A few configuration settings may help you handle loading problems:

// Run AJAX calls made from the main thread synchronously
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
// Edit some JS prior to execution
webClient.setScriptPreProcessor(new ScriptPreProcessor() { ... } );
// Avoid throwing errors on JS execution
// (in HtmlUnit 2.11 and later these two setters live on webClient.getOptions())
webClient.getOptions().setThrowExceptionOnScriptError(false);
// Avoid throwing errors because of wrong response codes
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
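
Beyond these settings, pages that seem not to load are sometimes just read too early, before asynchronous scripts have filled in the data. That is an assumption about the failures described in the question, not something confirmed there. A minimal sketch of a polling helper that retries a check until it succeeds or a timeout expires; `PollUntil` and `poll` are hypothetical names and not part of HtmlUnit:

```java
import java.util.function.Supplier;

/** Hypothetical helper (not an HtmlUnit class): retry a probe until it yields a value. */
final class PollUntil {
    private PollUntil() {}

    /**
     * Calls probe repeatedly until it returns a non-null value or
     * timeoutMillis has elapsed. Returns that value, or null on timeout
     * or if the waiting thread is interrupted.
     */
    static <T> T poll(Supplier<T> probe, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            T value = probe.get();
            if (value != null) {
                return value; // the content we were waiting for has appeared
            }
            if (System.nanoTime() - deadline >= 0) {
                return null; // timed out without seeing the content
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return null;
            }
        }
    }
}
```

With HtmlUnit, the probe could be something like `() -> page.getElementById("results")`, where `"results"` is a made-up element id. HtmlUnit also offers `webClient.waitForBackgroundJavaScript(timeoutMillis)`, which may be enough on its own for simpler pages.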
brnfd
  • Well, I have been using these configurations myself. They have worked well to a certain extent, but occasionally they have let me down on some sites. Anyway, thanks! – haedes Jun 07 '13 at 04:44
  • Sorry to hear that. Maybe we can find a solution for these situations. – brnfd Jun 07 '13 at 17:28