https://www.reddit.com/r/buildapcsales/top/ takes about 3 seconds to load all of its content. Using jsoup alone, I can only scrape the first 7 threads, because the remaining threads are loaded a few seconds later. I'm trying to make HtmlUnit load the entire page and then use jsoup to scrape all the thread titles.

    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setJavaScriptEnabled(true);
    Page page = webClient.getPage(url.toString());
    WebResponse response = page.getWebResponse();
    String content = response.getContentAsString();

    // webClient.getOptions().setJavaScriptEnabled(true);
    // webClient.getOptions().setThrowExceptionOnScriptError(true);
    // webClient.waitForBackgroundJavaScript(50000);
    // webClient.wait(5000);
    // HtmlPage page = webClient.getPage(url.toString());

I get a flood of errors whenever I call setJavaScriptEnabled(true). If I set it to false there are no errors, but jsoup still only sees the first 7 threads.

    WARNING: Script is not JavaScript (type: 'application/json', language: ''). Skipping execution.
    Feb 09, 2020 4:54:36 PM com.gargoylesoftware.htmlunit.javascript.DefaultJavaScriptErrorListener scriptException
    SEVERE: Error during JavaScript execution
    ======= EXCEPTION START ========
    Exception class=[net.sourceforge.htmlunit.corejs.javascript.EvaluatorException]
    com.gargoylesoftware.htmlunit.ScriptException: syntax error (https://www.redditstatic.com/desktop2x/vendors~Governance~Reddit.791bf381e13bfdc452ab.js#1)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:882)
        at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:624)
        at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:537)
        at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.callSecured(HtmlUnitContextFactory.java:354)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.compile(JavaScriptEngine.java:713)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.compile(JavaScriptEngine.java:679)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.compile(JavaScriptEngine.java:103)
        at com.gargoylesoftware.htmlunit.html.HtmlPage.loadJavaScriptFromUrl(HtmlPage.java:1104)
        at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:984)
        at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:361)
        at com.gargoylesoftware.htmlunit.html.HtmlScript$2.execute(HtmlScript.java:234)
        at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:301)
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:560)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:419)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:336)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:488)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:469)
        at RedditScraper.main(RedditScraper.java:40)

These are the first few of the errors.

rairai
  • I am pretty sure reddit has an API, why don't you try it? Another option would be trying to scrape the json traffic directly instead of generating the dynamic html. – fonkap Feb 15 '20 at 09:57
  • See [this related post](https://stackoverflow.com/q/50189638/8583692). – Mahozad Nov 15 '21 at 13:29
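
The JSON route suggested in the comment above could look roughly like the sketch below. It uses only the JDK (Java 11+), so no HtmlUnit, Selenium, or jsoup is involved. Several things here are assumptions rather than anything from the question: that appending `.json` to the listing URL returns the listing as JSON, that each post appears as an object with a `"title"` string, and that the `limit` query parameter is accepted. The class name `RedditTitles`, the helper names, and the User-Agent string are made up for illustration. The regex extraction is a shortcut that would also pick up any nested `"title"` keys, so a real JSON parser is the safer choice, and the endpoint may be rate limited or require authentication.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RedditTitles {

    // Matches "title": "..." and captures the still-escaped string body.
    private static final Pattern TITLE =
            Pattern.compile("\"title\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Returns every "title" string value found in the JSON text, in order.
    static List<String> extractTitles(String json) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(json);
        while (m.find()) {
            titles.add(unescape(m.group(1)));
        }
        return titles;
    }

    // Decodes the standard JSON string escapes (\" \\ \/ \n \t \r \b \f \uXXXX).
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                out.append(c);
                continue;
            }
            char next = s.charAt(++i);
            switch (next) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'u':
                    if (i + 4 < s.length()) {
                        out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                        i += 4;
                    } else {
                        out.append(next); // truncated escape, keep as-is
                    }
                    break;
                default: out.append(next); // covers \" \\ and \/
            }
        }
        return out.toString();
    }

    // Hypothetical helper: downloads a listing URL and extracts the titles.
    static List<String> fetchTitles(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "java:title-scraper-example:0.1") // made-up value
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return extractTitles(response.body());
    }

    public static void main(String[] args) {
        // Offline demo on a hand-written sample shaped like the assumed response.
        String sample = "{\"data\":{\"children\":[{\"data\":{\"title\":\"[SSD] Example drive - $50\"}}]}}";
        extractTitles(sample).forEach(System.out::println);
    }
}
```

To try it against the live site, something like `RedditTitles.fetchTitles("https://www.reddit.com/r/buildapcsales/top.json?limit=25")` would be the call, with the caveats above about the URL and parameter being assumptions.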

1 Answer

I was having a hard time trying to run JavaScript inside HtmlUnit. Then I tried Selenium, and it worked like a charm.

Sir Beethoven