I'm trying to crawl a website using HtmlUnit. Whenever I run it, though, it only outputs the following error:

Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot read property "push" from undefined (https://www.kinoheld.de/dist/prod/0.4.7/widget.js#1)

Now I don't know much about JS, but I read that push is some kind of array operation. This seems standard to me and I don't know why it would not be supported by htmlunit.

Here is the code I'm using so far:

public static void main(String[] args) throws IOException {
    WebClient web = new WebClient(BrowserVersion.FIREFOX_45);
    web.getOptions().setUseInsecureSSL(true);
    String url = "https://www.kinoheld.de/kino-muenchen/royal-filmpalast/vorstellung/280823/?mode=widget&showID=280828#panel-seats";
    web.getOptions().setThrowExceptionOnFailingStatusCode(false);
    web.waitForBackgroundJavaScript(9000);
    HtmlPage response = web.getPage(url);

    System.out.println(response.getTitleText());
}

What am I missing? Is there a way around this or a way to fix this? Thanks in advance!

Maverick283
  • If it's not supported, I guess you should ask the developers for a new feature. – Tilak Madichetti Nov 20 '16 at 03:07
  • When does the error occur? After the `web.getPage(url)` or the `response.getTitleText()` call? – Jack Nov 23 '16 at 11:09
  • @Jack The error occurs after the `web.getPage(url)`, as I can comment out the `response.getTitleText()` and it will still be thrown, even when the `web.getOptions().setThrowExceptionOnScriptError(false);` (see answer below) is inserted. – Maverick283 Nov 23 '16 at 13:17
  • @TilakMadichetti Is there a proper place to do this? – Maverick283 Nov 23 '16 at 13:18

2 Answers

Try adding

web.getOptions().setThrowExceptionOnScriptError(false);

before you request the page. This stops HtmlUnit from throwing an exception when a script on the page fails, so loading continues past the error. However, it will not help if the JavaScript that throws the error is needed to produce the data you are scraping (which it hopefully isn't). If that doesn't work, try using Selenium with ChromeDriver or GhostDriver.

Community
GenuinePlaceholder
  • Adding that line doesn't work; it still throws the same error and doesn't get me anywhere... I'll try Selenium later when I have more time ;) – Maverick283 Nov 23 '16 at 13:22
  • But with the line you suggested, the output now says `com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify` before the original exception, and then prints the rest of the stack trace. – Maverick283 Nov 23 '16 at 13:33
  • I really wish I could split the 50 points up. While @Jack's answer did actually solve the question, your suggestion might be more helpful for me in the long run... – Maverick283 Nov 24 '16 at 07:33
  • @Maverick283 No worries, happy to help – GenuinePlaceholder Nov 25 '16 at 00:39

I've encountered a similar problem before. It stems from HtmlUnit being designed as a test harness rather than a web scraping framework, so it reports script errors loudly by default. Are you running the latest version of HtmlUnit?

I was able to run your code by adding the setThrowExceptionOnScriptError(false) line (as mentioned in Coffee Converter's answer) and by adding java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF); at the top of the method to disable the log dump. This produced the following output:

Royal Filmpalast München München | kinoheld.de

Full code is as follows:

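import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// The class name is arbitrary; only the body of main matters.
public class Crawler {
    public static void main(String[] args) throws IOException {
        // Turn off HtmlUnit's logging so script errors are not dumped to the console.
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
        String url = "https://www.kinoheld.de/kino-muenchen/royal-filmpalast/vorstellung/280823/?mode=widget&showID=280828#panel-seats";

        webClient.getOptions().setUseInsecureSSL(true);
        // Keep loading the page even if one of its scripts throws.
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        HtmlPage response = webClient.getPage(url);
        // Wait after getPage: before the page is loaded there are no background scripts to wait for.
        webClient.waitForBackgroundJavaScript(9000);

        System.out.println(response.getTitleText());
    }
}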

This was run on RedHat command line with HTML Unit 2.2.1. Hope this helps.
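
For anyone wondering why the single logger line removes the `IncorrectnessListenerImpl notify` output as well: a minimal sketch, assuming HtmlUnit's logging ends up in java.util.logging (which is what Commons Logging falls back to when no other backend such as Log4j is on the classpath). Loggers are hierarchical, so turning off "com.gargoylesoftware" also turns off every logger below it. The class and method names here are hypothetical and not part of HtmlUnit; only JDK calls are used.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of HtmlUnit: shows that switching off the
// parent logger "com.gargoylesoftware" silences every logger beneath it.
final class HtmlUnitLogSilencer {

    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // logger configured without one may be garbage collected and lose its level.
    private static final Logger HTMLUNIT_ROOT = Logger.getLogger("com.gargoylesoftware");

    // Disable all log output for the HtmlUnit package tree.
    static void silence() {
        HTMLUNIT_ROOT.setLevel(Level.OFF);
    }

    // True if the named logger would publish a message at the given level.
    static boolean wouldLog(String loggerName, Level level) {
        return Logger.getLogger(loggerName).isLoggable(level);
    }

    public static void main(String[] args) {
        String child = "com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl";
        System.out.println("before silence(): " + wouldLog(child, Level.WARNING));
        silence();
        System.out.println("after silence():  " + wouldLog(child, Level.WARNING));
    }
}
```

Holding the logger in a static field is a deliberate choice: the one-liner in the answer usually works, but the JDK documentation warns that a logger without a strong reference can be collected, after which a freshly created logger of the same name goes back to the default level. If a different logging backend is in use, the level has to be set through that backend's configuration instead.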

Jack