In my Java code, I am trying to harvest a web page using the HtmlUnit library. My code is as simple as this:

public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException
{
        // Create the headless browser client
        WebClient webClient = new WebClient();

        // Load the page; this is the call that fails with the exception below
        HtmlPage page = webClient.getPage("https://www.xxxxxxx.com/yyyyyy/");

        System.out.println(page.getTitleText());

        webClient.close();
}

However, when I run the code, it throws the following exception:

Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException]
com.gargoylesoftware.htmlunit.ScriptException: SyntaxError: with statements not allowed in strict mode (https://www.wtatennis.com/resources/v2.1.0/scripts/vendors.min.js#1)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:882)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:624)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:537)
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.callSecured(HtmlUnitContextFactory.java:354)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:762)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:738)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:103)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1004)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:361)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$2.execute(HtmlScript.java:234)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:256)
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.endElement(HtmlUnitNekoDOMBuilder.java:559)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.endElement(HtmlUnitNekoDOMBuilder.java:513)
    at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1192)
    at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1132)
    at net.sourceforge.htmlunit.cyberneko.filters.DefaultFilter.endElement(DefaultFilter.java:219)
    at net.sourceforge.htmlunit.cyberneko.filters.NamespaceBinder.endElement(NamespaceBinder.java:312)
    at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3185)
    at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2110)
    at net.sourceforge.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:937)
    at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:443)
    at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:394)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:758)
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:236)
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parseHtml(HtmlUnitNekoHtmlParser.java:179)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:280)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:163)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:553)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:419)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:336)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:488)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:469)
    at htmlunit.WTAHarvester.main(WTAHarvester.java:27)
Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: SyntaxError: with statements not allowed in strict mode (https://www.wtatennis.com/resources/v2.1.0/scripts/vendors.min.js#1)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1215)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:1009)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:111)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:427)
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:340)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3607)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:123)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$2.doRun(JavaScriptEngine.java:753)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:867)
    ... 34 more
JavaScriptException value = SyntaxError: with statements not allowed in strict mode
======= EXCEPTION END ========

Exception in thread "main" ======= EXCEPTION START ========
Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException]
com.gargoylesoftware.htmlunit.ScriptException: SyntaxError: with statements not allowed in strict mode (https://www.wtatennis.com/resources/v2.1.0/scripts/vendors.min.js#1)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:882)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:624)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:537)
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.callSecured(HtmlUnitContextFactory.java:354)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:762)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:738)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:103)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1004)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:361)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$2.execute(HtmlScript.java:234)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:256)
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.endElement(HtmlUnitNekoDOMBuilder.java:559)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.endElement(HtmlUnitNekoDOMBuilder.java:513)
    at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1192)
    at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1132)
    at net.sourceforge.htmlunit.cyberneko.filters.DefaultFilter.endElement(DefaultFilter.java:219)
    at net.sourceforge.htmlunit.cyberneko.filters.NamespaceBinder.endElement(NamespaceBinder.java:312)
    at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3185)
    at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2110)
    at net.sourceforge.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:937)
    at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:443)
    at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:394)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:758)
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:236)
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parseHtml(HtmlUnitNekoHtmlParser.java:179)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:280)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:163)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:553)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:419)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:336)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:488)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:469)
    at htmlunit.WTAHarvester.main(WTAHarvester.java:27)
Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: SyntaxError: with statements not allowed in strict mode (https://www.wtatennis.com/resources/v2.1.0/scripts/vendors.min.js#1)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1215)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:1009)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:111)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:427)
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:340)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3607)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:123)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$2.doRun(JavaScriptEngine.java:753)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:867)
    ... 34 more
JavaScriptException value = SyntaxError: with statements not allowed in strict mode
======= EXCEPTION END ========
Traveling Salesman

1 Answer

The problem comes from this file:

https://www.wtatennis.com/resources/v2.1.0/scripts/vendors.min.js#1

That file contains several minified libraries concatenated together. One of them is Underscore.js, which uses a with statement, as you can see in its source code.

However, the file it is bundled into (the link above) also contains a "use strict"; directive. In strict mode, the JavaScript engine rejects constructs that are considered unsafe, and the with statement is one of them: it is reported as a SyntaxError. Other people have run into this problem before, and it is fixable for anyone who can change the scripts.

That being said, I don't see the error when I go to the homepage of that website. But even if I did, I assume you have no control over the JS that runs on this page. I don't know Java or the WebClient class you're using, but maybe you don't need to execute the page's JS at all and can disable scripts?

webClient.getOptions().setJavaScriptEnabled(false);
blex
  • Thanks. I can indeed disable it, and the errors go away in that case, but I need JavaScript to load some HTML elements that I want to harvest, so I have to keep JavaScript enabled. I don't have control over the JS because it is not my website. So are you saying there is no solution? – Traveling Salesman Feb 28 '20 at 08:02
  • @TravelingSalesman I don't see a _clean_ way to do it unless you have control over the original website. But again, I don't know Java or this WebClient class. If I had to do it myself using other tools, I would probably try to intercept the request for that file and alter the response to remove the `"use strict";` directive. Not very clean, but it could get the job done. Maybe [ExchangeFilterFunction](https://stackoverflow.com/a/51728374/1913729) could help with this? **Edit** a comment under that answer suggests that it does not allow accessing the body. Might be a dead end... – blex Feb 28 '20 at 08:44
  • Even dirtier suggestion: instead of altering the body, an alternative could be to alter the headers, by setting the response status to `302 Found` (a redirect) and the `Location` header to a URL where you host your own, altered version of the script. – blex Feb 28 '20 at 08:57
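
For what it's worth, the body-rewriting step suggested in the comments above (removing the `"use strict";` directive before the script reaches the JavaScript engine) could be sketched in plain Java roughly as follows. This is only a sketch under assumptions: the names `UseStrictStripper` and `strip` are made up for illustration, the directive is assumed to be written literally as `"use strict"` or `'use strict'`, and hooking the method into HtmlUnit's response handling is not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit: strips strict-mode directives
// from a script's source text before it is handed to a JavaScript engine.
final class UseStrictStripper {

    // Matches "use strict" or 'use strict' (same quote on both sides),
    // followed by optional whitespace and an optional semicolon.
    private static final Pattern USE_STRICT =
            Pattern.compile("([\"'])use strict\\1\\s*;?");

    // Returns the script source with every matching directive removed.
    static String strip(String scriptSource) {
        return USE_STRICT.matcher(scriptSource).replaceAll("");
    }

    public static void main(String[] args) {
        String original = "\"use strict\";with(obj||{}){x=1}";
        System.out.println(strip(original)); // prints: with(obj||{}){x=1}
    }
}
```

Note that this removes every occurrence, including directives inside individual functions, and it would also remove a string literal that happens to read `use strict` but is used as data. Dropping strict mode can also change how the rest of the script behaves, so it remains the "not very clean" workaround described above.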