I have a page crawler written in Java using the Selenium libraries. The crawler goes through a website that uses JavaScript to launch 3 applications, which are displayed as HTML in popup windows.

The crawler has no issues when launching 2 of the applications, but on the 3rd one the crawler freezes forever.

The code I'm using is similar to this:

public void applicationSelect() {
  ...
  //obtain url by parsing the tag's href attribute
  ...

  this.driver = new HtmlUnitDriver(BrowserVersion.INTERNET_EXPLORER_8);
  this.driver.setJavascriptEnabled(true);
  this.driver.get(url); //the code does not execute after this point for the 3rd app
  ...
}

I have also tried clicking on the web element with the following code:

public void applicationSelect() {
  ...
  WebElement element = this.driver.findElementByLinkText("linkText");
  element.click(); //the code does not execute after this point for the 3rd app
  ...
}

Clicking on it produces exactly the same result. For the above code, I've made sure I am getting the right element.

Can anyone tell me what the problem could be?

I cannot disclose any information about the application's HTML code. I know this makes the problem harder to solve, and I apologize for that in advance.

=== Update 2013-04-10 ===

So, I attached the library sources to my crawler and saw where this.driver.get(url) was getting stuck.

Basically, the driver gets lost in an infinite refresh loop. Within a WebClient object instantiated by HtmlUnitDriver, an HtmlPage is loaded which continually refreshes seemingly without end.
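
A hang like this can also be located without attaching sources, by running the blocking call on a worker thread with a timeout and printing the worker's stack trace when the timeout expires. The following is only a sketch using the standard library: `HangProbe` and `probe` are hypothetical names, and the endless loop in `main` merely stands in for this.driver.get(url).

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

public class HangProbe {
    /**
     * Runs the task on a daemon worker thread. Returns null if the task finishes
     * within timeoutMillis; otherwise returns the worker's stack trace as it was
     * when the timeout expired.
     */
    public static StackTraceElement[] probe(Runnable task, long timeoutMillis) {
        AtomicReference<Thread> worker = new AtomicReference<>();
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hang-probe");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            worker.set(t);
            return t;
        });
        Future<?> future = pool.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return null;
        } catch (TimeoutException e) {
            StackTraceElement[] trace = worker.get().getStackTrace();
            // cancel(true) only interrupts the worker; code that swallows
            // InterruptedException will keep running in the background.
            future.cancel(true);
            return trace;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the blocking call: sleeps forever and ignores interrupts.
        StackTraceElement[] trace = probe(() -> {
            while (true) {
                try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
            }
        }, 200);
        for (StackTraceElement frame : trace) {
            System.out.println("  at " + frame);
        }
    }
}
```

Note that the timeout only unblocks the caller. If the stuck code ignores interrupts, the worker thread keeps running, which is why it is created as a daemon thread here.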

Here is the code from WaitingRefreshHandler, which is contained in com.gargoylesoftware.htmlunit:

public void handleRefresh(final Page page, final URL url, final int requestedWait) throws IOException {
  int seconds = requestedWait;
  if (seconds > maxwait_ && maxwait_ > 0) {
    seconds = maxwait_;
  }
  try {
    Thread.sleep(seconds * 1000);
  }
  catch (final InterruptedException e) {
    /* This can happen when the refresh is happening from a navigation that started
     * from a setTimeout or setInterval. The navigation will cause all threads to get
     * interrupted, including the current thread in this case. It should be safe to
     * ignore it since this is the thread now doing the navigation. Eventually we should
     * refactor to force all navigation to happen back on the main thread.
     */
    if (LOG.isDebugEnabled()) {
      LOG.debug("Waiting thread was interrupted. Ignoring interruption to continue navigation.");
    }
  }
  final WebWindow window = page.getEnclosingWindow();
  if (window == null) {
    return;
  }
  final WebClient client = window.getWebClient();
  client.getPage(window, new WebRequest(url));
}

The instruction "client.getPage(window, new WebRequest(url))" calls WebClient once again to reload the page, which in turn calls this very same refresh method. This seems to go on indefinitely; memory does not fill up quickly only because of the "Thread.sleep(seconds * 1000)", which forces a 3m wait before each retry.
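
To make the shape of the loop easier to see, here is a minimal model of it that uses only the standard library. It is a sketch and not HtmlUnit code: `Client`, `Handler` and `BoundedHandler` are hypothetical stand-ins, the URL is a placeholder, and the sleep is left out. It shows that the loop ends as soon as the handler stops calling back into the client, either immediately (an empty handler) or after a fixed number of reloads.

```java
public class RefreshLoopModel {
    /** Hypothetical stand-in for a refresh handler; not HtmlUnit's interface. */
    public interface Handler {
        void handleRefresh(Client client, String url);
    }

    /** Hypothetical stand-in for a web client whose page always asks to be refreshed. */
    public static class Client {
        int loads;
        Handler handler;

        void getPage(String url) {
            loads++;
            // Every load of this page requests another refresh, as in the question.
            handler.handleRefresh(this, url);
        }
    }

    /** Reloads the page at most maxRefreshes times, then ignores further requests. */
    public static class BoundedHandler implements Handler {
        private final int maxRefreshes;
        private int seen;

        public BoundedHandler(int maxRefreshes) {
            this.maxRefreshes = maxRefreshes;
        }

        @Override
        public void handleRefresh(Client client, String url) {
            if (seen++ >= maxRefreshes) {
                return; // stop calling back into the client: the loop ends here
            }
            client.getPage(url);
        }
    }

    /** Loads the placeholder page once and reports how many loads happened in total. */
    public static int loadsWith(Handler handler) {
        Client client = new Client();
        client.handler = handler;
        client.getPage("http://example.invalid/app3");
        return client.loads;
    }

    public static void main(String[] args) {
        System.out.println(loadsWith((client, url) -> { }));   // empty handler: prints 1
        System.out.println(loadsWith(new BoundedHandler(3)));  // 1 load + 3 refreshes: prints 4
    }
}
```

A handler that always calls getPage again would never return in this model, which mirrors what the driver does with the page of the 3rd application.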

Does anyone have a suggestion for working around this issue? One suggestion I received was to create two new classes that extend HtmlUnitDriver and WebClient, and to override the relevant methods in order to avoid this problem.

Thanks again.

ulaikamor
  • when the website launches popups, how do you select the 3rd "app" which is a popup? – Amey Apr 05 '13 at 16:43
  • I select the popup with "this.driver.switchTo().frame(frameName);". Regardless, this is not the issue since execution gets stuck on "this.driver.get(url);". – ulaikamor Apr 09 '13 at 10:21
  • I have now tried adding the session cookie to the driver, with no success. It still gets stuck on the same instruction. Where I'm at, there are no additional cookies. – ulaikamor Apr 09 '13 at 10:28

1 Answer

I solved my endless refresh problem by creating a do-nothing RefreshHandler class:

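import java.net.URL;
import com.gargoylesoftware.htmlunit.Page;

public class RefreshHandler implements com.gargoylesoftware.htmlunit.RefreshHandler {
  public RefreshHandler() { }
  // Intentionally empty: ignoring the refresh request breaks the reload loop.
  public void handleRefresh(final Page page, final URL url, final int seconds) { }
}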

In addition, I extended the HtmlUnitDriver class and overrode the method modifyWebClient to set the new RefreshHandler:

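import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class HtmlUnitDriverExt extends HtmlUnitDriver {
  public HtmlUnitDriverExt(BrowserVersion version) {
    super(version);
  }
  @Override
  protected WebClient modifyWebClient(WebClient client) {
    // RefreshHandler here is the do-nothing class defined above, in the same package.
    client.setRefreshHandler(new RefreshHandler());
    return client;
  }
}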

The method modifyWebClient is a do-nothing hook that HtmlUnitDriver provides exactly for this purpose.

Cheers.

ulaikamor