import java.io.IOException;
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.ThreadedRefreshHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ReadHtml {
    public static void main(String[] args) throws Exception {
        // Silence HtmlUnit's logging.
        Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);

        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(true);
        webClient.getOptions().setAppletEnabled(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setDoNotTrackEnabled(true);
        webClient.getOptions().setGeolocationEnabled(false);
        webClient.getOptions().setPopupBlockerEnabled(false);
        webClient.getOptions().setPrintContentOnFailingStatusCode(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(true);
        webClient.getOptions().setThrowExceptionOnScriptError(true);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        webClient.setRefreshHandler(new ThreadedRefreshHandler());
        webClient.getCookieManager().setCookiesEnabled(true);

        WebRequest request = new WebRequest(new URL("some url containing javascript to load html elements"));
        try {
            Page page = webClient.getPage(request);
            // Raw response body, before any JavaScript has modified the DOM:
            // System.out.println(page.getWebResponse().getContentAsString());
            // Current DOM serialized as XML:
            System.out.println(((HtmlPage) page).asXml());
        } catch (FailingHttpStatusCodeException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

I want to print the complete HTML of a page (not only the original source), including elements produced by JavaScript and the contents of iframes and nested iframes. I tried the code above, but the HTML loaded by JavaScript is not printed to the console. I also tried selecting elements by id and by name, but I would rather not print anything specific; I want the entire HTML content. Can someone point out what needs to be changed? Thanks in advance.

RDD
  • The title and the details of the question do not match. Do you want only the script files, or the final DOM? If you only want to look at the requests and responses, you could also use a tool like Fiddler. – Paddy Jul 08 '14 at 15:25
  • Thanks @Paddy, and sorry about the title. Yes, I want the final DOM. – RDD Jul 09 '14 at 03:32

2 Answers

2

I found a partial solution for my task (not exactly what I want):

List<WebWindow> windows = webClient.getWebWindows();
for(WebWindow w : windows){
        HtmlPage hpage2 = (HtmlPage) w.getEnclosedPage();
        System.out.println("-------------------------------------");
        System.out.println(hpage2.asXml());
}

This way I was able to get the contents of all iframes and nested iframes, although as separate pages rather than as one continuous page.

When I know the iframe's name, I can extract its contents with:

HtmlPage hpage = (HtmlPage)webClient.getWebWindowByName("google_esf").getEnclosedPage();

For now this solves my problem. It would still be better if someone could suggest how to get everything as one continuous page.

RDD
0

Try using page.asXml().

HtmlPage itself is a DOM node, so you can iterate through its children recursively. The frames may be accessed (recursively) via the DOM or via page.getFrames().
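The recursive iteration mentioned above can be sketched with the JDK's own org.w3c.dom API, which HtmlUnit's DOM nodes also implement as far as I know. This is only a minimal illustration: the class DomWalk and its methods parse and outline are hypothetical names, and the input is a small XML string rather than a real HtmlPage. Note that a plain child walk like this does not by itself descend into the separate page enclosed by a frame.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomWalk {
    // Parse a well-formed XML string into a DOM document (stand-in for a loaded page).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Return one line per element, indented by two spaces per nesting level.
    static String outline(Node root) {
        StringBuilder out = new StringBuilder();
        walk(root, 0, out);
        return out.toString();
    }

    private static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(node.getNodeName()).append('\n');
            depth++;
        }
        // Visit every child, whatever its type; only elements are printed.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(parse("<html><body><iframe/><div><p/></div></body></html>")));
    }
}
```

With HtmlUnit, the same walk could presumably be started from the HtmlPage, and repeated for each frame's enclosed page to cover nested iframes.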

If you need to print all the responses from the server, you can use WebConnectionWrapper as an interceptor. This gives you access to all the responses (including the script ones).


July 9

Frames are part of the DOM. But if some of the content is loaded asynchronously (Ajax), HtmlUnit might not have waited for it to load. Try adding an AjaxController to your WebClient. Here is an example.

For WebConnectionWrapper, use this example. But again, if there is some asynchronous processing, HtmlUnit may exit before all the processing is done. So an AjaxController might be your best bet.

// "browser" is the WebClient instance.
browser.setWebConnection(new WebConnectionWrapper(browser) {
    @Override
    public WebResponse getResponse(final WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        // Process the response here, e.g. log response.getContentAsString().
        return response;
    }
});
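Since HtmlUnit may return before asynchronous processing has finished, one generic workaround is to poll the serialized page until it stops changing. The sketch below is not part of HtmlUnit: StableSnapshot and waitUntilStable are hypothetical names, and the snapshot is supplied by the caller (for example () -> page.asXml()). HtmlUnit's own WebClient.waitForBackgroundJavaScript(timeoutMillis) is probably the first thing to try; this helper only illustrates the idea.

```java
import java.util.function.Supplier;

class StableSnapshot {
    // Poll the snapshot until it has been unchanged for stablePolls consecutive
    // polls, or until timeoutMillis has elapsed. Returns the last snapshot seen.
    static String waitUntilStable(Supplier<String> snapshot, int stablePolls,
                                  long pollMillis, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        String last = snapshot.get();
        int unchanged = 0;
        while (unchanged < stablePolls && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                return last;
            }
            String current = snapshot.get();
            if (current.equals(last)) {
                unchanged++;
            } else {
                unchanged = 0;
                last = current;
            }
        }
        return last;
    }

    public static void main(String[] args) {
        // Simulated page whose content changes on the first few reads, then settles.
        int[] reads = {0};
        Supplier<String> fakePage = () -> "<html>v" + Math.min(++reads[0], 3) + "</html>";
        System.out.println(waitUntilStable(fakePage, 2, 10, 5000));
    }
}
```

The trade-off is that a page which keeps updating (timers, animations) never becomes stable, so the timeout decides when to give up.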

July 10

NicelyResynchronizingAjaxController works for user-initiated Ajax calls. For "self-loading" ones, try something like this:

public class AlwaysSynchronizingAjaxController extends NicelyResynchronizingAjaxController {
    // Treat every Ajax request as synchronous, not only user-initiated ones.
    @Override
    public boolean processSynchron(HtmlPage page, WebRequest settings, boolean async) {
        return true;
    }
}

If you are using Fiddler (or Wireshark, or any other sniffing/interception tool), check whether the dynamically loaded requests show up in the captured traffic.

Community
Paddy
  • Yes, as you can see in my code, I used getContentAsString() and page.asXml(). I found that page.asXml() can go inside the first iframe, but it does not load the scripts inside that iframe (multilevel content loading). Is this an issue with the same-origin policy? I would also like more information on using WebConnectionWrapper as an interceptor. Please help. Thank you. – RDD Jul 09 '14 at 03:36
  • The solution you mentioned with NicelyResynchronizingAjaxController() didn't work out; I had already tried it. – RDD Jul 10 '14 at 04:23