I'm trying to download the content of a webpage with this code, but it does not retrieve the same thing Firefox shows.

URL url = new URL("https://jumpseller.cl/support/webpayplus/");
InputStream is = url.openStream();
Files.copy(is, Paths.get("/tmp/asdfasdf"), StandardCopyOption.REPLACE_EXISTING);

When I check /tmp/asdfasdf, it is not the HTML source code of the page, just raw bytes (not readable text). Yet in Firefox I can see the webpage and view its source.

How can I get the real webpage?

user4052054
  • I work at Jumpseller.cl. Feel free to email us and we can provide you the full content of the file (considering you will provide adequate credit to us). – tiagomatos Jan 13 '16 at 11:26

2 Answers

You need to examine the response headers. The page is served compressed: the Content-Encoding header has the value gzip, so the raw bytes you saved are the gzipped HTML.

Try this:

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

URL url = new URL("https://jumpseller.cl/support/webpayplus/");
URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();

// Decompress the response before saving if the server sent it gzipped.
if ("gzip".equals(conn.getContentEncoding())) {
    is = new GZIPInputStream(is);
}

Files.copy(is, Paths.get("/tmp/asdfasdf"), StandardCopyOption.REPLACE_EXISTING);
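
If you want to confirm this for yourself, a quick way to inspect the response headers (a minimal sketch reusing the same URL) is to dump conn.getHeaderFields():

import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

URLConnection conn = new URL("https://jumpseller.cl/support/webpayplus/").openConnection();
// Print every response header; look for Content-Encoding and Content-Type.
for (Map.Entry<String, List<String>> header : conn.getHeaderFields().entrySet()) {
    System.out.println(header.getKey() + ": " + header.getValue());
}

The entry with a null key is the status line; the Content-Encoding entry is what the gzip check above relies on.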
VGR

Use the HtmlUnit library and this code:

import java.util.logging.Level;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    // Silence HtmlUnit's verbose logging.
    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setUseInsecureSSL(true);
    HtmlPage page = webClient.getPage("https://jumpseller.cl/support/webpayplus/");
    // Give background JavaScript up to 5 seconds to finish after the page has loaded.
    webClient.waitForBackgroundJavaScript(5 * 1000);
    String stringToSave = page.asXml(); // Full HTML as a string; save it to a file if needed.
    // No explicit close() needed: try-with-resources closes the WebClient.
}
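
If you do want to save that string to a file, a small sketch (placed inside the try block right after page.asXml(); the output path is just an example) could look like this:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Write the rendered page to disk; /tmp/webpayplus.html is only an example path.
Files.write(Paths.get("/tmp/webpayplus.html"), stringToSave.getBytes(StandardCharsets.UTF_8));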
Vika Marquez