I am writing a simple HTTPS client that pulls down the HTML of a web page. I can connect to the page fine, but the HTML I get back is gibberish.

public String GetWebPageHTTPS(String URI){
    BufferedReader read;
    URL inputURI;
    String line;
    String renderedPage = "";
    try{
        inputURI = new URL(URI);
        HttpsURLConnection connect;
        connect = (HttpsURLConnection)inputURI.openConnection();
        connect.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401");
        read = new BufferedReader (new InputStreamReader(connect.getInputStream()));
        while ((line = read.readLine()) != null)
            renderedPage += line;
        read.close();
    }
    catch (MalformedURLException e){
        e.printStackTrace();
    }
    catch (IOException e){
        e.printStackTrace();
    }
    return renderedPage;
}

When I pass it a URL such as https://kat.ph/, it returns around 10,000 characters of gibberish.

EDIT: Here is my modified code that accepts self-signed certificates; however, I'm still getting what looks like an encrypted stream:

public String GetWebPageHTTPS(String URI){
    TrustManager[] trustAllCerts = new TrustManager[] {
        new X509TrustManager() {
            public java.security.cert.X509Certificate[] getAcceptedIssuers() {
                return null;
            }
            public void checkClientTrusted(
                    java.security.cert.X509Certificate[] certs, String authType) {
            }
            public void checkServerTrusted(
                    java.security.cert.X509Certificate[] certs, String authType) {
            }
        }
    };
    try {
        SSLContext sc = SSLContext.getInstance("SSL");
        sc.init(null, trustAllCerts, new java.security.SecureRandom());
        HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
    } catch (GeneralSecurityException e) {
    }
    try {
        System.out.println("URI: " + URI);
        URL url = new URL(URI);
    } catch (MalformedURLException e) {
    }
    BufferedReader read;
    URL inputURI;
    String line;
    String renderedPage = "";
    try{
        inputURI = new URL(URI);
        HttpsURLConnection connect;
        connect = (HttpsURLConnection)inputURI.openConnection();
        read = new BufferedReader (new InputStreamReader(connect.getInputStream()));
        while ((line = read.readLine()) != null)
            renderedPage += line;
        read.close();
    }
    catch (MalformedURLException e){
        e.printStackTrace();
    }
    catch (IOException e){
        e.printStackTrace();
    }
    return renderedPage;
}
JaminB

2 Answers

"is it compressed by any chance? stackoverflow.com/questions/8249522/…" – Mahesh Guruswamy

Yes, it turns out the response was just gzip-compressed. Here is my workaround:

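import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

public String GetWebPageGzipHTTP(String URI) {
    StringBuilder html = new StringBuilder();
    try {
        URLConnection connect = new URL(URI).openConnection();
        connect.setReadTimeout(10000);
        connect.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401");
        // Decompress only when the server reports a gzip-encoded body.
        boolean gzipped = "gzip".equalsIgnoreCase(connect.getContentEncoding());
        InputStream body = connect.getInputStream();
        if (gzipped) {
            body = new GZIPInputStream(body);
        }
        // try-with-resources closes the reader even if reading fails part-way.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(body))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                html.append(inputLine);
            }
        }
    } catch (IOException e) {
        // As before, return whatever was read so far, but report the failure.
        e.printStackTrace();
    }
    return html.toString();
}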
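The workaround above trusts the Content-Encoding header. As a rough sketch of an alternative, assuming you would rather inspect the body itself, you can peek at the first two bytes and look for the gzip magic number (0x1f 0x8b). The class and method names here (GzipSniffDemo, maybeGunzip, readAll, gzip) are made up for illustration, and the demo uses in-memory bytes instead of a real connection:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSniffDemo {

    /** Returns a decompressing stream if the data starts with the gzip magic bytes, else the data unchanged. */
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] magic = new byte[2];
        int n = 0;
        // Read up to two bytes; a single read() call may return fewer than requested.
        while (n < 2) {
            int r = in.read(magic, n, 2 - n);
            if (r < 0) break;
            n += r;
        }
        // Push the peeked bytes back so the caller sees the full stream.
        if (n > 0) in.unread(magic, 0, n);
        boolean gzip = n == 2 && (magic[0] & 0xff) == 0x1f && (magic[1] & 0xff) == 0x8b;
        return gzip ? new GZIPInputStream(in) : in;
    }

    /** Reads the whole stream and decodes it as UTF-8 (an assumption; real pages may use another charset). */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int r;
        while ((r = in.read(buf)) != -1) {
            out.write(buf, 0, r);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Helper for the demo only: gzip-compresses a string in memory. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body>hello</body></html>";
        // Compressed input is decompressed; plain input passes through untouched.
        System.out.println(readAll(maybeGunzip(new ByteArrayInputStream(gzip(page)))));
        System.out.println(readAll(maybeGunzip(new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8)))));
    }
}
```

With a real connection you would pass connect.getInputStream() to maybeGunzip. Reading the compressed bytes directly through an InputStreamReader is what produces the "gibberish" in the question.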

JaminB

HTTPS always presents a Certificate and the further communication happens on a secure encrypted channel. That is why what you are receiving looks like gibberish.

For any signed certificates, HttpsURLConnection will do the work for you and everything works. Things become muddy when the Certificate is not signed by a certificate authority. In such instances if you open that URL from a browser, it will present the Certificate for you to examine and accept before continuing.

It looks like you have a similar issue here. What you need to do is tell Java to accept self-signed certificates without complaining. You have two options: either download the certificate (open the URL in any browser and it will show you how) and add it to the keystore in your JVM, or create your own TrustManager and disable certificate validation.

See this SO answer for details of both these options. https://stackoverflow.com/a/2893932/2385178
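For the first option, here is a minimal sketch, assuming you have already saved the server's certificate to a file. It builds an SSLContext that trusts only the certificates you hand it, using an in-memory keystore instead of editing the JVM's own. The class and method names (PinnedCertContext, contextTrusting, loadCertificate) are made up for illustration:

```java
import java.io.InputStream;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class PinnedCertContext {

    /** Parses one X.509 certificate (PEM or DER) from the given stream. */
    static Certificate loadCertificate(InputStream in) throws Exception {
        return CertificateFactory.getInstance("X.509").generateCertificate(in);
    }

    /** Builds an SSLContext whose trust store contains only the given certificates. */
    static SSLContext contextTrusting(Certificate... trusted) throws Exception {
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        ks.load(null, null); // start from an empty in-memory keystore
        for (int i = 0; i < trusted.length; i++) {
            ks.setCertificateEntry("trusted-" + i, trusted[i]);
        }
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(ks);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null);
        return ctx;
    }
}
```

You would then load the saved file with loadCertificate (for example from a FileInputStream on a hypothetical server.crt) and call connect.setSSLSocketFactory(ctx.getSocketFactory()) on the HttpsURLConnection before reading from it. Unlike a trust-all TrustManager, this still rejects servers presenting any other certificate.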

Raza
  • Thank you. I used this method in the edit above with no luck. I noticed that KAT.ph requires a cookie to be downloaded, and I am not accepting any cookies in my client. Is there any chance this could play a role? – JaminB May 17 '13 at 16:12
  • Ok I'm fairly sure this is an encoding problem. – JaminB May 17 '13 at 16:31
  • Sorry I never worked with Cookies, can't give you an answer from my experience. Here is an SO answer for you http://stackoverflow.com/a/8280340/2385178 related to Cookie handling. Seems like this is what you are looking for. Make sure you go through the example and documentation mentioned in this answer. – Raza May 20 '13 at 09:05
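
On the cookie question raised in the comments, a minimal sketch, assuming the goal is simply to have URLConnection store and resend cookies: the standard java.net.CookieManager can be installed as the JVM-wide default handler. The class and method names (CookieSetup, enableCookies) are made up for illustration, and ACCEPT_ALL is chosen only to keep the example short:

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

public class CookieSetup {

    /** Installs an in-memory cookie manager as the default handler for HTTP(S) connections. */
    static CookieManager enableCookies() {
        // null store = use the built-in in-memory cookie store.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    public static void main(String[] args) {
        CookieManager manager = enableCookies();
        // Nothing has been fetched yet, so no cookies are stored.
        System.out.println("Stored cookies: " + manager.getCookieStore().getCookies().size());
    }
}
```

Call enableCookies() once before opening any connection. Cookies would not explain the gibberish itself, which the gzip answer accounts for, but some sites do behave differently without them.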