After trying various approaches to reading website content from Java, I have found that I either cannot get the input stream of secure (HTTPS) pages at all, or have to wait minutes for a single response. These are pages that I can open without trouble in a browser (no proxies involved).

The fixes I have tried so far:

  • Setting user agent
  • Following redirects
  • Using Jsoup
  • Catering for encoding
  • Using Scanner to parse the stream
  • Using cookie manager

The two most popular approaches seem to be the following. The first uses Jsoup:

Document doc = Jsoup.connect(url)
                     .userAgent(userAgent)
                     .timeout(5000).followRedirects(true).execute().parse();
Elements body = doc.select("body");
System.out.println(body.html());

The second uses plain Java:

URL obj = new URL(url);
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
HttpsURLConnection con = (HttpsURLConnection) obj.openConnection();

con.setRequestMethod("GET");
con.setRequestProperty("User-Agent", userAgent);
con.setConnectTimeout(5000);
con.setReadTimeout(5000);

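StringBuilder response = new StringBuilder();
// try-with-resources closes the reader even if reading fails
try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        // readLine() strips the line terminator, so add it back
        response.append(inputLine).append('\n');
    }
}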

With HTTPS addresses such as https://en.wikipedia.org/wiki/Java, execution becomes painfully slow. In the Jsoup version it stalls during the execute() step (so no status code can be retrieved from the response either), and in the plain Java version it stalls when the input stream is fetched. Both approaches work perfectly fine for insecure addresses (e.g. http://www.codingpedia.org/), after switching to HttpURLConnection in the second one.

Weirdly, there is also very high disk usage while waiting (> 20 MB/s).
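
To narrow down which phase of the request is slow, a minimal timing sketch along the following lines might help. The class name PhaseTimer and its timed() helper are hypothetical, made up for illustration; the URL and the 5000 ms timeouts are the ones from the snippets above. On common JDK implementations connect() is expected to cover the TCP connect and the TLS handshake, but that is an assumption about the implementation, not a guarantee.

```java
import java.io.InputStream;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import javax.net.ssl.HttpsURLConnection;

public class PhaseTimer {
    // Elapsed milliseconds per phase, in the order the phases ran
    private final Map<String, Long> timings = new LinkedHashMap<>();

    // Runs one step and records how long it took, even if it throws
    public <T> T timed(String label, Callable<T> step) throws Exception {
        long start = System.nanoTime();
        try {
            return step.call();
        } finally {
            timings.put(label, (System.nanoTime() - start) / 1_000_000);
        }
    }

    public Map<String, Long> timings() {
        return timings;
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "https://en.wikipedia.org/wiki/Java";
        PhaseTimer timer = new PhaseTimer();
        try {
            HttpsURLConnection con = timer.timed("open",
                    () -> (HttpsURLConnection) new URL(url).openConnection());
            con.setConnectTimeout(5000);
            con.setReadTimeout(5000);

            // Assumption: connect() includes the TLS handshake on this JDK
            timer.timed("connect", () -> { con.connect(); return null; });
            Integer status = timer.timed("status", con::getResponseCode);
            Long bytes = timer.timed("read body", () -> {
                try (InputStream in = con.getInputStream()) {
                    byte[] buf = new byte[8192];
                    long total = 0;
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        total += n;
                    }
                    return total;
                }
            });
            System.out.println("status " + status + ", " + bytes + " bytes");
        } catch (Exception e) {
            // A timeout or handshake failure still leaves the timings so far
            System.out.println("request failed: " + e);
        } finally {
            timer.timings().forEach((phase, ms) ->
                    System.out.println(phase + ": " + ms + " ms"));
        }
    }
}
```

If most of the time shows up under "connect", the delay is probably in connection setup rather than in reading the body. Running the JVM with -Djavax.net.debug=ssl:handshake should additionally print the individual handshake steps.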

All help greatly appreciated!

  • You may have run into this issue: https://stackoverflow.com/questions/4819359/dev-random-extremely-slow – Henry May 31 '17 at 11:10
  • Some more info: https://docs.oracle.com/cd/E13209_01/wlcp/wlss30/configwlss/jvmrand.html – Henry May 31 '17 at 11:13
  • Even though I have not yet figured out the exact reason, the issue seems to be with my setup. Running the same code on a different machine works perfectly. For clarity: I am running Java 8 on a Windows 10 machine. The other machine is running Windows 8 (though I doubt that's the issue). – gergocsegzi May 31 '17 at 11:24
  • This is still consistent with the random number issue. If there is a lot of other activity on the machine it will have enough entropy in the random number pool. – Henry May 31 '17 at 11:27
  • Just tried the command on the Oracle page for generating random numbers (with /dev/random) and it returns immediately, so, unfortunately, that doesn't seem to be the issue. – gergocsegzi May 31 '17 at 11:34
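
The random number theory from the comments above can also be checked from inside the JVM, since the TLS handshake draws its random bytes through java.security.SecureRandom. The following is only a rough sketch: the class name EntropyCheck and the method timeSecureRandomMillis are hypothetical, and a fast result does not rule out every entropy-related cause.

```java
import java.security.SecureRandom;

public class EntropyCheck {
    // Milliseconds needed to create a SecureRandom and draw numBytes from it
    public static long timeSecureRandomMillis(int numBytes) {
        long start = System.nanoTime();
        SecureRandom random = new SecureRandom();
        byte[] buffer = new byte[numBytes];
        random.nextBytes(buffer); // the first call triggers any lazy seeding
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Shows which generator the JVM picked by default on this machine
        System.out.println("algorithm: " + new SecureRandom().getAlgorithm());
        System.out.println("SecureRandom took " + timeSecureRandomMillis(32) + " ms");
    }
}
```

If this consistently takes seconds rather than milliseconds on the slow machine, the random number generator would be a plausible culprit; if it returns almost instantly, the delay more likely lies elsewhere.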

0 Answers