After reading about and trying various approaches to fetching the content of web pages, I have realised that I am unable to get the input stream of secure (https) pages, or have to wait minutes for a single response. These are pages that I can easily access via a browser (no proxies involved).
The different fixes that I tried are the following:
- Setting user agent
- Following redirects
- Using Jsoup
- Catering for encoding
- Using Scanner to parse the stream (sketch after this list)
- Using cookie manager
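For completeness, the Scanner-based attempt was essentially the same fetch with the whole stream read as a single token. This is a reconstructed sketch rather than the exact code; url and userAgent are the same variables used in the snippets further down:

import java.net.URL;
import java.util.Scanner;
import javax.net.ssl.HttpsURLConnection;

// "\\A" as delimiter makes next() return the entire stream in one go.
URL target = new URL(url);
HttpsURLConnection conn = (HttpsURLConnection) target.openConnection();
conn.setRequestProperty("User-Agent", userAgent);
try (Scanner scanner = new Scanner(conn.getInputStream(), "UTF-8").useDelimiter("\\A")) {
    String content = scanner.hasNext() ? scanner.next() : "";
    System.out.println(content.length());
}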
These are the two basic approaches that seem most popular. One uses Jsoup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect(url)
        .userAgent(userAgent)
        .timeout(5000).followRedirects(true).execute().parse();
Elements body = doc.select("body");
System.out.println(body.html());
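Separating execute() from parse(), which would at least expose the HTTP status code, does not get any further, because execute() is where it stalls for https URLs (a sketch of that variant, same url and userAgent as above):

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Connection.Response res = Jsoup.connect(url)
        .userAgent(userAgent)
        .timeout(5000)
        .followRedirects(true)
        .execute();                       // stalls here for https addresses
System.out.println(res.statusCode());     // only reached if execute() ever returns
Document doc = res.parse();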
The other uses vanilla Java:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

URL obj = new URL(url);
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
HttpsURLConnection con = (HttpsURLConnection) obj.openConnection();
con.setRequestMethod("GET");
con.setRequestProperty("User-Agent", userAgent);
con.setConnectTimeout(5000);
con.setReadTimeout(5000);
// getInputStream() is where this version hangs for https addresses (see below)
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuilder response = new StringBuilder();
while ((inputLine = in.readLine()) != null) {
    response.append(inputLine);
}
in.close();
For https addresses such as https://en.wikipedia.org/wiki/Java, execution stalls for a painfully long time: the Jsoup version hangs in the execution phase (so no status code can be retrieved from the response either), and the vanilla version hangs while the input stream is being fetched. Both approaches work perfectly fine for insecure addresses (e.g. http://www.codingpedia.org/) after switching to HttpURLConnection.
Strangely, there is also very high disk usage (> 20 MB/s) while waiting.
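If a TLS handshake trace would help in diagnosing this, the sketch below is roughly how I would capture one. Assumption: the standard javax.net.debug property is the right switch; its accepted values differ slightly between Java versions (e.g. ssl,handshake on Java 8 versus ssl:handshake on newer releases):

import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

// Must be set before any HTTPS connection is opened; alternatively pass
// -Djavax.net.debug=ssl,handshake on the command line.
System.setProperty("javax.net.debug", "ssl,handshake");

URL target = new URL("https://en.wikipedia.org/wiki/Java");
HttpsURLConnection con = (HttpsURLConnection) target.openConnection();
con.setConnectTimeout(5000);
con.setReadTimeout(5000);
System.out.println(con.getResponseCode()); // handshake details are printed to stderr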
All help greatly appreciated!