Reading HTML from URL in Java vs. Python

Question

I'm trying to read the HTML from a particular URL and store it in a String for parsing. I followed a previous post for guidance, but when I print what was read, all I get is special characters.

Here is my Java code (with try/catches left out) that reads from a URL and prints:

String path = "https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp";
URL url = new URL(path);
InputStream in = url.openStream();

BufferedReader bw = new BufferedReader(new InputStreamReader(in, "UTF-8"));

String line;            
while ((line = bw.readLine()) != null) {
    System.out.println(line);
}

Program output:

�ĘY106-6b1bd15200.jsonpmP�r� �Ƨ�!�%m�vD"��Ra*��w�%����ݳ�sβ��MK�d�9+%�m��l^��މ����:����  ���8B�Vce�.A*��x$FCo���a�b�<����Xy��m�c�>t����� �Z������Gx�o�   �J���oKe�0�5�kGYpb�*l����+|�U���-�N3��jBp�R�z5Cۥjh��o�;�~)����~��)~ɮhy��<c,=;tHW���'�c�=~�w���

Expected output:

window.page106_callback(["<div class=\"newpage\" id=\"page106\" style=\"width: 902px; height:1273px\">\n<div class=image_layer style=\"z-index: 1\">\n<div class=ie_fix>\n<img class=\"absimg\" style=\"left:18px;top:27px;width:860px;height:1077px;clip:rect(1px 859px 1076px 1px)\" orig=\"http://html.scribd.com/913q5pjrsw60h9i4/images/106-6b1bd15200.jpg\"/>\n</div>\n</div>\n</div>\n\n"]);

At first, I thought it was an issue with permissions or something that somehow encrypted the stream, but my friend wrote a small Python script to do the same thing and it worked, thereby ruling this out. This is what he wrote:

import requests

link = 'https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp'
f = requests.get(link)
text = (f.text)
print(text)

So the question is, why is the Java version unable to correctly read and print from this particular URL? Note that I tried testing some other URLs from various websites and those worked fine. Maybe I should learn Python.

Whatever the language, it's javascript that's returned, not HTML. — Maurice Perry, Feb 26 '19 at 06:03
I guess I'm assuming that the InputStream contains the HTML of the page at the URL given. Whether or not the stream contains JavaScript or JavaScript embedded in HTML makes no difference to me. — Wishcle, Feb 26 '19 at 06:12

score 1 · Accepted Answer · answered Feb 26 '19 at 06:28

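The response is gzip-encoded, and `URL.openStream()` hands you the raw compressed bytes. Wrap the stream in a `GZIPInputStream` to decompress it:

        InputStream in = new GZIPInputStream(url.openStream());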

Maurice Perry

Thanks, worked like a charm! Is this because of what you mentioned in the comment above about it being JavaScript that's returned? Or a decision by the site to encode their stream? – Wishcle Feb 26 '19 at 14:27
@Wishcle there are two headers in the response: `Content-Type: application/x-javascript` and `Content-Encoding: gzip`. The gzip encoding could be the default in the http engine (nginx). – Maurice Perry Feb 26 '19 at 14:38
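
To illustrate the `Content-Encoding` header mentioned in the comment above, here is a minimal sketch that decompresses only when that header says `gzip`. The class name `ContentEncodingDemo` and the helpers `unwrap`, `readAll`, and `gzip` are hypothetical, and the body is built in memory rather than fetched from the URL in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ContentEncodingDemo {

    // Hypothetical helper: wrap the raw stream in a GZIPInputStream only
    // when the Content-Encoding header value is "gzip".
    static InputStream unwrap(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Reads the whole stream and decodes it as UTF-8.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Demo-only helper: gzip a string in memory to stand in for a server response.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String body = "window.page106_callback([\"<div>\"]);";

        // Simulated response with Content-Encoding: gzip
        InputStream compressed = new ByteArrayInputStream(gzip(body));
        System.out.println(readAll(unwrap(compressed, "gzip")));

        // Simulated response with no Content-Encoding header
        InputStream plain = new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(unwrap(plain, null)));
    }
}
```

With a real connection, the same helper could be called as `unwrap(con.getInputStream(), con.getContentEncoding())`, assuming `con` is an `HttpURLConnection`.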

score 0 · Answer 2 · edited Feb 26 '19 at 08:51

@Maurice Perry is right. I tried it with the code below:

String url = "https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp";

URL obj = new URL(url);
HttpURLConnection con = (HttpURLConnection) obj.openConnection();

// Decompress the gzip-encoded response, then decode it as UTF-8
// instead of relying on the platform default charset.
BufferedReader in = new BufferedReader(
        new InputStreamReader(new GZIPInputStream(con.getInputStream()), StandardCharsets.UTF_8));
String inputLine;
StringBuilder response = new StringBuilder();

// readLine() strips line terminators, so lines are concatenated as-is.
while ((inputLine = in.readLine()) != null) {
    response.append(inputLine);
}
in.close();

System.out.println(response.toString());
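
One caveat with wrapping unconditionally: `GZIPInputStream` throws a `ZipException` when the stream is not gzip data, so the same code would fail on a response that is not compressed. A possible workaround, sketched here with hypothetical names (`GzipSniffDemo`, `maybeGunzip`, `readText`, `gzip`), is to peek at the first two bytes and decompress only when they match the gzip magic number `0x1f 0x8b`. The input is simulated in memory rather than fetched from the URL.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSniffDemo {

    // Hypothetical helper: peek at the first two bytes, push them back,
    // and decompress only if they are the gzip magic number.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] head = new byte[2];
        int n = 0;
        while (n < 2) {
            int r = in.read(head, n, 2 - n);
            if (r == -1) {
                break; // stream shorter than two bytes
            }
            n += r;
        }
        if (n > 0) {
            in.unread(head, 0, n);
        }
        boolean gzipped = n == 2
                && (head[0] & 0xFF) == 0x1f
                && (head[1] & 0xFF) == 0x8b;
        return gzipped ? new GZIPInputStream(in) : in;
    }

    // Reads all lines as UTF-8, restoring the terminators readLine() strips.
    static String readText(InputStream raw) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(maybeGunzip(raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    // Demo-only helper: gzip a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String body = "line one\nline two\n";
        // Both calls go through the same code path, compressed or not.
        System.out.print(readText(new ByteArrayInputStream(gzip(body))));
        System.out.print(readText(new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Against a live server, `readText(con.getInputStream())` would be the equivalent call, assuming `con` is the `HttpURLConnection` from the snippet above. Sniffing bytes is a heuristic; checking the `Content-Encoding` header is the more direct signal when it is available.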
