JAVA: Greek characters from downloaded HTML file aren't displayed, how can I fix this?

Question

I'm downloading an HTML file and I need to display it with System.out.println().

The problem is that instead of Greek characters I get rubbish.

I'm using the code below to download the HTML file:

 URL url = new URL("here goes the link to the html file");
 BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
 String htmlfile = "";
 String temp;
 while ((temp = br.readLine()) != null) {
       htmlfile+= temp;
 }
 System.out.println(htmlfile);

Is it possible to fix this problem? Here is a sample of what I get as a result:

    <title>Ξ ΟΞ»Ξ·  ΞΞ»Ξ΅ΞΊΟΟΏΟ ΟΏ Ξ΄ΞΉΞΊΟΟΞ±ΞΊΟ ΟΟΟΞΏ</title>

All my regional settings on my computer are fine. I can use System.out.println to display Greek words directly. I have a feeling I need to change some locale settings in the BufferedReader but I'm not sure how to do it, or whether that's the correct way of approaching this problem.

Somewhat off topic, I have a feeling the above method of downloading the HTML file is really inefficient. For example, when I use htmlfile += temp, aren't I basically creating a new String instance every time I read a line from the HTML file? This sounds very costly; if you can, please suggest other, more efficient ways of doing the same thing.

What's costly is repeatedly concatenating `temp` to `htmlfile`: each concatenation takes `O(n)` time, because the whole string is copied. I would just output `temp` each time it changes: `while ((temp = br.readLine()) != null) { System.out.print(temp); }` – eboix, Feb 26 '12 at 14:29
tchrist, I didn't know I could do that, I tried the method initially suggested by Joop Eggen and it worked fine :) — jan1, Feb 26 '12 at 14:41

Joop Eggen · Accepted Answer · 2012-02-26T14:56:47.917

2

String encoding = "UTF-8"; // Or "ISO-8859-7"
BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream(), encoding));

ISO-8859-7 is the 8-bit encoding used for Greek; UTF-8 is the multibyte Unicode encoding.
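To see why the wrong charset produces rubbish, here is a minimal sketch. The class and method names are made up for illustration. It encodes Greek text as UTF-8 and then decodes those bytes with a chosen charset, which is in effect what an InputStreamReader without an explicit encoding does when the platform default differs from the page's encoding. It assumes the JRE provides the ISO-8859-7 charset, which is common but not required by the Java specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encodes text as UTF-8, then decodes the bytes with the given charset.
    // Decoding with anything other than UTF-8 mimics a reader that was
    // constructed with the wrong encoding.
    static String decodeWith(String text, Charset charset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, charset);
    }

    public static void main(String[] args) {
        String greek = "\u03A0\u03CD\u03BB\u03B7"; // the Greek word for "gate"
        // Assumption: ISO-8859-7 is available in this JRE.
        Charset greek8bit = Charset.forName("ISO-8859-7");

        // Booleans are printed instead of the text itself so the result
        // does not depend on the console's own encoding.
        System.out.println("round trip with UTF-8 intact: "
                + decodeWith(greek, StandardCharsets.UTF_8).equals(greek));
        System.out.println("round trip with ISO-8859-7 intact: "
                + decodeWith(greek, greek8bit).equals(greek));
    }
}
```

Each Greek letter occupies two bytes in UTF-8, and a single-byte charset turns every one of those bytes into its own character, which matches the pattern in the sample output from the question.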

StringBuilder sb = new StringBuilder();
String temp;
while ((temp = br.readLine()) != null) {
    sb.append(temp).append("\n");
    System.out.println(temp);
}
String html = sb.toString();

readLine removes the line ending (\r on classic Mac OS, \n on Unix, or \r\n on Windows), which is why the loop above appends "\n" itself.
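A small sketch of that behaviour, using a StringReader in place of the network stream. The class and method names are hypothetical. It feeds text with all three kinds of line ending through readLine and collects the results.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class ReadLineDemo {
    // Splits text into lines the same way BufferedReader.readLine does.
    static List<String> lines(String text) {
        List<String> result = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                result.add(line); // the terminator is not part of the line
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // \r\n, \n and \r are all treated as line terminators.
        System.out.println(lines("one\r\ntwo\nthree\rfour")); // [one, two, three, four]
    }
}
```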

edited Feb 26 '12 at 14:56

answered Feb 26 '12 at 14:27


"ISO-8859-1 is the 8-bit encoding used by Greek" -- As you put in your comment, it's actually ISO-8859-7. +1, though. – eboix Feb 26 '12 at 14:37
MacOS hasn’t used `\r` since time immemorial. – tchrist Feb 26 '12 at 14:39

score 1 · Answer 2 · edited May 23 '17 at 11:48

You need to use the content type's character set as specified by the response headers.

The code below adapts the answer "Using java.net.URLConnection to fire and handle HTTP requests" to your problem.

URL url = new URL("here goes the link to the html file");
URLConnection conn = url.openConnection();
InputStream in = conn.getInputStream();
try {
  // Look at the response headers to figure out the character encoding.
  // The contentType is null or a String like "text/html; charset=UTF-8"
  String contentType = conn.getContentType();
  // Get the charset from the content type.
  String charset = null;
  if (contentType != null) {
    for (String param : contentType.replace(" ", "").split(";")) {
      // Parameter names are case-insensitive, so compare in lower case.
      if (param.toLowerCase().startsWith("charset=")) {
        // Strip optional quotes, as in charset="UTF-8".
        charset = param.split("=", 2)[1].replace("\"", "");
        break;
      }
    }
  }
  // Choose a default that does not depend on the default encoding.
  // It might be best to use the default encoding if the URL is a
  // file: URL.
  if (charset == null) { charset = "UTF-8"; }
  Reader r = new InputStreamReader(in, charset);
  BufferedReader br = new BufferedReader(r);
  // Read the content from the buffered reader as above.
  // See below.
} finally {
  // URLConnection has no close() method; close the stream instead.
  in.close();
}
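The header parsing above can also be pulled into a small helper so it can be checked without a network connection. This is only a sketch: the class name, method name, and fallback parameter are hypothetical, and it handles just the simple `name=value` parameter form rather than every case the HTTP specification allows.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ContentTypeCharset {
    private static final String PREFIX = "charset=";

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or the fallback when the header is
    // missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive prefix match on "charset=".
            if (p.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
                String name = p.substring(PREFIX.length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=UTF-8",
                StandardCharsets.ISO_8859_1));
        System.out.println(fromContentType(null, StandardCharsets.UTF_8));
    }
}
```

Returning a Charset rather than a String means an unsupported name is caught here instead of surfacing later as an UnsupportedEncodingException from the InputStreamReader constructor.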

Somewhat off topic, I have a feeling the above method of downloading the HTML file is really inefficient. For example, when I use htmlfile += temp, aren't I basically creating a new String instance every time I read a line from the HTML file?

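Yes. The code below is a more efficient way to read the characters: it appends fixed-size chunks to a StringBuilder instead of creating a new String per line.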

StringBuilder sb = new StringBuilder();
char[] buf = new char[4096];
for (int nRead; (nRead = br.read(buf)) > 0;) {
  sb.append(buf, 0, nRead);
}
String html = sb.toString();

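You can read the Content-length header via conn.getHeaderField("Content-length"), or as an int via conn.getContentLength(), to get a hint at the size of the content and pre-size the StringBuilder. The value is a byte count and may be missing, so treat it only as an estimate of the number of characters.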
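A minimal sketch of that pre-sizing, with hypothetical class and method names. The second argument is meant to be whatever conn.getContentLength() returned, which is -1 when the length is unknown; the default capacity of 8192 is an arbitrary choice for that case.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class PresizedRead {
    // Reads everything from the reader into a String. contentLength is
    // only a capacity hint: the StringBuilder still grows if the hint
    // turns out to be too small.
    static String readAll(Reader reader, int contentLength) {
        int capacity = contentLength > 0 ? contentLength : 8192;
        StringBuilder sb = new StringBuilder(capacity);
        char[] buf = new char[4096];
        try {
            // read returns -1 at end of stream.
            for (int nRead; (nRead = reader.read(buf)) != -1;) {
                sb.append(buf, 0, nRead);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for the BufferedReader over the connection.
        String html = readAll(new StringReader("<html></html>"), -1);
        System.out.println(html.length());
    }
}
```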

Plus for the `getContentType()`; could use `getContentEncoding()` too. — Joop Eggen, Feb 26 '12 at 14:53
@JoopEggen, What does content encoding have to do with charset? Doesn't that distinguish between chunked encoding vs gzip vs other encoding forms that have nothing to do with byte <-> char mappings? — Mike Samuel, Feb 26 '12 at 15:19
