0

I'm downloading an HTML file and I need to display it with System.out.println().

The problem is that instead of Greek characters I get rubbish.

I'm using the code below to download the HTML file:

 URL url = new URL("here goes the link to the html file");
 BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
 String htmlfile = "";
 String temp;
 while ((temp = br.readLine()) != null) {
       htmlfile+= temp;
 }
 System.out.println(htmlfile);

Is it possible to fix this problem? Here is a sample of what I get as a result:

    <title>Ξ Ολη  ΞλΡκΟΟΏΟ ΟΏ δικΟΟΞ±ΞΊΟ ΟΟΟΞΏ</title>

All my regional settings on my computer are fine. I can use System.out.println to display Greek words directly. I have a feeling I need to change some locale settings in the BufferedReader but I'm not sure how to do it, or whether that's the correct way of approaching this problem.

Somewhat off topic, I have a feeling the above method of downloading the HTML file is really inefficient. For example, when I use htmlfile += temp, aren't I basically creating a new String instance every time I read a line from the HTML file? This sounds very costly; if you can, please suggest more efficient ways of doing the same thing.

casperOne
jan1
  • What's costly is when you keep on concatenating `temp` to `htmlFile`. That's `O(n)` time complexity. I would just output `temp` each time it changes: `while ((temp = br.readLine()) != null) { System.out.print(temp); }` – eboix Feb 26 '12 at 14:29
  • Have you made sure your output stream is in UTF-8? – tchrist Feb 26 '12 at 14:39
  • tchrist, I didn't know I could do that, I tried the method initially suggested by Joop Eggen and it worked fine :) – jan1 Feb 26 '12 at 14:41

2 Answers

2
String encoding = "UTF-8"; // Or "ISO-8859-7"
BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream(), encoding));

ISO-8859-7 is the 8-bit encoding for Greek; UTF-8 is the multibyte Unicode encoding.
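To illustrate why the charset passed to InputStreamReader matters, here is a minimal sketch that decodes the same UTF-8 bytes twice, once correctly and once as ISO-8859-7. The class and method names (DecodeDemo, decodeUtf8BytesAs) are made up for this example, and it assumes the page is served as UTF-8 and that the JVM supports ISO-8859-7.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    /** Encodes text as UTF-8, then decodes those bytes with the given charset. */
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Ellada" in Greek, written with escapes so the source file encoding does not matter.
        String greek = "\u0395\u03BB\u03BB\u03AC\u03B4\u03B1";

        // Right charset: the text round-trips unchanged.
        String ok = decodeUtf8BytesAs(greek, StandardCharsets.UTF_8);
        System.out.println(greek.equals(ok));

        // Wrong charset: every UTF-8 byte becomes its own character,
        // so each Greek letter turns into two unrelated characters.
        String garbled = decodeUtf8BytesAs(greek, Charset.forName("ISO-8859-7"));
        System.out.println(greek.length() + " chars became " + garbled.length());
    }
}
```

The garbled form usually starts each letter with a capital Greek letter such as Ξ or Ο, which looks a lot like the sample in the question.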

StringBuilder sb = new StringBuilder();
String temp;
while ((temp = br.readLine()) != null) {
    sb.append(temp).append("\n");
    System.out.println(temp);
}
String html = sb.toString();

readLine strips the line ending (\r on old Mac OS, \n on Unix, or \r\n on Windows), which is why the loop above appends "\n" after each line.
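A small sketch of that readLine behaviour, using a StringReader instead of a real URL so it runs without a network connection. The class and method names (ReadLineDemo, readLines) are only illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class ReadLineDemo {
    /** Splits text into lines with BufferedReader.readLine(). */
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line); // the terminator is already gone here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Mixes all three terminators: \r, \n and \r\n.
        System.out.println(readLines("a\rb\nc\r\nd")); // prints [a, b, c, d]
    }
}
```

Because the terminators are dropped, the original line endings cannot be recovered from the lines alone; appending "\n" normalizes them.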

Joop Eggen
1

You need to use the content type's character set as specified by the response headers.

The code below adapts "Using java.net.URLConnection to fire and handle HTTP requests" to your problem.

// Needs: import java.io.*; import java.net.URL; import java.net.URLConnection;
URL url = new URL("here goes the link to the html file");
URLConnection conn = url.openConnection();
InputStream in = conn.getInputStream();
try {
  // Look at the response headers to figure out the character encoding.
  // The contentType is null or a String like "text/html; charset=UTF-8"
  String contentType = conn.getContentType();
  // Get the charset from the content type.
  String charset = null;
  if (contentType != null) {
    for (String param : contentType.replace(" ", "").split(";")) {
      if (param.toLowerCase().startsWith("charset=")) {
        // The value may be quoted, as in charset="UTF-8".
        charset = param.split("=", 2)[1].replace("\"", "");
        break;
      }
    }
  }
  // Choose a default that does not depend on the default encoding.
  // It might be best to use the default encoding if the URL is a
  // file: URL.
  if (charset == null) { charset = "UTF-8"; }
  Reader r = new InputStreamReader(in, charset);
  BufferedReader br = new BufferedReader(r);
  // Read the content from the buffered reader as above.
  // See below.
} finally {
  // URLConnection has no close() method; close the stream instead.
  in.close();
}
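The header parsing can also be pulled out into a small helper, which makes it easy to try against sample Content-Type values without opening a connection. This is only a sketch: ContentTypeCharset and charsetOf are hypothetical names, and falling back to UTF-8 for a missing or unknown charset is an assumption, not something the HTTP response guarantees.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ContentTypeCharset {
    private static final String KEY = "charset=";

    /** Returns the charset named in a Content-Type value, or UTF-8 as an assumed default. */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                // Parameter names are case-insensitive, so compare ignoring case.
                if (p.regionMatches(true, 0, KEY, 0, KEY.length())) {
                    String name = p.substring(KEY.length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // malformed or unsupported name: use the default
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

The result could then be passed straight to new InputStreamReader(in, charset), which also accepts a Charset object.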

Somewhat off topic, I have a feeling the above method of downloading the HTML file is really inefficient. For example, when I use htmlfile += temp, aren't I basically creating a new String instance every time I read a line from the HTML file?

Yes. The code below is a more efficient way to read the characters.

StringBuilder sb = new StringBuilder();
char[] buf = new char[4096];
for (int nRead; (nRead = br.read(buf)) > 0;) {
  sb.append(buf, 0, nRead);
}
String html = sb.toString();

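You can read the Content-Length header via conn.getHeaderField("Content-Length"), or call conn.getContentLength(), which returns -1 when the length is unknown, to get a hint at the size of the content and pre-size the StringBuilder. The header counts bytes rather than characters, so treat it only as an estimate.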
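Putting the buffer loop and the size hint together might look like the sketch below. ReadAll and readAll are hypothetical names, the 8192 fallback capacity is an arbitrary choice, and the demo reads from a StringReader rather than a real connection.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ReadAll {
    /**
     * Reads everything from the reader into a String.
     * sizeHint may come from URLConnection.getContentLength(); pass -1 if unknown.
     */
    static String readAll(Reader reader, int sizeHint) {
        // Pre-size the builder when a positive hint is available.
        StringBuilder sb = new StringBuilder(sizeHint > 0 ? sizeHint : 8192);
        char[] buf = new char[4096];
        try {
            int nRead;
            while ((nRead = reader.read(buf)) != -1) {
                sb.append(buf, 0, nRead);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = readAll(new StringReader("<title>test</title>\n"), -1);
        System.out.print(html);
    }
}
```

Unlike the readLine loop, this keeps the original line endings, since the characters are copied through untouched.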

Community
Mike Samuel
  • Plus for the `getContentType()`; could use `getContentEncoding()` too. – Joop Eggen Feb 26 '12 at 14:53
  • @JoopEggen, What does content encoding have to do with charset? Doesn't that distinguish between chunked encoding vs gzip vs other encoding forms that have nothing to do with byte <-> char mappings? – Mike Samuel Feb 26 '12 at 15:19