I'm downloading a website in Java, using all this:

myUrl = new URL("here is my URL");
in = new BufferedReader(new InputStreamReader(myUrl.openStream()));

However, this file contains some special characters like ä, ö and ü. I need to be able to print these out properly.

I try to encode the Strings using:

String encodedString = new String(toEncode.getBytes("Windows-1252"), "UTF-8");

But all it does is replace these special characters with a ?.

When I open what I am trying to print here using a downloaded .html file from Chrome with Notepad++, it says (in the bottom right corner) UNIX and Windows-1252. That's all I know about the encoded file.

What more steps can I take to figure out what is wrong?

--AND--

How can I convert this file so that I can properly read and print it in Java?

Sorry if this question is kind of stupid... I simply don't know any better and couldn't find anything on the internet.

Maverick283
  • Well, for starters, specify an encoding in your `InputStreamReader`... You don't specify any – fge Mar 30 '15 at 20:19
  • @fge That sounds good and I'll be right on it but how do I do that if I don't even know what Encoding I have to use? – Maverick283 Mar 30 '15 at 20:20
  • You should know the encoding - otherwise it's just not possible to do a right decoding. I would print out the values of all bytes to see what is really in there. One byte per character? Which code for ä, which for ü and so on. Just to get forward. – chris Mar 30 '15 at 20:22
  • The "UNIX" in this case is just the line endings. Windows-1252 is a Windows encoding, usually discouraged in favor of ISO-8859-1. Also, strings in Java are real strings with real characters; they don't need any special handling to "print correctly". This means you have to use the correct decoder when getting a string in the first place - or rely on automatic detection. – bzlm Mar 30 '15 at 20:22
  • See my answer; I'll be editing it with more details, which I have repeated over and over again, but... – fge Mar 30 '15 at 20:23
  • Like chris said you have to know the encoding of the file you download to encode it correctly. There are algorithms which try to find out the right encoding. Have a look at [java-how-to-determine-the-correct-charset-encoding-of-a-stream](http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream) this seems to be related to your problem. – tomse Mar 30 '15 at 20:26
  • Okay, thanks to @tomse I figured out that the encoding was indeed (as Notepad++ suggested) `Windows-1252` and with @fge inspiration I put in `... InputStreamReader(myUrl.openStream(),"Windows-1252")` which led to a result. Thanks for y'alls help! – Maverick283 Mar 30 '15 at 20:29
  • @tomse, most web servers specify the transport encoding and the content encoding in responses. Doesn't Java already take this into account? – bzlm Mar 30 '15 at 20:29
  • @Maverick283, remember that that will stop working when the web server decides to respond with something other than a legacy 8-bit encoding. – bzlm Mar 30 '15 at 20:30
  • Answer edited. It does not tell the full story, but it tells a good part of it... – fge Mar 30 '15 at 20:32
  • @bzlm So then I would try to implement an encoding detector as tomse suggested? – Maverick283 Mar 30 '15 at 20:32
  • @Maverick283 I can't believe such a detector is not covered by the Java URL downloading APIs, but yes. Windows-1252 isn't very common these days, so basing your code on it seems brittle. – bzlm Mar 30 '15 at 20:33
  • @bzlm I guess this might be a problem if you are trying to read multiple websites. In this case it is going to be the same website every single time, and if the admins of that website decide to change it, then I'll change the code accordingly. But in general this would sure be good advice, thank you! – Maverick283 Mar 30 '15 at 20:35
  • @bzlm is _"transport encoding"_ the same as the file encoding of the retrieved file? I'm not an expert on this but I'm wondering how this should work. E.g. if there are many files with different encodings stored on the server, the server would have the same problem of figuring out the right file encoding, wouldn't it? – tomse Mar 30 '15 at 20:48
  • @tomse The transport encoding is on top of the content encoding, so not super important here. If the server cannot determine the encoding of a file, it usually does not specify the encoding in responses, making it a concern of the consumer. What do the HTTP response headers say exactly for your Windows-1252 response? – bzlm Mar 30 '15 at 21:19
  • @bzlm To answer your initial question, I'm not aware of a possibility to simply tell Java to take the HTTP response header encoding information into account. – tomse Mar 30 '15 at 21:27

1 Answer

OK, so you are mixing a lot of things here.

First of all, you do:

new InputStreamReader(myUrl.openStream())

This will open a reader, yes; however, it will do so using your default JRE/OS Charset, which may not be what you want.

Try specifying the charset explicitly, for example UTF_8 (note, Java 7+ code); substitute whichever charset the page is actually served in:

try (
    final InputStream in = myUrl.openStream();
    final Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
) {
    // read from the reader here
}
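The same pattern works for any charset. Since the asker later confirmed in the comments that this particular page is Windows-1252, here is a minimal sketch of reading with that charset. To keep it runnable without a network, an in-memory byte array stands in for `myUrl.openStream()`; the class and method names (`ExplicitCharsetDemo`, `readFirstLine`) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class ExplicitCharsetDemo {
    // Hypothetical helper: decodes the first line of the stream
    // using the charset the caller names, never the platform default.
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for myUrl.openStream(): 0xE4 0xF6 0xFC are "äöü" in Windows-1252.
        byte[] page = {(byte) 0xE4, (byte) 0xF6, (byte) 0xFC};
        Charset cp1252 = Charset.forName("windows-1252");

        String line = readFirstLine(new ByteArrayInputStream(page), cp1252);

        // Compare against Unicode escapes so the check does not depend on
        // the source file's or the console's encoding.
        System.out.println(line.equals("\u00e4\u00f6\u00fc")); // prints true
    }
}
```

With a real URL, you would pass `myUrl.openStream()` instead of the `ByteArrayInputStream`.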

Now, what you are mixing...

You read from an InputStream; an InputStream only knows how to read bytes.

But you want text; and in Java, text means a sequence of chars.

Let us forget for a moment that you want chars and focus on the fact that you want text; let us replace the char with a carrier pigeon.

Now, what you need to do is to transform this stream of bytes into a stream of carrier pigeons. For this, you need a particular process. And in this case, the process is called decoding.

Back to Java, now. There also exists a process which does the reverse: encoding a stream of carrier pigeons (or chars) into a stream of bytes.

The trick... There exist several ways to do that; Unicode refers to them as character encodings; and in Java, the base class which provides both encoders and decoders is Charset.
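To make the two directions concrete, here is a small sketch of encoding and decoding with explicit charsets. It also shows one plausible reason why the `new String(toEncode.getBytes("Windows-1252"), "UTF-8")` attempt from the question cannot work: it encodes with one charset and decodes with another. The class and helper names (`TranscodeDemo`, `decode`, `encode`) are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    // Decoding: bytes -> chars, using the charset the bytes were written in.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Encoding: chars -> bytes, using the charset you want on the wire or on disk.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String text = "\u00e4"; // "ä"

        byte[] asCp1252 = encode(text, cp1252);               // one byte: 0xE4
        byte[] asUtf8 = encode(text, StandardCharsets.UTF_8); // two bytes: 0xC3 0xA4
        System.out.println(asCp1252.length + " vs " + asUtf8.length); // prints 1 vs 2

        // Decoding with the matching charset round-trips.
        System.out.println(decode(asCp1252, cp1252).equals(text)); // prints true

        // Decoding with a different charset does not: a lone 0xE4 is not valid
        // UTF-8, so the decoder substitutes a replacement character.
        System.out.println(decode(asCp1252, StandardCharsets.UTF_8).equals(text)); // prints false
    }
}
```

The takeaway is that a String never needs "re-encoding" to print correctly; the only place a charset matters is at the boundary where bytes become chars or chars become bytes.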

Now, an InputStreamReader accepts a Charset as a second argument... Which you should ALWAYS specify. If you DO NOT, this:

new InputStreamReader(in);

will be equivalent to:

new InputStreamReader(in, Charset.defaultCharset());

and Charset.defaultCharset() is Not. Guaranteed. To. Be. The. Same. Amongst. Implementations. Of. JREs.
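The comments under the question ask whether the charset declared by the web server can be used instead of hardcoding one. `openStream()` only hands you raw bytes, but `URLConnection.getContentType()` exposes the `Content-Type` header, which may carry a `charset` parameter. Below is a rough sketch of picking the charset from that header. The helper `charsetFromContentType`, the class name, and the UTF-8 fallback are all assumptions made for illustration; a server may omit the parameter entirely, or the page may declare its charset only in a `<meta>` tag, which this sketch does not handle.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HeaderCharsetDemo {
    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // value such as "text/html; charset=windows-1252". Returns the fallback
    // when the header is missing, has no charset, or names an unknown charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No URL given: just demonstrate the header parsing.
            System.out.println(charsetFromContentType(
                    "text/html; charset=windows-1252", StandardCharsets.UTF_8));
            return;
        }
        URLConnection conn = new URL(args[0]).openConnection();
        // Assumption: fall back to UTF-8 when the server declares nothing.
        Charset charset = charsetFromContentType(conn.getContentType(), StandardCharsets.UTF_8);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), charset))) {
            System.out.println(reader.readLine());
        }
    }
}
```

Whatever the source of the charset, the point stays the same: it is passed to the `InputStreamReader` explicitly rather than left to `Charset.defaultCharset()`.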

fge