Java byte to String encoding problem on Linux

Question

I am implementing a piece of software that works like this:

I have a Linux server running a VT100 terminal application that outputs text. My program telnets to the server and reads and parses parts of the text into relevant data. The relevant data is sent to a small client, run by a webserver, that outputs the data on an HTML page.

My problem is that certain special characters like "åäö" are output as question marks (classic).

Background:
My program reads a byte stream using Apache Commons TelnetClient. The byte stream is converted into a String, then the relevant bits are substring'ed and put back together with separator characters. After this, the new String is converted back into a byte array and sent over a Socket to the client run by the webserver. This client creates a String from the received bytes and prints it on standard output, which the webserver reads and turns into HTML.

Step 1: byte[] --> String --> byte[] --> [send to client]

Step 2: byte[] --> String --> [print output]

Problem:
When I run my Java program on Windows, all characters, including "åäö", are output correctly on the resulting HTML page. However, if I run the program on Linux, all special characters are converted into "?" (question mark).

The webserver and the client are currently running on Windows (step 2).

Code:
The program basically works like this:

My program:

byte[] data = telnetClient.readData(); // Assume method works and returns a byte[] array of text.

// I have my reasons to append the characters one at a time using a StringBuffer.
StringBuffer buf = new StringBuffer();
for (byte b : data) {
    buf.append((char) (b & 0xFF));
}

String text = buf.toString();

// ...
// Relevant bits are substring'ed and put back into the String.
// ...

ServerSocket serverSocket = new ServerSocket(...);
Socket socket = serverSocket.accept();
serverSocket.close();

socket.getOutputStream().write(text.getBytes()); // no charset given, so the platform default is used
socket.getOutputStream().flush();

The client run by webserver:

Socket socket = new Socket(...);

byte[] data = readData(socket); // Assume this reads the bytes correctly.

String output = new String(data);

System.out.println(output);

Assume the synchronizing between the reads and writes works.

Thoughts:
I have tried different ways of encoding and decoding the byte array, with no results. I am a little new to charset encoding issues and would like some pointers. The default charset on Windows, "WINDOWS 1252", seems to let the special characters through all the way from server to webserver, but when the program runs on a Linux computer the default charset is different. I have tried running "Charset.defaultCharset().forName()" and it shows that my Linux computer is set to "US-ASCII". I thought that Linux defaulted to "UTF-8"?

What should I do to get my program to work on Linux?

possible duplicate of [What is character encoding and why should I bother with it](http://stackoverflow.com/questions/10611455/what-is-character-encoding-and-why-should-i-bother-with-it) — Raedwald, Apr 10 '15 at 12:25

score 8 · Accepted Answer · answered Aug 11 '11 at 12:13

It's generally a bad idea to rely on the platform default encoding, especially for a network communication protocol.

Both new String() and String.getBytes() are overloaded to allow you to specify the encoding. Since you control encoding as well as decoding, simply use UTF-8 (hardcoded).

Also check your code for uses of FileReader, FileWriter, InputStreamReader and OutputStreamWriter, all of which potentially rely on the platform default encoding (the first two, exclusively, which makes them pretty useless).
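As a minimal sketch of the advice above, the following shows both ends of the connection using the same hardcoded charset. The class and method names are hypothetical and not from the question; the non-ASCII characters are written as Unicode escapes so that the encoding of the source file itself does not matter.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRoundTrip {

    // Sender side: encode with a fixed charset instead of the platform default.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Receiver side: decode with the same fixed charset.
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e5\u00e4\u00f6"; // the characters "åäö"
        byte[] wire = encode(original);
        String received = decode(wire);

        System.out.println(wire.length);               // 6, two bytes per character in UTF-8
        System.out.println(original.equals(received)); // true on any platform
    }
}
```

Note that System.out.println in the client also encodes with the platform default, so the final print step may need an explicit encoding as well, for example by wrapping System.out in a PrintStream created with a charset name.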

Michael Borgwardt

How should I decode using a StringBuffer? `buf.append(new String(new byte[] { b }, "UTF-8"))`? But otherwise I should always decode and encode with UTF-8 throughout the whole program (including the client)? – Felix Glas Aug 11 '11 at 12:30
@snipes83: Drop the StringBuffer thing. It's a pointless, error-prone complication and would require a lot more complex logic to work for UTF-8 since it will use more than one byte for characters outside ASCII. Or explain your reasons for wanting to do it that way and we may find a better solution to achieve what you actually want. Otherwise, yes, use UTF-8 everywhere. And avoid converting between Strings and bytes as much as possible. – Michael Borgwardt Aug 11 '11 at 12:38
The reason for using the StringBuffer is that I'm interpreting a VT100 terminal frame which is 80 columns wide and 24 lines high. To keep track of the formatting of each character (color, bold, reversed background) I use a separate, identical matrix which holds this information. When creating the String I check each raw byte in the matrix against the format matrix and create a character preceded and succeeded by a format tag in XML. This is complex and I would rather not make any changes. – Felix Glas Aug 11 '11 at 12:50
@snipes83: Hm, in that case you have a second encoding to consider - what encoding is the telnet client using? If it's ISO-8859-1, then your current code will work unmodified, but you should document this assumption because it's very implicit (in the fact that the first 256 code points of Unicode are backwards-compatible to ISO-8859-1). It would probably be better to first convert the bytes you got from the telnet client to a String in one go and explicitly using the encoding, and then do the formatting matrix thing on the characters in that string. – Michael Borgwardt Aug 11 '11 at 13:13
I should mention that if I use "java -jar -Dfile.encoding=ISO-8859-1 myprogram.jar" it works without problems on **Linux** and outputs all special characters correctly. This is however a very bad solution and I want to get the encodings right in the code. Does the above imply that the TelnetClient is outputting data encoded in some ISO-8859-1-compatible charset? – Felix Glas Aug 11 '11 at 14:48
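The equivalence discussed in these comments can be checked with a small sketch. It assumes nothing about TelnetClient; it only compares the per-byte cast from the question with an explicit ISO-8859-1 decode, and then shows what encoding to US-ASCII does to the same characters. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Equivalence {

    // The per-byte cast used in the question's loop.
    static String decodeByCast(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        for (byte b : data) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build an array holding every possible byte value.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }

        // The cast and an explicit ISO-8859-1 decode give the same String.
        String byCast = decodeByCast(all);
        String byCharset = new String(all, StandardCharsets.ISO_8859_1);
        System.out.println(byCast.equals(byCharset)); // true

        // Encoding to US-ASCII replaces each unmappable character with '?'.
        byte[] ascii = "\u00e5\u00e4\u00f6".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // ???
    }
}
```

This is consistent with the observation that -Dfile.encoding=ISO-8859-1 makes the program work: with that default, text.getBytes() reverses the cast exactly, whereas a US-ASCII default turns every character above 127 into a question mark.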

score 3 · Answer 2 · answered Aug 11 '11 at 12:14

String(byte[] bytes, String encoding) is your friend. Just read all raw bytes into a byte buffer and use this constructor to decode the bytes into a Java string. (or: transcode to UTF-16, the internal character encoding)

The method getBytes(String encoding) will encode a String to bytes.
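A rough sketch of "read all raw bytes, then decode once" might look like the following. The helper name readAll and the tiny chunk size are illustrative assumptions; the chunk is kept small on purpose so that a multi-byte UTF-8 character is split across two reads, which is the case where decoding chunk by chunk would go wrong.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReadThenDecode {

    // Collect every byte first, then decode once with a named charset.
    static String readAll(InputStream in, String charsetName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4]; // deliberately tiny for the demonstration
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            // String(byte[], String) throws UnsupportedEncodingException,
            // a subclass of IOException, if the charset name is unknown.
            return new String(buffer.toByteArray(), charsetName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "abc\u00e5\u00e4\u00f6";
        byte[] wire = original.getBytes(StandardCharsets.UTF_8); // 9 bytes
        String text = readAll(new ByteArrayInputStream(wire), "UTF-8");
        System.out.println(text.equals(original)); // true
    }
}
```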

score 0 · Answer 3 · edited May 23 '17 at 12:14

The key detail is the encoding of the data returned from telnetClient.readData(). It sounds like it is windows-1252. With that in mind, you have a couple of options. You can explicitly set the encoding on all of the String operations to windows-1252:

text.getBytes("windows-1252");

String output = new String(data, "windows-1252");

Or you can use java.nio.charset.Charset to convert the telnet data to something less platform-specific like UTF-8, following the example in "Converting UTF-8 to ISO-8859-1 in Java - how to keep it as single byte", while still setting the character sets in the String operations explicitly.
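One possible shape for that conversion is sketched below. It assumes the telnet bytes really are windows-1252 (which the question does not confirm) and that the running JVM provides that charset; the transcode helper is a hypothetical name, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {

    // Decode bytes from one charset and re-encode them in another.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        CharBuffer chars = from.decode(ByteBuffer.wrap(input));
        ByteBuffer encoded = to.encode(chars);
        byte[] result = new byte[encoded.remaining()];
        encoded.get(result);
        return result;
    }

    public static void main(String[] args) {
        // Assumption: the JVM supports windows-1252; forName throws otherwise.
        Charset cp1252 = Charset.forName("windows-1252");

        // The bytes for "åäö" as windows-1252 would deliver them.
        byte[] telnetBytes = { (byte) 0xE5, (byte) 0xE4, (byte) 0xF6 };

        byte[] utf8 = transcode(telnetBytes, cp1252, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 6
        System.out.println(
            new String(utf8, StandardCharsets.UTF_8).equals("\u00e5\u00e4\u00f6")); // true
    }
}
```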
