
I read this code. As the content of xanadu.txt I use "test". The file is 4 bytes in size. If I use the debugger to run out.write(c) one byte at a time, and after each write open the file outagain.txt (in Notepad), I see successively: t --> te --> tes --> test. OK, BUT if I change the content of the source file (xanadu.txt) to the Greek equivalent of "test" (τέστ), the file is now 8 bytes in size (I think because with UTF we have 2 bytes per character). When I debug again, a meaningless "hieroglyphic" character appears each time out.write(c) runs. Only when the last (8th) byte is written does the original Greek word (τέστ) suddenly appear. Why? The same happens if I choose the console stream (in NetBeans) as the destination, but in that case the strange characters remain at the end when debugging, yet not when I run it normally(!!!).

  • have you tried converting the byte stream to a string and then using the String.charAt() method to get each character from the string? – The KNVB Nov 14 '20 at 14:30
  • @TheKNVB the problem, in principle, won't be solved that way. For example, a `String` consisting of a single emoji may have its `length()` equal to `2`, so `charAt` is not a universal answer. – Fureeish Nov 14 '20 at 14:31
  • I have tried the following code, it works: public class MyClass { public static void main(String args[]) { String data="τέστ"; for (int i=0;i<data.length();i++) System.out.print(data.charAt(i)); } } – The KNVB Nov 14 '20 at 14:32
  • "*It works in this specific case*" is usually a bad answer, unless given more context and explanations. – Fureeish Nov 14 '20 at 14:35
  • As I have understood it, Stack Overflow is aimed only at the most skillful questions, from NASA and CERN researchers and up!!!! – nonlinearly Nov 14 '20 at 16:40

3 Answers


As you observe, a single char (16 bits in Java internal representation) turns into a variable number of bytes in a byte-stream representation, in particular UTF-8.

(Some characters occupy two char values; I shall ignore those, but the answer still applies, only more so)

If you're outputting 'byte-wise', as in your experiment, in some cases you will have output a fraction of a character. That is an illegal sequence that makes no sense; some software (such as Notepad) will nevertheless try to make sense of it. That may even include guessing at the encoding. For example (I don't know this to be the case), since your half-a-character output is not valid UTF-8, Notepad may guess at an entirely different encoding, one that treats the byte sequence as a valid representation of entirely different characters.

tl;dr - garbage out, garbage displayed.
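To see the "fractional character" concretely, here is a small sketch (an illustration, not part of the original answer): it encodes τέστ to UTF-8, then decodes only a 3-byte prefix. Java's decoder substitutes U+FFFD, the replacement character, for the dangling half-character:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartialUtf8 {
    public static void main(String[] args) {
        byte[] bytes = "τέστ".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length); // 8: two bytes per Greek letter

        // Decode only the first 3 bytes: one whole character plus half of the next.
        String partial = new String(Arrays.copyOf(bytes, 3), StandardCharsets.UTF_8);
        System.out.println(partial); // "τ" followed by U+FFFD, the replacement character
    }
}
```

Notepad does essentially the same thing each time you reopen the half-written file, which is why the intermediate states look like nonsense.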

user14387228

Modern computers have this gigantic table with room for over 4 billion characters in it. Each character is identified by a single 32-bit number. All characters you can think of are in here: from the basic 'test' to 'τέστ' to the snowman ( ☃ ), to special non-visible ones that indicate a right-to-left spelled word is coming up, to a bunch of ligatures (such as ﬀ, a single character representing the ff ligature), to emoji, coloured and all.

This entire answer is essentially a sequence of these 32-bit numbers. But how would you like to store these in a file? That's where 'encoding' comes in. There are many, many encodings, and a crucial problem is that (almost) no encodings are 'detectable'.

It's like this:

If a complete stranger walks up to you and says "Hey!", what language are they speaking? Probably English. But maybe Dutch, which also has "Hey!". It could also be Japanese, in which case they're not even greeting you; they're saying "Yes" (more or less). How would you know?

The answer is: either from external context (if you're in the middle of Newcastle, UK, it's probably English), or because they explicitly tell you. But the one is, well, external, and the other is not common practice.

Text files are the same way.

They just contain the encoded text; they do not indicate which encoding it is. That means you need to tell the editor, or your newBufferedReader in Java, or your browser when saving that txt content, which encoding you want. However, because that's annoying to have to do every time, most systems have a default choice. Some text editors even try to figure out what the encoding is, but just like that person saying 'Hey!' to you might be speaking English or Japanese, with wildly different interpretations, the same happens with this semi-intelligent guessing at charset encoding.
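Telling Java the encoding looks like this, a minimal sketch (the file name is the one from the question; adjust the path as needed):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ExplicitCharset {
    public static void main(String[] args) throws IOException {
        // Name the encoding at the moment you open the file; never rely on a default.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("xanadu.txt"), StandardCharsets.UTF_8)) {
            System.out.println(reader.readLine());
        }
    }
}
```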

This gets us to the following explanation:

  1. You write τέστ in your editor and hit 'save'. What is your editor doing? Is it saving in UTF-16? UTF-8? UCS-4? ISO-8859-7? Completely different files are produced for each of these encodings! Given that it produced 8 bytes, it's UTF-16 or UTF-8. Probably UTF-8.

  2. You then copy these bytes over one by one, which is problematic: in UTF-8, a single byte can be half of a character. (You said UTF-8 stores characters as 2 bytes; that's not true. UTF-8 is variable-length: every character is 1, 2, 3, or 4 bytes. Each character in τέστ happens to be stored as 2 bytes.) That means if you've copied over, say, 3 bytes, your text editor's ability to guess what the file might be is severely hampered: it might guess UTF-8, but then realize the content isn't valid UTF-8 at all (because of that half-of-a-character you ended up with), so it guesses wrong and shows you gobbledygook.
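A small sketch (an illustration, not part of the original answer) showing how different encodings produce completely different byte counts for the very same text:

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String s = "τέστ";
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 8: 2 bytes per Greek letter
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 8: 2 bytes per char, no byte-order mark
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 10: a 2-byte byte-order mark plus 8
        System.out.println("test".getBytes(StandardCharsets.UTF_8).length); // 4: ASCII letters take 1 byte each
    }
}
```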

The lessons to learn here:

  1. When you want to process characters, use char, Reader, Writer, String, and other character-oriented things.

  2. When you want to process bytes, use byte, byte[], InputStream, OutputStream, and other byte-oriented things.

  3. Never make the mistake of thinking that these two are easily interchangeable, because they are not. Whenever you go from one 'world' to the other, you MUST specify the charset encoding; if you don't, Java picks the 'platform default', which you don't want (because now you have software that depends on an external factor and cannot be tested. Yikes).

  4. Default to UTF-8 for everything you can.
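Point 3 in practice, as a minimal sketch: when crossing from the char world to the byte world, name the charset at the bridge. OutputStreamWriter is that bridge:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class CharsToBytes {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // The charset is stated explicitly, not left to the platform default.
        try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            writer.write("τέστ");
        }
        System.out.println(bytes.size()); // 8 bytes for 4 chars
    }
}
```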

rzwitserloot
  • Excellent. One last question. How can I print in the destination file the number of each byte instead of characters? Since we are talking about raw byte streams, and since c represents a byte between 0 and 255, I would expect out.write(c) to print the number of each byte if we don't specify some encoding that transforms the sequence. – nonlinearly Nov 14 '20 at 15:49
  • @nonlinearly out.write() is specced to write a byte. Not a bunch of ASCII that represents that byte in characters (which, heh, gets us back to encoding! Now you're writing chars instead of bytes!). Fortunately, the digits 0 through 9 are stored in the same byte sequence in almost all encodings. try `out.write(("" + number).getBytes(StandardCharsets.US_ASCII));` the mouthful is because you're trying to write chars to a byte stream which is not what you should be doing. Alternatively, open that outfile _as a char stream_ (new OutputStreamWriter(theOutputStream, StandardCharsets.US_ASCII)). – rzwitserloot Nov 14 '20 at 16:11

tl;dr

Read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Do not parse text files by octets (bytes). Use classes purpose-built for handling text. For example, use Files and its readAllLines method.

Details

Notice at the bottom of that tutorial page the caution that this is not the proper way to handle text files:

CopyBytes seems like a normal program, but it actually represents a kind of low-level I/O that you should avoid. Since xanadu.txt contains character data, the best approach is to use character streams, as discussed in the next section.

Text files may or may not use a single octet to represent a single character. US-ASCII files do. Your example code assumes one octet per character, which works for test as the content but not for τέστ as the content.

As a programmer, you must know from the publisher of your data file what encoding was used in writing the data representing the original text. Generally best to use UTF-8 encoding when writing text.

Write a text file with two lines:

test
τέστ

…and save using a text-editor with an encoding of UTF-8.
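If you prefer to create that file programmatically rather than with a text editor, a sketch (using the same example path; adjust it to a writable location on your machine):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class WriteSampleFile {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("/Users/basilbourque/some_text.txt"); // adjust to a writable location
        // Write both lines, explicitly encoded as UTF-8.
        Files.write(path, Arrays.asList("test", "τέστ"), StandardCharsets.UTF_8);
    }
}
```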

Read that file as a collection of String objects.

Path path = Paths.get( "/Users/basilbourque/some_text.txt" );
try
{
    List < String > lines = Files.readAllLines( path , StandardCharsets.UTF_8 );
    for ( String line : lines )
    {
        System.out.println( "line = " + line );
    }
}
catch ( IOException e )
{
    e.printStackTrace();
}

When run:

line = test
line = τέστ

UTF-16 versus UTF-8

You said:

I think because UTF we have 2 bytes per character

No such thing as “UTF”.

  • UTF-16 encoding uses one or more pairs of octets per character.
  • UTF-8 encoding uses 1, 2, 3, or 4 octets per character.

Text content such as τέστ can be written to a file using either encoding, UTF-16 or UTF-8. Be aware that UTF-16 is “considered harmful”, and UTF-8 is generally preferred nowadays. Note that UTF-8 is a superset of US-ASCII, so any US-ASCII file is also a valid UTF-8 file.
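That superset claim can be verified in a couple of lines (a quick sketch, not part of the main example): encoding plain ASCII text as US-ASCII and as UTF-8 produces identical bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiIsUtf8 {
    public static void main(String[] args) {
        byte[] ascii = "test".getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = "test".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(ascii, utf8)); // true: identical bytes, byte for byte
    }
}
```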

Characters as code points

If you want to examine each character in text, treat the characters as code point numbers.

Never use the char type in Java. That type is unable to represent even half of the characters defined in Unicode, and is now obsolete.

We can interrogate each character in our example file seen above by adding these two lines of code.

IntStream codePoints = line.codePoints();
codePoints.forEach( System.out :: println );

Like this:

Path path = Paths.get( "/Users/basilbourque/some_text.txt" );
try
{
    List < String > lines = Files.readAllLines( path , StandardCharsets.UTF_8 );
    for ( String line : lines )
    {
        System.out.println( "line = " + line );
        IntStream codePoints = line.codePoints();
        codePoints.forEach( System.out :: println );
    }
}
catch ( IOException e )
{
    e.printStackTrace();
}

When run:

line = test
116
101
115
116
line = τέστ
964
941
963
964

If you are not yet familiar with streams, convert IntStream to a collection, such as a List of Integer objects.

Path path = Paths.get( "/Users/basilbourque/some_text.txt" );
try
{
    List < String > lines = Files.readAllLines( path , StandardCharsets.UTF_8 );
    for ( String line : lines )
    {
        System.out.println( "line = " + line );
        List < Integer > codePoints = line.codePoints().boxed().collect( Collectors.toList() );
        for ( Integer codePoint : codePoints )
        {
            System.out.println( "codePoint = " + codePoint );
        }
    }
}
catch ( IOException e )
{
    e.printStackTrace();
}

When run:

line = test
codePoint = 116
codePoint = 101
codePoint = 115
codePoint = 116
line = τέστ
codePoint = 964
codePoint = 941
codePoint = 963
codePoint = 964

Given a code point number, we can determine the intended character.

String s = Character.toString( 941 ) ; // έ character.

Be aware that some textual characters may be represented as multiple code points, such as a letter with a diacritical. (Text-handling is not a simple matter.)
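For example, the έ seen above exists both as a single precomposed code point (U+03AD) and, after canonical decomposition, as two code points (ε plus a combining acute accent). A sketch using java.text.Normalizer:

```java
import java.text.Normalizer;

public class Decomposed {
    public static void main(String[] args) {
        String composed = "\u03AD"; // έ as a single precomposed code point
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD); // ε + combining acute accent
        System.out.println(composed.codePoints().count());   // 1
        System.out.println(decomposed.codePoints().count()); // 2
        System.out.println(composed.equals(decomposed));     // false, though both render as έ
    }
}
```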

Basil Bourque