Read the documentation.
Here is the javadoc of DataOutputStream.writeChars:
Writes a string to the underlying output stream as a sequence of characters. Each character is written to the data output stream as if by the writeChar method. If no exception is thrown, the counter written is incremented by twice the length of s.
So... that doesn't explain much, other than 'look at the docs of writeChar', so let's do that:
Writes a char to the underlying output stream as a 2-byte value, high byte first. If no exception is thrown, the counter written is incremented by 2.
Ah. It writes 2 bytes, 'high byte' first. That explains why you have 6 things in your file when you writeChars("ABC"): each character takes up 2 bytes.
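A quick way to see this for yourself - a minimal sketch that writes to an in-memory buffer instead of a file, so the bytes are easy to inspect:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteCharsDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeChars("ABC"); // each char becomes 2 bytes, high byte first
        }
        byte[] bytes = buffer.toByteArray();
        System.out.println(bytes.length); // 6
        for (byte b : bytes) {
            System.out.printf("%02X ", b); // 00 41 00 42 00 43
        }
    }
}
```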
Encoding
Computers are number-based, so text is handled in the obvious way: there is a table that maps each character, such as 'a', to a number, such as 97.
There are many such mappings available. The most popular one, by far, is the Unicode mapping, which defines over a million possible code points. Not just 'a' and '0', but also 'tab', 'make a bell sound', 'I am not a real character I am just here so you can figure out the byte ordering', 'please put an umlaut on the previously rendered character', 'please switch to right-to-left rendering', 'the skin color of the next emoji is middle brown', and simply, 'poop emoji'.
A byte is 8 bits. Just like a single light switch (1 bit) can only represent 2 things ('on', and 'off', and that's all you can do with a single light switch), 8 bits can represent 256 unique states.
A byte is not enough to store a Unicode code point!! That much is obvious: with over a million code points, one byte can't possibly cover them all.
Given a bunch of text, we can express it as a sequence of Unicode numbers (code points). For example, the text "Hi ☃" (that's the Unicode snowman; yes, that's in there too) is the sequence: 72, 105, 32, 9731.
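You can ask Java for that sequence directly; a tiny sketch:

```java
public class CodePointsDemo {
    public static void main(String[] args) {
        // Prints the Unicode code points of "Hi ☃": 72, 105, 32, 9731
        "Hi ☃".codePoints().forEach(System.out::println);
    }
}
```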
We now need a way to store such a sequence back to a file. We can't just stick 72105329731 in there, because we'd have no idea where one character starts and the next one ends. We could just use 4 bytes per character (enough to store values up to 4 billion or so; every Unicode code point fits easily), but that's wasteful.
So, a very popular encoding is called UTF-8. UTF-8 encodes Unicode such that every code point between 0 and 127 takes only 1 byte. Yes, it's a variable number of bytes per code point.
Hence, "ABC" stored in UTF-8 is only 3 bytes large, because the codepoints for all 3 characters are in this 0-127 range. That range is called ASCII and contains the most obvious simple english-oriented characters: A-Z, a-z, 0-9, $, #, ", @, but no € or £ or ü or ç. All those characters (€, ç, etc) are in unicode, but storing them with the UTF-8 encoding system means they end up stored as 2 bytes (or possibly 3, I'd have to look up the codepoints).
Unicode isn't the only table around. It's designed to be universal, but before Unicode there were lots of more specific tables, such as ISO-8859-5, a page designed specifically for Cyrillic (used in Ukraine, Russia, Serbia, and Bulgaria, for example). These tables are much smaller (only 256 entries), so you can store them 1 byte per character, which on 30-year-old hardware was pretty much required for things to work quickly.
UTF-8 isn't the only system around to store these number sequences. There's also UCS-4 (a.k.a. UTF-32) and UTF-16, which even comes in variants (UTF-16BE and UTF-16LE).
UTF-8, UTF-16BE, ISO-8859-5 - these are called charset encodings.
Hence, any time you have a string and turn it into bytes, or vice versa, you ARE applying a charset encoding.
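For example, the same string run through a few different encodings; a sketch, assuming your JDK ships the optional UTF-32 charset (standard JDKs do):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizesDemo {
    public static void main(String[] args) {
        String s = "Hi ☃";
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 6: 1+1+1+3 bytes
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 8: 2 bytes per code unit
        // UTF-32 isn't in StandardCharsets, but standard JDKs include it
        System.out.println(s.getBytes(Charset.forName("UTF-32")).length); // 16: 4 bytes per code point
    }
}
```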
If you encode a string to bytes using encoding A and then convert the bytes back to a string using encoding B, you get gobbledygook out. This even has a fun name: Mojibake.
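A minimal sketch of how that happens - encode with UTF-8, decode with ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        byte[] bytes = "héllo".getBytes(StandardCharsets.UTF_8);         // 'é' becomes the 2 bytes C3 A9
        String garbled = new String(bytes, StandardCharsets.ISO_8859_1); // decoded with the wrong charset
        System.out.println(garbled); // hÃ©llo - classic mojibake
    }
}
```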
Stop writing new String(someByteArr) or someString.getBytes(): those force the JVM to guess which charset encoding you want. On JDK 18 and newer, the default is always UTF-8; before that it was "whatever your OS decided is the default for the system", which on Linux is usually UTF-8 but on Windows tends to be Cp1252 (windows-1252) or some other Windows-specific, ISO-8859-1-esque simpler encoding. Relying on that hidden default is exactly what's confusing you.
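Instead, always name the charset explicitly; a sketch using the StandardCharsets constants:

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    public static void main(String[] args) {
        String text = "Hi ☃";
        // Explicit charsets behave the same on every JDK version and every OS.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        String back = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(back.equals(text)); // true
    }
}
```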
Point is, "ABC", in almost all encodings, ends up being the same sequence of bytes... and that sequence consists of 3 bytes only.
Characters
In Java, a char is a UTF-16 code unit. Basically, char values are Unicode, but they cannot be larger than 16 bits, meaning they don't go beyond 65535. That covers a ton of Unicode (even the snowman), but not everything: emojis go beyond 65535. When Java was designed, 16 bits per char was seen as an acceptable compromise: it covers most characters you'd ever want to use, and it's still 'only' 16 bits, so e.g. strings represented in memory won't be too hard on your RAM requirements.
Hence, even though a Unicode code point can need more than 16 bits and UTF-8 can spend up to 4 bytes to store a single character, a char only takes 2 (and, as a consequence, something like an emoji actually takes up 2 chars in a Java string; anything beyond the 16-bit range ends up as a surrogate pair: 2 chars that can be reconstructed into a single Unicode code point).
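A sketch that shows this; the grinning-face emoji (U+1F600) is just an example of a code point beyond the 16-bit range:

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        String emoji = "😀"; // U+1F600
        System.out.println(emoji.length());                            // 2: two chars (a surrogate pair)
        System.out.println(emoji.codePointCount(0, emoji.length()));   // 1: but a single code point
        System.out.println(Integer.toHexString(emoji.codePointAt(0))); // 1f600
    }
}
```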
That is why writeChar spends 2 bytes: it just writes the char's 16-bit value as-is, high byte first. It doesn't do the fancy UTF-8 thing of having an algorithm that stores the value in a variable number of bytes.
Generally, you don't want to use writeChars or writeChar. It's needlessly inefficient for simplistic, mostly-ASCII text, and it's not more efficient when your text is all emojis either. It's from the days of yore when 'just encode it as UTF-8' was seen as particularly convoluted and problematic (not a deserved reputation; these days everybody loves UTF-8 and has seen the light).
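If you just want text in a file, write it as UTF-8 instead; a minimal sketch (the file name is made up):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteUtf8Demo {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("out.txt"); // hypothetical file name
        Files.writeString(file, "ABC", StandardCharsets.UTF_8);
        System.out.println(Files.size(file)); // 3: one byte per ASCII character
    }
}
```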