
When I use Java's writeChars method of DataOutputStream and the write method of FileWriter, I get different results: writeChars leaves spaces in front of each character, but the write method of FileWriter does not.

When I use a FileOutputStream and write, for example,

FileOutputStream fos = new FileOutputStream(my_file);
fos.write("ABC".getBytes());

I see ABC in the txt file, but when I write

DataOutputStream dos = new DataOutputStream(new FileOutputStream(my_file));
dos.writeChars("ABC");

I see A B C (there is a space in front of each character), which is because char takes two bytes in java. But this is not the case when I use FileWriter: it writes only ABC, without spaces. What is the reason for that? How does FileWriter write characters consuming only one byte? Thanks in advance.

  • What do you mean by 'spaces'? Bytes with value 0x20, or gaps on a screen display? – undefined symbol Mar 06 '23 at 13:28
  • I mean gaps on a screen display. When I try to read them using readByte(), I get 0. When I write the character 'A' and call the readByte method two times to read it, I get 0 and 65 respectively. – Ahmet Cicek Mar 06 '23 at 13:40
  • Generally speaking: the internal layout/encoding of String or any other object has very little to do with the bytes that end up written to a file or stream. Both `OutputStream.write()` (as well as `String.getBytes()`) and `DataOutputStream.writeChars()` are exactly specified as to what output they produce and that specification doesn't depend on the internal structure/representation of `String` in Java. – Joachim Sauer Mar 06 '23 at 13:40
  • [This answer of mine](https://stackoverflow.com/a/5078365/40342) is about a very similar question in principle (despite looking at it from the reading perspective, most of the answer still applies). – Joachim Sauer Mar 06 '23 at 13:42
  • 1
    from the [documentation](https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/io/DataOutputStream.html#writeChar(int)) of `writeChar()`, referenced by the documentation of `writeChars()`: "*Writes a char to the underlying output stream as a 2-byte value, high byte first.*" - apparently the app used to show the file does *interpret* the high order byte (`0` for posted characters) as a space – user16320675 Mar 06 '23 at 13:59
  • I made a mistake writing CharWriter; it should be FileWriter. I fixed it. – Ahmet Cicek Mar 06 '23 at 14:49

2 Answers


Read the documentation.

Here is the javadoc of `DataOutputStream.writeChars`:

> Writes a string to the underlying output stream as a sequence of characters. Each character is written to the data output stream as if by the writeChar method. If no exception is thrown, the counter written is incremented by twice the length of s.

So... that doesn't explain anything, other than 'look at the docs of writeChar', so let's do that:

> Writes a char to the underlying output stream as a 2-byte value, high byte first. If no exception is thrown, the counter written is incremented by 2.

Ah. It writes 2 bytes, 'high byte' first. That explains why you have 6 bytes in your file when you writeChars "ABC": each character takes up 2 bytes.
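To see this concretely, here is a minimal sketch of my own (it writes to an in-memory `ByteArrayOutputStream` instead of a file, so the raw bytes are easy to inspect):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteCharsDemo {
    public static void main(String[] args) throws IOException {
        // Write to memory instead of a file so we can look at the raw bytes.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(buffer);
        dos.writeChars("ABC");

        // Prints: 0 65 0 66 0 67 - a zero 'high byte' before each letter.
        for (byte b : buffer.toByteArray()) {
            System.out.print(b + " ");
        }
    }
}
```

Those zero bytes are what your file viewer is rendering as 'spaces'.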


Encoding

Computers are number-based. Hence, computers handle text in the obvious way: there is a table that maps a character, such as 'a', to a number, such as 97.

There are many such mappings available. The most popular one, by far, is the unicode mapping. This maps millions of characters. Not just 'a' and '0', but also 'tab', 'make a bell sound', 'I am not a real character I am just here so you can figure out the byte ordering', 'please put an umlaut on the previously rendered character', 'please switch to right-to-left rendering', 'the skin color of the next emoji is middle brown', and simply, 'poop emoji'.

A byte is 8 bits. Just like a single light switch (1 bit) can only represent 2 things ('on', and 'off', and that's all you can do with a single light switch), 8 bits can represent 256 unique states.

A byte is not enough to store unicode!! That much is obvious: if there are millions of unicode characters, one byte is nowhere near enough.

Given a bunch of text, we can put it in terms of 'a sequence of unicode numbers'. For example, the text "Hi ☃" (That's the unicode snowman, yes, that's in there too) is the sequence: 72, 105, 32, 9731.
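A quick way to see that sequence for yourself (`codePoints()` is a standard method on `String`):

```java
public class CodePointsDemo {
    public static void main(String[] args) {
        // Prints: 72 105 32 9731 - one number per unicode code point.
        "Hi ☃".codePoints().forEach(cp -> System.out.print(cp + " "));
    }
}
```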

We now need a way to store such a sequence back to a file. We can't just stick 72105329731 in there, because we have no idea where one character starts and the other ends. We could just use 4 bytes per character (enough to store up to 4 billion or so, all unicode code points fit easily), but that's wasteful.

So, a very popular encoding is called UTF-8. UTF-8 encodes unicode such that all things with a 'codepoint' between 0 and 127 take only 1 byte. Yes, it's a variable number of bytes per codepoint.

Hence, "ABC" stored in UTF-8 is only 3 bytes large, because the codepoints for all 3 characters are in this 0-127 range. That range is called ASCII and contains the most obvious simple english-oriented characters: A-Z, a-z, 0-9, $, #, ", @, but no € or £ or ü or ç. All those characters (€, ç, etc) are in unicode, but storing them with the UTF-8 encoding system means they end up stored as 2 bytes (or possibly 3, I'd have to look up the codepoints).

Unicode isn't the only table around. It's designed to be universal, but before unicode there were lots of more specific tables, such as ISO-8859-5, a page specifically designed for cyrillic (used in Ukraine, Russia, Serbia, and Bulgaria, for example). These tables are much smaller (only 256 entries in them), so you can store them 1 byte per character, which, on 30-year-old hardware, was kinda required for things to work quickly.

UTF-8 isn't the only system around to store these number sequences. There's also UCS-4 (a.k.a. UTF-32) and UTF-16, which even comes in variants (UTF-16BE and UTF-16LE).

UTF-8, UTF-16BE, ISO-8859-5 - these are called charset encodings.

Hence, anytime you have a string and turn it into bytes and vice versa, you ARE applying a charset encoding.

If you encode a string to bytes using encoding A and then convert the bytes back to a string using encoding B, you get gobbledygook out. This even has a fun name: Mojibake.
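You can produce mojibake on purpose in a few lines; a sketch that encodes with one charset and decodes with another:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // Encode with UTF-8: the snowman becomes the 3 bytes E2 98 83.
        byte[] bytes = "snowman: ☃".getBytes(StandardCharsets.UTF_8);

        // Decode with ISO-8859-1: those 3 bytes come back as 'â' plus
        // two invisible control characters instead of '☃'.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
    }
}
```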

When you write new String(someByteArr) or someString.getBytes(), stop doing that: the JVM has to guess which charset encoding you want. On JDK 18+ it's always UTF-8 (JEP 400); before that, it's whatever your OS decided is the default for the system, which on linux is usually UTF-8 but on windows, for example, tends to be Cp1252 or some other windows-specific ISO-8859-1-esque simpler encoding. That's what is confusing you.
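The fix is to always name the charset explicitly, so the result is the same on every JVM and OS:

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    public static void main(String[] args) {
        // Explicit charset: no guessing by the JVM, no platform surprises.
        byte[] bytes = "ABC".getBytes(StandardCharsets.UTF_8);
        String back = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(back); // ABC
    }
}
```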

Point is, "ABC", in almost all encodings, ends up being the same sequence of bytes... and that sequence consists of 3 bytes only.

Characters

In java, a char is a single UTF-16 code unit (half of a surrogate pair, for characters that need one). Basically, char values are unicode, but can not be larger than 16 bits, meaning they don't go beyond 65535. That covers a ton of unicode (even the snowman), but not everything. Emojis go beyond 65535. When java was designed, 16 bits per char was seen as an acceptable compromise: it covers most characters you'd ever want to use, and is still 'only' 16 bits, so e.g. strings represented in memory won't be too hard on your RAM requirements.

Hence, even though unicode can require up to 4 bytes and UTF-8 can spend up to 4 bytes to store a single character, char only takes 2 (and, as a consequence, something like an emoji actually takes up 2 chars in a string in java; things beyond the range end up as a surrogate pair: 2 chars that can be reconstructed into a single unicode code point).
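You can observe the surrogate pair directly; here's a sketch using an emoji outside the 16-bit range:

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        String snowman = "☃";  // U+2603: fits in a single char
        String emoji = "😀";   // U+1F600: needs a surrogate pair

        System.out.println(snowman.length()); // 1 (one char)
        System.out.println(emoji.length());   // 2 (two chars!)
        // ... but it is still just one unicode code point:
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1
    }
}
```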

Hence, why writeChar spends 2 bytes. It doesn't do the fancy UTF-8 thing of having an algorithm that stores the value using a variable amount of bytes.

Generally, you don't want to use writeChar. It's needlessly inefficient on simplistic mostly ASCII text, and it's not more efficient when your text body is all emojis either. It's from the days of yore when 'just encode it as UTF-8' was seen as particularly convoluted and problematic (not a deserved reputation, and these days everybody loves UTF-8 and has seen the light).

rzwitserloot

Your FileOutputStream is given a byte array to write. You converted a String to the byte array. Since the String only contains characters in the range 0x00 to 0x7F, each character takes only one byte.

That is, `"ABC".getBytes().length` is 3.

String to bytes is a conversion operation (it applies a charset encoding); see the docs.

Your DataOutputStream is intended to write binary data, and that is what it does. Each character takes two bytes. One of the bytes will be 0x00, which you are apparently displaying as a space.
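That matches what you observed with readByte() in the comments. A small in-memory sketch of the round trip (using byte-array streams instead of your file, just for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class HighByteDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        new DataOutputStream(buffer).writeChars("A");

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(in.readByte()); // 0  - the high byte
        System.out.println(in.readByte()); // 65 - the low byte, 'A'
    }
}
```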

See the documentation for writeChar (writeChars is like a sequence of writeChar calls).