In Java, String ≠ byte[].
byte[] represents raw binary data.
String represents text, which has an associated charset/encoding that tells you which characters it represents.
Binary Data ≠ Text.
Text data inside a String has Unicode/UTF-16 as charset/encoding (or Unicode/mUTF-8 when serialized). Whenever you convert from something that is not a String to a String or vice versa, you need to specify a charset/encoding for the non-String text data (even if you do it implicitly, using the platform's default charset).
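As a minimal sketch of that point: the same String produces different bytes under different charsets, and arbitrary binary data does not survive a trip through a String. The class name TextVsBinary and the helper survivesRoundTrip are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TextVsBinary {
    // Hypothetical helper: decode the bytes as text, encode them again,
    // and report whether the original bytes came back unchanged.
    static boolean survivesRoundTrip(byte[] data, Charset charset) {
        String asText = new String(data, charset);
        return Arrays.equals(data, asText.getBytes(charset));
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // "hello" with an e-acute as second letter

        // Same text, different byte sequences depending on the charset.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length);   // 6
        System.out.println(latin1.length); // 5

        // Real text round-trips when the same charset is used both ways.
        System.out.println(survivesRoundTrip(utf8, StandardCharsets.UTF_8)); // true

        // Binary data does not: 0x89 alone is malformed UTF-8 and gets replaced.
        byte[] binary = {(byte) 0x89, 'P', 'N', 'G'};
        System.out.println(survivesRoundTrip(binary, StandardCharsets.UTF_8)); // false
    }
}
```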
A PNG file contains raw binary data that represents an image (and associated metadata), not text. Therefore, you should not treat it as text.
\x89PNG is not text, it's just a "magic" header for identifying PNG files. 0x89 isn't even a character; it's just an arbitrary byte value, and its only sane representations for display are things like \x89, 0x89, ... Likewise, the PNG there is really binary data: it could just as well have been 0xdeadbeef and nothing would have changed. The fact that PNG happens to be human-readable is just a convenience.
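To make that concrete, here is a rough sketch that checks the magic header by comparing bytes, never Strings. The full 8-byte signature used here (0x89 P N G CR LF 0x1A LF) comes from the PNG specification rather than from the question; the class and method names are invented for the example.

```java
import java.util.Arrays;

class PngSignature {
    // The 8-byte PNG signature as defined by the PNG specification.
    private static final byte[] SIGNATURE = {
        (byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'
    };

    // Hypothetical helper: true if the data starts with the PNG signature.
    static boolean looksLikePng(byte[] data) {
        return data.length >= SIGNATURE.length
            && Arrays.equals(Arrays.copyOf(data, SIGNATURE.length), SIGNATURE);
    }

    public static void main(String[] args) {
        byte[] png = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n', 0, 0};
        byte[] notPng = {'P', 'N', 'G'};
        System.out.println(looksLikePng(png));    // true
        System.out.println(looksLikePng(notPng)); // false
    }
}
```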
Your problem comes from the fact that your protocol mixes text and binary data, while Java (unlike some other languages, like C) treats binary data differently than text.
Java provides *InputStream for reading binary data, and *Reader for reading text. I see two ways to deal with input:
- Treat everything as binary data. When you read a whole text line, convert it into a String, using the appropriate charset/encoding.
- Layer an InputStreamReader on top of an InputStream, access the InputStream directly when you want binary data, and access the InputStreamReader when you want text.
You may want buffering. In the second case, the correct place to put it is below the *Reader. If you used a BufferedReader, it would probably consume more input from the InputStream than it should. So, you would have something like:
┌───────────────────┐
│ InputStreamReader │
└───────────────────┘
↓
┌─────────────────────┐
│ BufferedInputStream │
└─────────────────────┘
↓
┌─────────────┐
│ InputStream │
└─────────────┘
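You would use the InputStreamReader to read text, then you would use the BufferedInputStream to read an appropriate amount of binary data from the same stream. Be aware that the InputStreamReader documentation allows it to read ahead more bytes than the current read operation needs, so check that it does not swallow part of your binary data.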
A problematic case is recognizing both "\r" (old MacOS) and "\r\n" (DOS/Windows) as line terminators. In that case, you may end up reading one character too many. You could take the approach that the deprecated DataInputStream.readLine() method took: transparently wrap the internal InputStream into a PushbackInputStream and unread that character.
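A minimal sketch of that pushback idea, assuming the caller already holds a PushbackInputStream (ByteLineReader and its readLine are names invented here, not an existing API). It collects the bytes of one line, pushes back the byte after a lone "\r", and only then converts to a String:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLineReader {
    // Reads one line terminated by "\n", "\r" or "\r\n".
    // Returns null if the stream is already at its end.
    static String readLine(PushbackInputStream in, Charset charset) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null;
        }
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        while (b != -1 && b != '\n' && b != '\r') {
            line.write(b);
            b = in.read();
        }
        if (b == '\r') {
            // Look one byte ahead; if it is not the "\n" of a "\r\n" pair,
            // it belongs to whatever follows, so put it back.
            int next = in.read();
            if (next != '\n' && next != -1) {
                in.unread(next);
            }
        }
        return new String(line.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        byte[] input = {'a', '\r', 'b', '\r', '\n', 'c', '\n', (byte) 0x89};
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(input));
        System.out.println(readLine(in, StandardCharsets.US_ASCII)); // a
        System.out.println(readLine(in, StandardCharsets.US_ASCII)); // b
        System.out.println(readLine(in, StandardCharsets.US_ASCII)); // c
        System.out.println(in.read() == 0x89);                       // true
    }
}
```

Note that the look-ahead after "\r" will block on a live connection until the peer sends another byte, which is one more reason to prefer plain "\r\n" or "\n" terminators when you control the protocol.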
However, since you don't appear to have a Content-Length, I would recommend the first way: treat everything as binary, and convert to String only after reading a whole line. In this case, I would treat the MIME delimiter as binary data.
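Treating the delimiter as binary data could look roughly like this: encode the delimiter to bytes once, then search for that byte sequence in the received bytes. The delimiter value "--boundary", the class name and the naive indexOf helper are all assumptions made for the sketch; a real message would supply its own boundary.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DelimiterSearch {
    // Naive search: index of the first occurrence of pattern in data
    // at or after position from, or -1 if it does not occur.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical delimiter, converted to bytes exactly once.
        byte[] delimiter = "\r\n--boundary".getBytes(StandardCharsets.US_ASCII);
        byte[] payload = {(byte) 0x89, 'P', 'N', 'G', 0x00, (byte) 0xFF};

        // Simulated input: binary payload followed by the delimiter.
        byte[] message = new byte[payload.length + delimiter.length];
        System.arraycopy(payload, 0, message, 0, payload.length);
        System.arraycopy(delimiter, 0, message, payload.length, delimiter.length);

        int end = indexOf(message, delimiter, 0);
        byte[] body = Arrays.copyOfRange(message, 0, end);
        System.out.println(end);                          // 6
        System.out.println(Arrays.equals(body, payload)); // true
    }
}
```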
Output:
Since you are dealing with binary data, you cannot just println() it. PrintStream has write() methods that can deal with binary data (e.g. for outputting to a binary file).
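Something along these lines might work for mixed output: encode the text part to bytes explicitly and push everything through write(). The header text and the names MixedOutput and writeMessage are placeholders for the example; since PrintStream is an OutputStream, System.out could be passed in as well.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class MixedOutput {
    // Writes a text header followed by raw binary data to the same stream.
    // The header is converted to bytes with an explicit charset first.
    static void writeMessage(OutputStream out, String header, byte[] payload) throws IOException {
        out.write(header.getBytes(StandardCharsets.US_ASCII));
        out.write(payload); // raw bytes, no charset involved
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = {(byte) 0x89, 'P', 'N', 'G'};
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeMessage(sink, "Content-Type: image/png\r\n\r\n", payload);
        System.out.println(sink.size()); // 31 (27 header bytes + 4 payload bytes)
    }
}
```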
Or maybe your data has to be transported on a channel that treats it as text. Base64 is designed for that exact situation (transporting binary data as ASCII text). The Base64-encoded form uses only US-ASCII characters, so you should be able to use it with any charset/encoding that is a superset of US-ASCII (ISO-8859-*, UTF-8, CP-1252, ...). Since you are converting binary data to/from text, the only sane API for Base64 would be something like:
String Base64Encode(byte[] data);
byte[] Base64Decode(String encodedData);
which is basically what the internal java.util.prefs.Base64 uses.
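If you are on Java 8 or later (an assumption, the question does not say), the public java.util.Base64 class offers the same byte[]-to-String shape, so there is no need to touch the internal class. A small sketch, with the wrapper names chosen only for illustration:

```java
import java.util.Arrays;
import java.util.Base64;

class Base64RoundTrip {
    // Binary data in, ASCII text out.
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // ASCII text in, binary data out.
    static byte[] decode(String encodedData) {
        return Base64.getDecoder().decode(encodedData);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0x89, 'P', 'N', 'G'};
        String text = encode(data);
        System.out.println(text);                             // iVBORw==
        System.out.println(Arrays.equals(data, decode(text))); // true
    }
}
```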
Conclusion:
In Java, String ≠ byte[].
Binary Data ≠ Text.