Base64.Decoder returning foreign characters

Question

I am building a small application that converts the contents of a text file to Base64 and then back again. The decoded text always comes back with some Chinese characters at the beginning of the first line.

public EncryptionEngine(File appFile){
    this.appFile= appFile;
}


public void encrypt(){

    try {
        byte[] fileText = Files.readAllBytes(appFile.toPath());// get file text as bytes

        Base64.Encoder encoder = Base64.getEncoder();
        PrintWriter writer = new PrintWriter(appFile);

        writer.print("");//erase old, readable text
        writer.print(encoder.encodeToString(fileText));// insert encoded text
        writer.close();


    } catch (IOException e) {

        e.printStackTrace();
    }

}

public void deycrpt(){

    try {
        byte[] fileText = Files.readAllBytes(appFile.toPath());

        String s = new String (fileText, StandardCharsets.UTF_8);//String s = new String (fileText);


        Base64.Decoder decoder = Base64.getDecoder();
        byte[] decodedByteArray = decoder.decode(s);

        PrintWriter writer = new PrintWriter(appFile);
        writer.print("");
        writer.print(new String (decodedByteArray,StandardCharsets.UTF_8)); //writer.print(new String (decodedByteArray));
        writer.close();


    } catch (IOException e) {

        e.printStackTrace();
    }



}

Text file before encrypt():

cheese

tomatoes

potatoes

hams

yams

Text file after encrypt():

//5jAGgAZQBlAHMAZQANAAoAdABvAG0AYQB0AG8AZQBzAA0ACgBwAG8AdABhAHQAbwBlAHMADQAKAGgAYQBtAHMADQAKAHkAYQBtAHMA

Text file after decrypt():

뿯붿cheese

tomatoes

potatoes

hams

yams

I'd strongly suspect inconsistent encodings being used. You haven't specified an encoding for either of your `PrintWriter`s. — Louis Wasserman, Apr 12 '18 at 22:36
I suspect the input text file starts with a `byte order mark` (0xEF 0xBB 0xBF). You can't see the `byte order mark` in Notepad on Windows. — , Apr 13 '18 at 00:32
@saka1029 That would be because the BOM is metadata, not text. Unicode-compliant text viewers and processors strip it off. If you have a hex byte viewer extension for Notepad++, it will show it, though. — Tom Blodget, Apr 13 '18 at 14:14

score 1 · Answer 1 · answered Apr 13 '18 at 13:14

Your input file is UTF-16, not UTF-8. It begins with FF FE, the little-endian byte order mark. StandardCharsets.UTF_16 will handle this correctly. (Or instead, set your text editor to UTF-8 instead of UTF-16.)
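
As a rough illustration of the first suggestion, here is a minimal sketch, assuming the file starts with the bytes FF FE followed by little-endian UTF-16 text. The class and method names are made up for this example. Java's UTF_16 decoder reads the byte order mark to pick the byte order and does not include it in the resulting string:

```java
import java.nio.charset.StandardCharsets;

class Utf16BomDemo {
    // Decode bytes as UTF-16; a leading BOM selects the byte order and is consumed.
    static String decodeUtf16(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // FF FE (little-endian BOM) followed by "ch" in UTF-16LE.
        byte[] raw = {(byte) 0xFF, (byte) 0xFE, 'c', 0, 'h', 0};
        String text = decodeUtf16(raw);
        System.out.println(text.length()); // 2: the BOM is not part of the text
        System.out.println(text);          // ch
    }
}
```

Keep in mind that encoding a string back with StandardCharsets.UTF_16 writes a big-endian BOM and big-endian data, so the bytes on disk would differ from the original file even though the text is the same.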

When you decoded FF FE as UTF-8, you got two replacement characters "��", one for each of the two bytes, because neither byte is valid in UTF-8. Then when you printed this out, each replacement character '�' was encoded as EF BF BD in UTF-8. Then you interpreted the result as UTF-16, taking the bytes in groups of two and reading them as EFBF BDEF BFBD. The remainder of the file was UTF-16 the whole time, but its null bytes round-trip safely through UTF-8.
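
The sequence described above can be reproduced with a short sketch. The class and helper names here are hypothetical, and the input is just the BOM plus the first two letters rather than the real file:

```java
import java.nio.charset.StandardCharsets;

class BomMangleDemo {
    // Hex dump helper, for illustration only.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // What decrypt() effectively does: decode as UTF-8, then re-encode as UTF-8.
    static byte[] roundTripAsUtf8(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // What a UTF-16 viewer then does with the first `length` bytes.
    static String misreadAsUtf16Le(byte[] bytes, int length) {
        String s = new String(bytes, 0, length, StandardCharsets.UTF_16LE);
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] original = {(byte) 0xFF, (byte) 0xFE, 'c', 0, 'h', 0};
        byte[] mangled = roundTripAsUtf8(original);
        System.out.println(hex(original));                // fffe63006800
        System.out.println(hex(mangled));                 // efbfbdefbfbd63006800
        System.out.println(misreadAsUtf16Le(mangled, 6)); // U+BFEF U+EFBD U+BDBF
    }
}
```

The two-byte BOM has become six bytes, which a UTF-16 reader sees as three unrelated code units, while the bytes for "ch" are untouched.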

(If the file were ASCII text encoded as UTF-16 without a byte-order mark, you would not have noticed how broken this was!)

Tom Blodget · Answer 2 · 2018-04-14T18:10:59.513

Your encrypt and decrypt functions don't make the same assumptions. encrypt Base64-encodes any file and is just fine except for the variable names and comments that suggest that the file is a text file. It need not be.

decrypt reverses the Base64-encoded data back to bytes but then "overprocesses" by assuming that the bytes are text encoded with UTF-8, decoding them and then re-encoding them before writing them to the file. If the assumption were true, this would just be a no-op; it's clearly not true in your case, and it mangles the data.

Perhaps you did that because you were trying to use a PrintWriter. In Java (and .NET), the multiple stream and file I/O classes are often confusing, especially considering their decades-long evolution. Sometimes there is one that does exactly what you need but it can be hard to find; other times, there isn't. And, sometimes, a commonly used library like Apache Commons fills the gap.

So, just write the bytes to the file. There are lots of modern and historical options, as explained in the answers to the question "byte[] to file in Java". Here's one with Files.write:

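// TRUNCATE_EXISTING discards the longer Base64 text already in the file;
// with CREATE alone, leftover Base64 characters would remain after the decoded bytes.
Files.write(appFile.toPath(), decodedByteArray, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);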
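
Putting that together, a bytes-only version of both directions might look like the sketch below. The class and method names are invented for this example, and it works on a temporary file rather than your appFile. It relies on Files.write with no options, which behaves as CREATE, TRUNCATE_EXISTING and WRITE:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Base64;

class Base64FileRoundTrip {
    // Replace the file's contents with their Base64 encoding; no charset involved.
    static void encodeInPlace(Path file) throws IOException {
        byte[] raw = Files.readAllBytes(file);
        Files.write(file, Base64.getEncoder().encode(raw));
    }

    // Replace Base64 contents with the decoded bytes, written out unchanged.
    static void decodeInPlace(Path file) throws IOException {
        byte[] encoded = Files.readAllBytes(file);
        Files.write(file, Base64.getDecoder().decode(encoded));
    }

    // Write the bytes to a temp file, encode, decode, and return what is on disk.
    static byte[] roundTrip(byte[] original) {
        try {
            Path file = Files.createTempFile("base64-demo", ".txt");
            try {
                Files.write(file, original);
                encodeInPlace(file);
                decodeInPlace(file);
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // UTF-16LE "ch" with a BOM, like the start of the file in the question.
        byte[] original = {(byte) 0xFF, (byte) 0xFE, 'c', 0, 'h', 0};
        System.out.println(Arrays.equals(original, roundTrip(original))); // true
    }
}
```

Because the bytes are never turned into a String, the result is identical to the input whatever encoding the text editor used.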

Note: While Base64 might have been considered encryption (and cracked) a couple of hundred years ago, it's not intended for that purpose. It's a bit dangerous (and confusing) to call it that.
