Reading a file using utf-8 that is encoded in utf-8 doesn't work, but reading the same file using "windows-1252" or "iso-8859-1" does

Question

What is happening here? Why does reading the file as UTF-8 print question marks in the console?

This is a minimal working example:

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
    
import static org.apache.commons.io.FileUtils.readFileToString;
import static org.apache.commons.io.FileUtils.writeStringToFile;
    
public class Main {
    
    public static void main(String... args) throws IOException {
    
        System.out.println("---------");
        System.out.println(Charset.defaultCharset());
        System.out.println("æ ø å");
        System.out.println("æ ø å");
        System.out.println("æ ø å");
    
        File inputFile  = new File(System.getProperty("user.dir") + "/input.md");
        File outputFile = new File(System.getProperty("user.dir") + "/output.md");
    
        String content, encoding;
    
        System.out.println("--------- windows-1252");
        encoding = "windows-1252";
        content = readFileToString(inputFile, encoding);
        System.out.println(content);
    
    
        System.out.println("--------- iso-8859-1");
        encoding = "iso-8859-1";
        content = readFileToString(inputFile, encoding);
        System.out.println(content);
    
    
        System.out.println("--------- utf-8");
        encoding = "utf-8";
        content = readFileToString(inputFile, encoding);
        System.out.println(content);
    
    
        writeStringToFile(outputFile, content, encoding);
    
    }
    
}

Where input.md contains: (encoded in UTF-8)

This is input.md. 'æ' 'ø' 'å'

Running the above code yields

---------
windows-1252
æ ø å
æ ø å
æ ø å
--------- windows-1252
This is file C. 'æ' 'ø' 'å'.
--------- iso-8859-1
This is file C. 'æ' 'ø' 'å'.
--------- utf-8
This is file C. '�' '�' '�'.

Why do I get � when I read the file using UTF-8? This is especially weird since the file is encoded in UTF-8.

UPDATE: My console is set to "UTF-8":

Here is a screenshot of the hex values of each char in string extracted from the input file:

Here is a better screenshot of the hex isolated:

An image of code is useless. We can’t copy and paste it, we can’t run it or test it, we can’t search it, sight impaired users can’t do anything with it, and speaking for myself, I can barely see it when it’s scaled to fit in the page. Please edit your question and augment that image with text—preferably one code-formatted text block containing your code, and another containing your output. — VGR, Jan 25 '21 at 15:09
Does it help to add `systemProp.file.encoding=utf-8` into `gradle.properties` file in the project root directory? Does it help to switch to **IntelliJ IDEA** in the Settings (Preferences on macOS) | Build, Execution, Deployment | Build Tools | Gradle | **Build and run using**? — Andrey, Jan 25 '21 at 17:16
Also try to set UTF-8 for the Settings (Preferences on macOS) | Editor | File Encodings | **Project Encoding**. — Andrey, Jan 25 '21 at 17:19
@Andrey Everything was already set to UTF-8 in the File Encodings tab under Settings. [screenshot](https://i.gyazo.com/fe716d80658af416987e46f5d1aedb56.png). And no, it doesn't help. — Simon Pedersen, Jan 25 '21 at 17:26

score 1 · Answer 1 · answered Jan 25 '21 at 19:11

The code looks fine to me, and your output.md file looks OK. So this is most likely just an issue with the console output.

The Unicode characters you are experimenting with are encoded as the same single bytes in both Windows-1252 and ISO-8859-1 (æ = 0xE6, ø = 0xF8, å = 0xE5), but are encoded as multiple bytes in UTF-8 (æ = 0xC3 0xA6, ø = 0xC3 0xB8, å = 0xC3 0xA5).
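The byte values above can be checked directly. This is a minimal sketch, not part of the original code: the class name `EncodedBytes` and the helper `hex` are made up for illustration, and it assumes the JVM ships the windows-1252 charset (it is not one of the charsets every JVM is required to support). Unicode escapes are used so the result does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedBytes {

    // Formats a byte array as space-separated uppercase hex, e.g. "C3 A6".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E6, U+00F8, U+00E5 written as escapes, so the source file's
        // own encoding cannot change the outcome.
        String[] letters = {"\u00E6", "\u00F8", "\u00E5"};
        Charset[] charsets = {
            Charset.forName("windows-1252"), // assumed to be available
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        for (Charset cs : charsets) {
            for (String letter : letters) {
                // One byte per letter in the two single-byte charsets,
                // two bytes per letter in UTF-8.
                System.out.printf("%s U+%04X -> %s%n",
                        cs.name(), (int) letter.charAt(0), hex(letter.getBytes(cs)));
            }
        }
    }
}
```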

Reading a UTF-8 encoded file as either Windows-1252 or ISO-8859-1 will decode each byte individually, producing a string containing a separate char for each byte, and those chars will have the same numeric values as the bytes. So, you should be ending up with a string containing chars 0x00C3 0x00A6, 0x00C3 0x00B8, and 0x00C3 0x00A5. Outputting those chars to the console as Windows-1252 should be showing as Ã¦ Ã¸ Ã¥, not as æ ø å.
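To see that byte-by-byte decoding in isolation, here is a small sketch that decodes the UTF-8 bytes of æ with each charset and lists the resulting chars as hex. The class `MisdecodeDemo` and the helper `codeUnits` are hypothetical names, and windows-1252 is again assumed to be available in the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {

    // Lists each char of s as a four-digit hex value, e.g. "00C3 00A6".
    public static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+00E6 is the two bytes C3 A6.
        byte[] utf8Bytes = "\u00E6".getBytes(StandardCharsets.UTF_8);

        // Single-byte charsets turn each byte into its own char.
        String asLatin1 = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        String asCp1252 = new String(utf8Bytes, Charset.forName("windows-1252"));
        // UTF-8 combines both bytes back into one char.
        String asUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println("iso-8859-1:   " + codeUnits(asLatin1)); // 00C3 00A6
        System.out.println("windows-1252: " + codeUnits(asCp1252)); // 00C3 00A6
        System.out.println("utf-8:        " + codeUnits(asUtf8));   // 00E6
    }
}
```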

On the other hand, reading a UTF-8 encoded file as UTF-8 will decode the file properly, producing a string with chars 0x00E6, 0x00F8, and 0x00E5. Writing that string to a UTF-8 encoded file should produce the correct byte sequences (0xC3 0xA6, 0xC3 0xB8, and 0xC3 0xA5). Outputting a string as Windows-1252 risks data loss in general, but here you should be seeing æ ø å as expected, since Windows-1252 does support those Unicode characters.

So, your results are actually backwards from what I would expect. Even though Charset.defaultCharset() is reporting Windows-1252, I suspect your console is actually using a different charset for its output.
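One mismatch that would produce exactly these "backwards" results is a JVM that encodes its output as Windows-1252 (as Charset.defaultCharset() reports) feeding a console that decodes the bytes as UTF-8 (as the question's update says). The sketch below only simulates that pipeline in memory. It is an assumption about the setup, not a confirmed diagnosis, and `ConsoleRoundTrip` and `shown` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleRoundTrip {

    // Simulates the full path of the text: the file bytes are decoded with
    // readAs, the JVM encodes the string for stdout with jvmOut, and the
    // console decodes those bytes with console.
    public static String shown(byte[] fileBytes, Charset readAs,
                               Charset jvmOut, Charset console) {
        String content = new String(fileBytes, readAs);
        byte[] wire = content.getBytes(jvmOut);
        return new String(wire, console);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed available
        Charset utf8 = StandardCharsets.UTF_8;

        // A quoted U+00E6 as it is stored in the UTF-8 file: 27 C3 A6 27.
        byte[] file = "'\u00E6'".getBytes(utf8);

        // Wrong decoding, but the bytes C3 A6 pass through unchanged and the
        // UTF-8 console reassembles them into U+00E6.
        String viaCp1252 = shown(file, cp1252, cp1252, utf8);
        // Correct decoding, but stdout emits the single byte E6, which is
        // malformed UTF-8, so the console shows U+FFFD.
        String viaUtf8 = shown(file, utf8, cp1252, utf8);

        System.out.printf("read as windows-1252: console shows U+%04X%n",
                (int) viaCp1252.charAt(1)); // U+00E6
        System.out.printf("read as UTF-8:        console shows U+%04X%n",
                (int) viaUtf8.charAt(1));   // U+FFFD
    }
}
```

If this is what is happening, the two wrong decodings cancel each other out on screen, while the correctly decoded string is the one that gets damaged on its way to the console.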

I suggest you print out the numeric values of the individual chars of the content string to see exactly how input.md is actually being decoded by each encoding. You should be getting the char values I mentioned above.
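A minimal sketch of that check, using only the JDK rather than commons-io: it reads the raw bytes once, decodes them with each charset, and prints the non-ASCII chars as U+XXXX tokens so the console's own encoding cannot distort the result. `CharDump` and `nonAscii` are hypothetical names, and the default path `input.md` is taken from the question.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CharDump {

    // Returns the non-ASCII chars of s as "U+XXXX" tokens. The output is
    // pure ASCII, so it prints the same on any console.
    public static String nonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("U+%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Defaults to input.md in the working directory; another file can be
        // passed as the first argument.
        Path input = Paths.get(args.length > 0 ? args[0] : "input.md");
        if (!Files.exists(input)) {
            System.err.println("File not found: " + input);
            return;
        }
        byte[] raw = Files.readAllBytes(input);
        for (String name : new String[] {"windows-1252", "iso-8859-1", "utf-8"}) {
            String content = new String(raw, Charset.forName(name));
            System.out.println(name + ": " + nonAscii(content));
        }
    }
}
```

For a file that really is UTF-8, the utf-8 line should list U+00E6 U+00F8 U+00E5, and the other two lines should list pairs starting with U+00C3.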

EDIT: I fixed the issue. You were right, it was related to the console output. — Simon Pedersen, Jan 25 '21 at 21:20

score 1 · Answer 2 · answered Jan 25 '21 at 21:22

For people with similar issues, the problem lies with the encoding of the console (as @Remy Lebeau points out too).

I fixed the issue by following this answer

Actually, I followed @Nicolas's comment on the mentioned answer:

This is also accessible from Help > Edit custom VM options... then restart IntelliJ. I literally tried everything: changing encoding settings everywhere in IntelliJ, changing JVM options set by properties file, build.gradle file, IntelliJ, run configuration, environment variable, etc. I also tried changing the system-wide encoding. Nothing worked but this.

Now I get the expected output:
