What is happening here? Why when I read the file using utf-8 does it output questionmarks in the console?
This is a minimal working example:
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import static org.apache.commons.io.FileUtils.readFileToString;
import static org.apache.commons.io.FileUtils.writeStringToFile;
public class Main {
public static void main(String... args) throws IOException {
System.out.println("---------");
System.out.println(Charset.defaultCharset());
System.out.println("æ ø å");
System.out.println("æ ø å");
System.out.println("æ ø å");
File inputFile = new File(System.getProperty("user.dir") + "/input.md");
File outputFile = new File(System.getProperty("user.dir") + "/output.md");
String content, encoding;
System.out.println("--------- windows-1252");
encoding = "windows-1252";
content = readFileToString(inputFile, encoding);
System.out.println(content);
System.out.println("--------- iso-8859-1");
encoding = "iso-8859-1";
content = readFileToString(inputFile, encoding);
System.out.println(content);
System.out.println("--------- utf-8");
encoding = "utf-8";
content = readFileToString(inputFile, encoding);
System.out.println(content);
writeStringToFile(outputFile, content, encoding);
}
}
Where input.md
contains: (encoded in UTF-8)
This is input.md. 'æ' 'ø' 'å'
Running the above code yields
---------
windows-1252
æ ø å
æ ø å
æ ø å
--------- windows-1252
This is file C. 'æ' 'ø' 'å'.
--------- iso-8859-1
This is file C. 'æ' 'ø' 'å'.
--------- utf-8
This is file C. '�' '�' '�'.
Why do I get �
when I read the file using UTF-8? This is especially weird since the file is encoded in UTF-8.
UPDATE: My console is set to "UTF-8":
Here is a screenshot of the hex values of each char in string extracted from the input file:
Here is a better screenshot of the hex isolated: