
I have a .txt file with some "ASCII art" that I want to display in the cmd console. The file is encoded as UTF-8, but the default charset of my JVM is windows-1252. I tried converting the UTF-8 chars to a byte array and then converting the byte array back to a String with UTF-8 encoding. With this approach, some chars are displayed correctly on the cmd console, but many are just replaced by '?', and the text mysteriously stops in the second line (the "ASCII art" is much longer).

One of the UTF-8 chars that was replaced by '?' is '∩'. The program worked fine in my IDE, though, because there the default charset is UTF-8.

Is there a statement I can add to my Java program that tells the JVM to switch to the UTF-8 charset for this program? Or what else could I change so that this piece of art shows up in my cmd console?

import java.io.File;
import java.util.Scanner;
import java.io.FileNotFoundException;
import java.io.UnsupportedEncodingException;

public class ASCIIfromtxt {
    public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException {

        File artFile = new File("C:/Users/MyName/IdeaProjects/ASCIIArt from Textfile/out/production/ASCIIArt from Textfile/Art.txt");
        Scanner scan = new Scanner(artFile);

        while (scan.hasNextLine()) {
            String nextLineString = scan.nextLine();
            byte[] nextLineBytes = nextLineString.getBytes();
            String win1252Str = new String(nextLineBytes, "UTF-8");
            System.out.println(win1252Str);
        }


    }
}
Mark Rotteveel
Sequencing

2 Answers


I think displaying the full range of UTF-8 characters depends on the capabilities of the shell being used, but in Java you can use the following code to set the JVM encoding:

System.setProperty("file.encoding","UTF-8");

and regarding your code snippet: the getBytes() method has an overload that takes a Charset argument:

public byte[] getBytes(Charset charset)

so if you really want to get the bytes of a String as UTF-8, use this overload:

byte[] nextLineBytes = nextLineString.getBytes(Charset.forName("UTF-8"));
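One caveat with this suggestion: file.encoding is normally read once at JVM startup, so setting it from running code usually has no effect on the already-initialized System.out. A minimal sketch of an alternative that does work from code is to wrap System.out in a PrintStream with an explicit UTF-8 charset (the class name Utf8Out is illustrative):

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Out {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Replace stdout with a PrintStream that encodes output as UTF-8 (autoflush enabled).
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        System.setOut(utf8Out);
        System.out.println("∩");
    }
}
```

For the character to actually render in cmd, the console code page must also be UTF-8 (chcp 65001) and the console font must contain the glyph.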
Saeed Alizadeh
  • Using `String.getBytes` with UTF-8 would lead to undesirable behaviour, because the OP is trying to reconstitute UTF-8 bytes that were read into a string using Cp1252. That means the OP first needs to obtain the original bytes, so they need to encode the string back to bytes using Cp1252 (the default encoding in this case), and then create a string from those bytes using UTF-8. Note that this scheme will not always work, because some bytes are not mapped in Cp1252. – Mark Rotteveel Nov 24 '20 at 13:05
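The round trip the comment describes can be sketched like this (the three bytes are the UTF-8 encoding of '∩'; the repair only succeeds because all three happen to be mapped in Cp1252):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // The UTF-8 bytes of '∩', mis-decoded with the platform default Cp1252
        // (this is what Scanner produced for the OP).
        String mojibake = new String(new byte[] {(byte) 0xE2, (byte) 0x88, (byte) 0xA9}, cp1252);
        // Encode back with Cp1252 to recover the original bytes, then decode as UTF-8.
        String fixed = new String(mojibake.getBytes(cp1252), StandardCharsets.UTF_8);
        System.out.println(fixed);
    }
}
```

This prints '∩' (assuming the console can render it); with any of the five unmapped Cp1252 bytes in the input, the first decode would already have produced '?' and the original bytes would be unrecoverable.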

The specific character you use as an example is round-trippable through Cp1252 to UTF-8. The problem is that you're trying to print a UTF-8 character to a console whose character set is Cp1252, so the character is not available there and is instead rendered as '?'.

However, your current solution is brittle and will fail for byte sequences containing bytes that are not mapped in Cp1252 (i.e. 0x81, 0x8D, 0x8F, 0x90 and 0x9D, which all map to ? (0x3F)). It would be better to initialize the scanner with the specific character set:

Scanner scan = new Scanner(file, StandardCharsets.UTF_8);

To do what you want, you may also need to change the code page of your terminal before running your application (though this doesn't always have the desired effect):

chcp 65001
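Putting this together, a minimal self-contained sketch of the corrected read loop (it writes a small UTF-8 file to a temp location to stand in for Art.txt; note that the Scanner(File, Charset) and PrintWriter(File, Charset) constructors require Java 10 or later — on older versions use the String charset-name overloads):

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class Utf8ScannerDemo {
    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 "art" file to a temp location (stands in for Art.txt).
        File artFile = File.createTempFile("Art", ".txt");
        artFile.deleteOnExit();
        try (PrintWriter out = new PrintWriter(artFile, StandardCharsets.UTF_8)) {
            out.println("┌───┐");
            out.println("│ ∩ │");
            out.println("└───┘");
        }
        // Decode the file as UTF-8 instead of the platform default (windows-1252),
        // so no byte-array round trip through the default charset is needed.
        try (Scanner scan = new Scanner(artFile, StandardCharsets.UTF_8)) {
            while (scan.hasNextLine()) {
                System.out.println(scan.nextLine());
            }
        }
    }
}
```

Run with chcp 65001 in effect so the console can display what the program emits.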
Mark Rotteveel