Why does UTF-8 encoding not work for special character in process input stream?

Question

My previous question was marked as a duplicate of "Which encoding does Process.getInputStream() use?", but that is not what I'm asking. In my second example, UTF-8 decodes the special character successfully. However, when the same character is read from the process input stream, UTF-8 no longer decodes it correctly. Why does this happen, and does it mean ISO_8859_1 is the only option I can choose?

I'm working on a plugin that retrieves an Azure Key Vault secret at runtime, and I've hit an encoding issue. I stored a string containing the special character ç: HrIaMFBc78!?%$timodagetwiçç99. With the following program, the special character ç is not decoded correctly:

package com.buildingblocks.azure.cli;

import java.io.*;
import java.nio.charset.StandardCharsets;

public class Test {
    static String decodeText(String command) throws IOException, InterruptedException {
        Process p;
        StringBuilder output = new StringBuilder();
        p = Runtime.getRuntime().exec("cmd.exe /c \"" + command + "\"");
        p.waitFor();
        InputStream stream;
        if (p.exitValue() != 0) {
            stream = p.getErrorStream();
        } else {
            stream = p.getInputStream();
        }
        BufferedReader reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line = "";
        while ((line = reader.readLine()) != null) {
            output.append(line + "\n");
        }
        return output.toString();
    }

    public static void main(String[] arg) throws IOException, InterruptedException {
        System.out.println(decodeText("az keyvault secret show --name \"test-password\" --vault-name \"test-keyvault\""));
    }
}

The output is: "value": "HrIaMFBc78!?%$timodagetwi��99"

If I use the following program to parse the string, the special character ç is decoded successfully:

package com.buildingblocks.azure.cli;

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Test {
    static String decodeText(String input, String encoding) throws IOException {
        return
                new BufferedReader(
                        new InputStreamReader(
                                new ByteArrayInputStream(input.getBytes()),
                                Charset.forName(encoding)))
                        .readLine();
    }

    public static void main(String[] arg) throws IOException {
        System.out.println(decodeText("HrIaMFBc78!?%$timodagetwiçç99", StandardCharsets.UTF_8.toString()));
    }
}

Both programs use a BufferedReader with the same setup, but the one parsing the process output fails. Does anybody know the reason for this?

cmd.exe suggests Windows Server is used. Are you sure it runs in UTF-8? You should use the character set the platform is using. Actually, Java should default to the platform-native encoding. — eis, Sep 08 '21 at 15:44
I think you should use `Charset.defaultCharset()` to get the encoding of the system. Don't just use `ISO_8859_1` on all Windows platforms. — the Hutt, Sep 08 '21 at 16:35

score 1 · Answer 1 · answered Sep 08 '21 at 16:10

You are reading with UTF-8:

 BufferedReader reader = new BufferedReader(
        new InputStreamReader(stream, StandardCharsets.UTF_8));

Your second example encodes the String with input.getBytes(), which uses the JVM default charset (evidently UTF-8 in your setup), so the bytes can be read back with the code above and everything works.

Your first example, however, runs cmd.exe (so Windows) and reads the bytes produced by the child process. On Windows the default charset is normally CP1252, not UTF-8.

You can either set the default character encoding for Windows to UTF-8 (see "Save text file in UTF-8 encoding using cmd.exe" for a how-to), or pass the system encoding of your OS (normally CP1252 on Windows) to the InputStreamReader instead of StandardCharsets.UTF_8.

Better to make the system encoding an application setting or use `Charset.defaultCharset()`. Assuming `CP1252` is not good. — the Hutt, Sep 08 '21 at 16:41
Yes, an assumption is not very good. Either determine the OS charset programmatically and use it for reading, or make it an application setting as you suggest. Good point. — de-jcup, Sep 09 '21 at 07:35
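As a rough sketch of the "application setting" idea from the comments: the helper below takes a configured charset name and falls back to Charset.defaultCharset() when nothing is configured. The class name ProcessCharset and the property name secret.cli.charset are made up for illustration. Note that on JDK 18 and later Charset.defaultCharset() is UTF-8 unless file.encoding is overridden, so the fallback may not match the Windows code page there.

```java
import java.nio.charset.Charset;

public class ProcessCharset {
    // Returns the configured charset, or the JVM default when none is configured.
    // An unknown or malformed name makes Charset.forName throw, which is left
    // to propagate so that a typo in the setting is noticed early.
    static Charset resolve(String configuredName) {
        if (configuredName == null || configuredName.trim().isEmpty()) {
            return Charset.defaultCharset();
        }
        return Charset.forName(configuredName.trim());
    }

    public static void main(String[] args) {
        // "secret.cli.charset" is a hypothetical setting, e.g. -Dsecret.cli.charset=windows-1252
        Charset cs = resolve(System.getProperty("secret.cli.charset"));
        System.out.println("Decoding process output as " + cs.name());
    }
}
```

The resolved charset would then replace StandardCharsets.UTF_8 in the InputStreamReader constructor.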

score 0 · Answer 2 · answered Sep 08 '21 at 15:36

The ç takes two bytes in UTF-8 encoding, so two of them would be four bytes. The two placeholder characters � suggest that only two bytes were there. In ISO 8859-1 encoding, a ç is one byte, so this suggests that the encoding was not UTF-8 but may have been ISO 8859-1.

The InputStream does not use any encoding, it just transfers the bytes. The encoding is used in the InputStreamReader.
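To make the byte counts concrete, here is a small sketch. The class name CedillaBytes and its helper are made up for illustration, and it assumes the process really sent single-byte ISO 8859-1 (or CP1252) data, which the question does not confirm. It encodes çç both ways and then decodes the single-byte form as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class CedillaBytes {
    // Decodes raw bytes as UTF-8; invalid sequences become U+FFFD.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String cedillas = "\u00e7\u00e7"; // "çç", written with escapes to avoid source-encoding issues

        // Two bytes per ç in UTF-8 (c3 a7), one byte per ç in ISO 8859-1 (e7).
        System.out.println(cedillas.getBytes(StandardCharsets.UTF_8).length);      // 4
        System.out.println(cedillas.getBytes(StandardCharsets.ISO_8859_1).length); // 2

        // Single-byte data read as UTF-8: each lone 0xe7 is an invalid sequence.
        byte[] latin1 = "\u00e7\u00e799".getBytes(StandardCharsets.ISO_8859_1);
        String garbled = decodeAsUtf8(latin1);
        System.out.println(garbled.equals("\uFFFD\uFFFD99")); // true
    }
}
```

The byte 0xe7 announces a three-byte UTF-8 sequence, but the byte that follows is not a continuation byte, so the decoder substitutes one replacement character for each ç.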

A hex-dump of the input might be useful. Alternatively, you can try to interpose a script between the Java program and the program you want to call, and analyse the situation there. Or just try with ISO 8859-1 instead.
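A hex dump along those lines could be produced with something like the following sketch. The class HexDump and its two helpers are hypothetical names; in the question's program, readAll would be called on p.getInputStream() before any Reader is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Reads the stream to the end without applying any charset.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated two-digit lowercase hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the process output: "çç" encoded as ISO 8859-1.
        InputStream in = new ByteArrayInputStream(
                "\u00e7\u00e7".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(toHex(readAll(in))); // e7 e7
    }
}
```

Seeing e7 for each ç would point to a single-byte encoding such as ISO 8859-1 or CP1252, while c3 a7 would mean the stream is UTF-8 after all.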

DuncG · Answer 3 · answered Sep 08 '21 at 16:04

The CMD.EXE you launch with ProcessBuilder / Runtime.getRuntime sends its output in the platform's default charset. This is not necessarily UTF-8, nor necessarily the same as your JVM default charset (which you may have changed with the system property -Dfile.encoding=XYZ).

You may be able to determine the charset of the CMD.EXE stream for use in your first method by opening CMD.EXE and checking which value of file.encoding is printed when you run the JVM without extra parameters:

C:\> java -XshowSettings:properties
Property settings:
...
file.encoding = Cp1252    (or whatever)
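The same values can be read from inside the program. The sketch below prints three properties; EncodingProps is a made-up class name. As far as I know, native.encoding is only set on JDK 17 and later, and sun.jnu.encoding is an internal property that may not exist everywhere. On JDK 18 and later file.encoding defaults to UTF-8, so native.encoding is the more likely indicator of the Windows code page there.

```java
public class EncodingProps {
    // Formats one system property, with a marker when it is absent.
    static String describe(String key) {
        return key + " = " + System.getProperty(key, "(not set)");
    }

    public static void main(String[] args) {
        System.out.println(describe("file.encoding"));
        System.out.println(describe("native.encoding"));  // assumption: JDK 17+
        System.out.println(describe("sun.jnu.encoding")); // internal, unsupported property
    }
}
```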

score 0 · Answer 4 · answered Sep 08 '21 at 16:02

The charset you select in Java should match the encoding used by the command you execute. It's not UTF-8, and is probably ISO-8859-1. Because the encoding used by the command is likely to default to something different on different machines, you might try setting it explicitly to a known value before executing your command:

chcp 65001 && <command>

Or, in your context:

Runtime.getRuntime().exec("cmd.exe /c \"chcp 65001 && " + command + "\"");

Windows code page 65001 is UTF-8.

Note that failing to consume the output of the subprocess can cause it to block, and never terminate, so your waitFor() may block because you consume the output afterward. The standard output of the process may have a large enough buffer to complete, but if there is output to standard error, it is more likely to block. An alternative is to direct standard error to the stderr of the parent Java process.
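One way to avoid that blocking problem is sketched below, under the assumption that the caller passes the command as a list plus the charset to decode with. RunAndRead and run are illustrative names, not part of the question's plugin. Standard error is sent to the parent's stderr, standard output is read to the end, and only then is waitFor() called.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

public class RunAndRead {
    // Runs the command, returns its standard output decoded with the given charset.
    static String run(List<String> command, Charset charset)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Send the child's stderr to this JVM's stderr so it cannot fill a pipe.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();

        StringBuilder output = new StringBuilder();
        // Drain stdout first; calling waitFor() before this could deadlock.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }

        int exitCode = p.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode);
        }
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java RunAndRead <command> [args...]");
            return;
        }
        System.out.print(run(Arrays.asList(args), Charset.defaultCharset()));
    }
}
```

For the question's case the list would be something like cmd.exe, /c and the az command, with the charset chosen to match what the command actually emits.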
