8

In a Java program, I spawn a new Process via ProcessBuilder.

args[0] = directory.getAbsolutePath() + File.separator + program;
ProcessBuilder pb = new ProcessBuilder(args);
pb.directory(directory);
final Process process = pb.start();

Then, I read the process standard output with a new Thread

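new Thread() {
    @Override
    public void run() {
        // try-with-resources closes the reader when the stream ends
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            // readLine() can throw IOException, which run() cannot declare
            e.printStackTrace();
        }
    }
}.start();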

However, when the process outputs non-ASCII characters (such as 'é'), the line has character '\uFFFD' instead.

What is the encoding in the InputStream returned by getInputStream (my platform is Windows in Europe)?

How can I change things so that line contains the expected data (i.e. '\u00E9' for 'é')?

Edit: I tried new InputStreamReader(...,"UTF-8"): é becomes \uFFFD

rds
  • BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8")); – Cris Dec 06 '11 at 10:30
  • @Cris please write an answer rather than a comment, if you want to answer – rds Dec 06 '11 at 10:43

8 Answers

9

An InputStream is a binary stream, so there is no encoding. When you create the Reader, you need to know what character encoding to use, and that would depend on what the program you called produces (Java will not convert it in any way).

If you do not specify anything for InputStreamReader, it will use the platform default encoding, which may not be appropriate. There is another constructor that allows you to specify the encoding.

If you know what encoding to use (and you really have to know):

new InputStreamReader(process.getInputStream(), "UTF-8") // for example
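
To illustrate why the choice matters, here is a minimal sketch, assuming the child process emits single-byte Latin-1 style text. The class `DecodeDemo` and its helper `firstLine` are made up for this example, and a `ByteArrayInputStream` stands in for `process.getInputStream()`. The same bytes are decoded once with ISO-8859-1 and once with UTF-8:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Hypothetical helper: reads the first line of a byte stream using an explicit charset.
    static String firstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "caf", then 0xE9 (which is 'é' in ISO-8859-1 and windows-1252), then a newline.
        // A lone 0xE9 followed by a newline is not a valid UTF-8 sequence.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9, '\n'};

        String asLatin1 = firstLine(new ByteArrayInputStream(raw), StandardCharsets.ISO_8859_1);
        String asUtf8 = firstLine(new ByteArrayInputStream(raw), StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 ends with U+00E9: " + asLatin1.endsWith("\u00E9"));
        System.out.println("UTF-8 ends with U+FFFD:      " + asUtf8.endsWith("\uFFFD"));
    }
}
```

Decoding with the wrong charset does not fail; the reader silently substitutes the replacement character U+FFFD, which matches the symptom in the question.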
Thilo
  • 1
    And as @AlexR points out, the same reasoning applies to writing data, too. – Thilo Dec 06 '11 at 10:31
  • 1
    UTF-8 is the default encoding in Java, so "UTF-8" cannot help. The solution is close, it just needs "Cp1252" or "ISO-8859-1" (depending on what `getInputStream()` returns) – rds Dec 15 '11 at 09:19
  • 2
    UTF-8 is *not* the default encoding in Java. There is no default at all, it always uses something platform dependent (which can be controlled by environment variables and system properties). Not something an application developer should usually rely on. Better to always be explicit in what encoding you want. – Thilo Jan 19 '15 at 11:13
  • UTF-16 is java's standard internal representation of characters. Hence the unsigned 16-bit 'char' primitive. The InputStreamReader will ALWAYS convert to UTF-16. Although the InputStream is a binary stream, if it represents characters the bytes will follow whatever encoding was used to create the resource. The InputStreamReader constructor mentioned by Thilo includes an argument to specify the encoding of that resource - how the stream should be treated. – Matthew Oakley Jun 07 '15 at 12:21
8

Interestingly enough, when running on Windows:

ProcessBuilder pb = new ProcessBuilder("cmd", "/c", "dir");
Process process = pb.start();

Then decoding the output with the CP437 code page works quite well:

new InputStreamReader(process.getInputStream(), "CP437");
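
As a rough sketch of why the console code page matters, the snippet below decodes the single byte 0x82 with three charsets. The class `ConsoleCodePageDemo` and helper `decodeWith` are hypothetical, and the charset names `IBM437` and `IBM850` are assumed to be available in the running JVM (the helper returns null when they are not):

```java
import java.nio.charset.Charset;

class ConsoleCodePageDemo {
    // Hypothetical helper: decodes raw bytes with a named charset,
    // or returns null when this JVM does not provide that charset.
    static String decodeWith(String charsetName, byte[] raw) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0x82 is the byte that the OEM code pages 437 and 850 use for 'é'.
        // In windows-1252 the same byte is a low quotation mark (U+201A).
        byte[] raw = {(byte) 0x82};
        for (String name : new String[] {"IBM437", "IBM850", "windows-1252"}) {
            String decoded = decodeWith(name, raw);
            String codePoint = decoded == null
                    ? "unsupported"
                    : String.format("U+%04X", (int) decoded.charAt(0));
            System.out.println(name + " -> " + codePoint);
        }
    }
}
```

Printing code points rather than the characters themselves avoids running into the same console encoding problem when inspecting the result.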
jan.supol
  • As others said, the InputStream contains characters in the platform encoding. Since I have a modern operating system, I have UTF-8; since you have Windows, you have CP437. – rds Aug 08 '15 at 09:40
  • 1
    Thanks, `CP437` was the only charset name that worked for me (Windows + Spanish characters) – IvanRF Oct 01 '15 at 15:32
  • 3
    Actually, nowadays, that should be CP850. The odd thing is that the whole Windows system seems to be set to windows-1252/cp1252 (at least in western Europe), but the console specifically uses CP850 instead. CP437 is the ancestor of CP850. Opening the command prompt and running "chcp" should tell you exactly which encoding it is using to print character data. – Etienne Delavennat Sep 14 '16 at 10:00
  • Also, the encoding to use for parsing the InputStream depends on what program the ProcessBuilder is built around. Let's say for example : CP850 for cmd, windows-1252 for some other windows tools you might invoke directly (without wrapping them in cmd), and possibly UTF-8 if the program you're calling outputs UTF-8. This is program-specific and should be looked up in the program's documentation. – Etienne Delavennat Sep 14 '16 at 10:07
  • 1
    Nice! I have checked some Windows 10 settings. For various European settings, it's CP850, but for the default (US) settings, it still remains CP437. – jan.supol Nov 24 '16 at 16:07
4

As I understand it, operating system streams are byte streams; there are no characters in them. The InputStreamReader constructor without a charset argument uses the JVM default character set (java.nio.charset.Charset#defaultCharset()). You can use another constructor to specify a character set explicitly.
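
A small sketch to check this on a given machine. The class `DefaultCharsetDemo` and helper `charsetOfPlainReader` are invented for the example, and an empty `ByteArrayInputStream` stands in for the process stream:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Returns the charset that an InputStreamReader picks when none is given.
    static Charset charsetOfPlainReader() {
        InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(new byte[0]));
        // getEncoding() may return a historical name such as "Cp1252" or "UTF8";
        // Charset.forName resolves such aliases to the canonical charset.
        return Charset.forName(reader.getEncoding());
    }

    public static void main(String[] args) {
        System.out.println("JVM default charset:    " + Charset.defaultCharset());
        System.out.println("Reader without charset: " + charsetOfPlainReader());
    }
}
```

Both lines should name the same charset, which is why the result of reading process output without an explicit charset varies from one machine to another.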

kan
2

According to http://www.fileformat.info/info/unicode/char/e9/index.htm '\uFFFD' is a unicode code for character 'é'. It actually means that you are reading the stream correctly. Your problem is in writing.

The Windows console does not support Unicode by default. So, if you want to test your code, open a file and write your stream there. But do not forget to set the encoding to UTF-8.
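
One possible way to run that test is sketched below. The class `Utf8FileDemo`, the helper `writeAsUtf8`, and the temporary file name are all made up for the illustration; the helper writes a string to a file as UTF-8 and returns the bytes that ended up on disk:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileDemo {
    // Hypothetical helper: writes text to a temporary file as UTF-8
    // and returns the raw bytes that were stored.
    static byte[] writeAsUtf8(String text) {
        try {
            Path file = Files.createTempFile("process-output", ".txt");
            try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                out.write(text);
            }
            byte[] bytes = Files.readAllBytes(file);
            Files.delete(file);
            return bytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A correctly decoded 'é' (U+00E9) is stored as the two bytes C3 A9 in UTF-8.
        for (byte b : writeAsUtf8("\u00E9")) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

Inspecting the bytes in the file takes the console out of the picture: if the string read from the process was already wrong, the file will contain the UTF-8 encoding of U+FFFD (EF BF BD) instead.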

AlexR
  • **Correct**. new PrintWriter(new OutputStreamWriter(..., "Cp1252")) where Cp1252 is Latin-1 with Windows extensions, as used in a small part of western Europe (France, Germany and some others). – Joop Eggen Dec 06 '11 at 10:35
  • 1
    Why do you point to character (`0xE9` that I want) when I have character `0xFFFD` aka 'REPLACEMENT CHARACTER' http://www.fileformat.info/info/unicode/char/fffd/index.htm – rds Dec 06 '11 at 10:42
1

Scientific

On Windows this works perfectly:

private static final Charset CONSOLE_ENCODING;
static {
    Charset enc = Charset.defaultCharset();
    try {
        String example = "äöüßДŹす";
        String command = File.separatorChar == '/' ? "echo " + example : "cmd.exe /c echo " + example;
        Process exec = Runtime.getRuntime().exec(command);
        InputStream inputStream = exec.getInputStream();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        while (exec.isAlive()) {
            Thread.sleep(100);
        }
        byte[] buff = new byte[inputStream.available()];
        if (buff.length > 0) {
            int count = inputStream.read(buff);
            baos.write(buff, 0, count);
        }

        byte[] array = baos.toByteArray();
        for (Charset charset : Charset.availableCharsets().values()) {
            String s = new String(array, charset);
            if (s.equals(example)) {
                enc = charset;
                break;
            }
        }
    } catch (InterruptedException e) {
        throw new Error("Could not determine console charset.", e);
    } catch (IOException e) {
        throw new Error("Could not determine console charset.", e);
    }
    CONSOLE_ENCODING = enc;
}

The specification gives no hint about whether the JVM's encoding can change at runtime. We cannot be sure that the encoding does NOT change while the program is running, or that the detected charset is still correct after such a change.

Grim
  • Hmmm... nice idea, but it actually doesn't work on my system (Windows 7 SP1, 64-bit, Java 8 build 71): none of the available encodings produces the original string. The problem seems to be that the given example string is not even transferred correctly to the system, producing "?" characters instead. Apart from that, I also get an additional "\r\n" line ending in the output. – Franz D. Dec 20 '17 at 11:53
1

If you, like me, know which encoding you want to use for all input/output, you can pass it in the Java API calls that create readers (some of them, not all, accept an encoding), as other answers have pointed out.

But this hard-codes it in the source, which might or might not be OK.

I found a better way after reading this answer, which reveals that you can set the encoding to what you need before the JVM starts up.

java -Dfile.encoding=ISO-8859-1 ...
thoni56
0

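Use StringEscapeUtils.escapeHtml from the commons-lang jar on the text you read:

BufferedReader br = new BufferedReader(
    new InputStreamReader(conn.getInputStream()));
// escapeHtml takes a String, not a stream, so escape each line after reading it
String escaped = StringEscapeUtils.escapeHtml(br.readLine());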
mkl
0

I put this as a comment, but I see there was an answer afterwards, so it might be redundant now :)

BufferedReader br = new BufferedReader(
    new InputStreamReader(conn.getInputStream(), "UTF-8"));
rds
Cris