
I'm trying, on Windows 7, to capture the console output of a jar (written with System.out) and write it out as an XML file. This works, but I'm having encoding problems (e.g. with an "ë").

I have this code for reading the console output:

final LinkedList<String> texOutput = new LinkedList<String>();
final Process p = Runtime.getRuntime().exec("java -jar " + absoluteNameOfJar, null, tmpDir);
String line;
final BufferedReader output = new BufferedReader(new InputStreamReader(p.getInputStream(), "Cp1252"));
while ( (line = output.readLine()) != null) {
    texOutput.add(line);
}

And here's the code for writing the LinkedList to XML (using JDOM):

if (texOutput.size() > 0) {
    final Element xmlTeXOutput = new Element(XML_ELEMENT_KEY_TEX_OUTPUT);
    for (String line : texOutput) {
         final Element xmlLine = new Element(XML_ELEMENT_KEY_LINE);
         xmlLine.setText(line);
         xmlTeXOutput.addContent(xmlLine);
    }
    genOut.addContent(xmlTeXOutput);
}

With this I get encoding errors in the XML (from the wrongly converted "ë"): "Invalid byte 2 of 3-byte UTF-8 sequence".

I found these questions: "How to get console charset?" and "Java: How to determine the correct charset encoding of a stream". Neither gives me much hope: it seems I have to set the correct encoding for the InputStreamReader, but there appears to be no portable way to find the encoding actually used. Is there a way to fix this?

Oh, and if possible the solution should also work on MacOS. And I don't want to set the encoding of the XML to ISO-8859-1 (which seems to be the common workaround according to Google): UTF-8 should work.

EDIT: I write the XML file like this:

final XMLOutputter xmlOutputter = new XMLOutputter(Format.getPrettyFormat());
final String targetXMLFileName = FilenameUtils.concat(targetDirName, xmlID.getText() + "-out.xml");
final File targetXMLFile = new File(targetXMLFileName);
final FileWriter targetXMLFileWriter = new FileWriter(targetXMLFile);
xmlOutputter.output(xmlOutput, targetXMLFileWriter);
targetXMLFileWriter.close();
Martin Schröder
  • Is there a version of the code that I could run and try? Also why do you set Cp1252 encoding for the input stream if you want UTF-8? – Peter Szanto Nov 25 '11 at 15:56
  • @PeterSzanto: Because the input stream is a byte stream in some unknown encoding and must be converted. If I don't set the encoding I get the same error. – Martin Schröder Nov 25 '11 at 16:03

1 Answer


There are a number of potential problems here:

  • "Cp1252" is not the default system encoding that the other application is using with stdout
  • the default encoding is not Unicode (which can cause data loss)
  • there is a transcoding error serializing your DOM to the XML file

Verify that data is being read correctly from the other process. If the default encoding is causing an issue, you may want to write a wrapper app with a main method that sets stdout to a Unicode-encoded stream and then invokes the other application's main. Then decode within the above code using the same encoding.
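A minimal sketch of such a wrapper, assuming the other jar's entry point is called `RealMain` (a placeholder; here it is a stub that just prints a non-ASCII string):

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Stand-in for the other jar's entry point (the real class name is unknown).
class RealMain {
    public static void main(String[] args) {
        System.out.println("Zoë");  // sample non-ASCII output
    }
}

public class Utf8Wrapper {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Replace the default-encoded stdout with one that emits UTF-8 bytes.
        System.setOut(new PrintStream(System.out, true, "UTF-8"));
        // Delegate to the real application's main.
        RealMain.main(args);
    }
}
```

The consumer side then decodes symmetrically with `new InputStreamReader(p.getInputStream(), "UTF-8")`.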

There is also a hack involving the `file.encoding` system property, but it may cause unintended side effects.

If the problem is with serializing the XML it is likely that the data is being written with the wrong encoding even though the declaration is UTF-8. This commonly happens when serializing to a Writer as the serializer does not control the output encoding as it would with an OutputStream.


EDIT

The problem is here:

new FileWriter(targetXMLFile);

From the documentation:

Convenience class for writing character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable.

If you always want UTF-8, construct a stream that writes UTF-8.
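A minimal sketch of the fix, using only the standard library (with JDOM you could equivalently hand a raw `FileOutputStream` to `xmlOutputter.output(...)`, letting the outputter apply the encoding declared in its `Format`, which is UTF-8 by default):

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class Utf8WriterDemo {
    public static void main(String[] args) throws Exception {
        // Instead of new FileWriter(targetXMLFile), which silently uses the
        // platform default charset, wrap the stream in an explicit UTF-8 writer.
        final ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for a FileOutputStream
        final Writer writer = new OutputStreamWriter(sink, "UTF-8");
        writer.write("<line>Zoë</line>");
        writer.close();

        // "ë" is now the two-byte UTF-8 sequence 0xC3 0xAB, matching the
        // encoding="UTF-8" in the XML declaration.
        System.out.println(sink.toByteArray().length);
    }
}
```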

McDowell
  • I tried `Cp850` (which is the encoding according to Windows), but that still gave me the same errors in XML. – Martin Schröder Nov 25 '11 at 16:15
  • 1
    `System.out` would use windows-1252 (aka Cp1252) by default on a Western Windows system (even though the console uses old DOS OEM encodings). I don't have a Mac, but given the number of transcoding bugs that come up, I suspect it is x-MacRoman. Most modern Linux systems use UTF-8. – McDowell Nov 25 '11 at 16:20
  • Thanks, the tip with the FileWriter did it. Do I still have to specify an encoding when reading the console output? – Martin Schröder Nov 25 '11 at 16:33
  • 1
    The encoding using `System.out` and the decoding in your `InputStreamReader` must be symmetrical operations using the same encoding. If the stream producer JVM uses the default encoding, so must the consumer code. The only problem with relying on the default encoding is that it can be lossy - Java strings (UTF-16) and UTF-8 support thousands of code points; windows-1252 supports 256. Unsupported code points will be converted to question marks. – McDowell Nov 25 '11 at 16:39
  • O.K. So it will always _work_ (maybe lossy) if the producer and consumer are on the same platform, but a _proper_ solution would enforce UTF-8 in the producer and consumer (thus leading to possible display problems if the console doesn't support UTF-8). – Martin Schröder Nov 25 '11 at 16:49
  • 1
    Yes, indeed, with the big ‘argh’ being that on Windows, the console default encoding and the Java default character encoding aren't the same, and neither of them are ever UTF-8. – bobince Nov 26 '11 at 16:11