I receive XML from one third party encoded as UTF-8, and I need to send it to another third party encoded as ISO-8859-1. The XML contains text in many languages, e.g. Russian in Cyrillic. I know that such characters cannot be represented directly in ISO-8859-1, so they have to be written as numeric character references. I found StringEscapeUtils.escapeXml(), but it escapes the whole XML, including <, > and so on, whereas I only want to convert the characters that ISO-8859-1 cannot encode (such as Cyrillic) to numeric character references. Does such a method exist in Java, or do the available methods always escape the whole XML? Is there another way to convert only the characters that can't be encoded in ISO-8859-1?

I've seen similar questions on SO, such as "How do I convert between ISO-8859-1 and UTF-8 in Java?", but they don't mention numeric character references.

Michu93

1 Answer

UPDATE: Removed unnecessary DOM loading.

Use the XML transformer. It knows how to XML escape characters that are not supported by the given encoding.
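To make the escaping rule concrete, here is a minimal hand-written sketch of the idea. It is not the transformer's actual implementation, and the class and method names are hypothetical. Each code point is tested with `CharsetEncoder.canEncode`; anything the target charset cannot encode is written as a decimal character reference, and everything else is left untouched.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class NumericReferenceSketch {

    /** Replaces each code point the charset cannot encode with a decimal reference such as &#1055;. */
    static String escapeUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);                           // representable: keep as-is
            } else {
                out.append("&#").append(cp).append(';'); // not representable: numeric reference
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u011b is the Czech letter e with caron, which ISO-8859-1 cannot encode.
        String input = "<czech>Ahoj sv\u011bte</czech>";
        // prints <czech>Ahoj sv&#283;te</czech>
        System.out.println(escapeUnencodable(input, StandardCharsets.ISO_8859_1));
    }
}
```

A plain string replacement like this knows nothing about XML structure: it does not update the encoding in the XML declaration, and numeric references are not valid in element names, comments, or CDATA sections. That is why letting the transformer do the work is the more robust route.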

Example

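import java.io.File;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// A transformer created without a stylesheet copies the input to the output unchanged
Transformer transformer = TransformerFactory.newInstance().newTransformer();

// Convert XML file to UTF-8 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new StreamSource(new File("test.xml")),
                      new StreamResult(new File("test-utf8.xml")));

// Convert XML file to ISO-8859-1 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
transformer.transform(new StreamSource(new File("test.xml")),
                      new StreamResult(new File("test-8859-1.xml")));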

test.xml (input, UTF-8)

<?xml version="1.0" encoding="UTF-8"?>
<test>
  <english>Hello World</english>
  <portuguese>Olá Mundo</portuguese>
  <czech>Ahoj světe</czech>
  <russian>Привет мир</russian>
  <chinese>你好，世界</chinese>
  <emoji>👋 🌎</emoji>
</test>

Translated by https://translate.google.com (except emoji)

test-utf8.xml (output, UTF-8)

<?xml version="1.0" encoding="UTF-8"?><test>
  <english>Hello World</english>
  <portuguese>Olá Mundo</portuguese>
  <czech>Ahoj světe</czech>
  <russian>Привет мир</russian>
  <chinese>你好，世界</chinese>
  <emoji>&#128075; &#127758;</emoji>
</test>

test-8859-1.xml (output, ISO-8859-1)

<?xml version="1.0" encoding="ISO-8859-1"?><test>
  <english>Hello World</english>
  <portuguese>Olá Mundo</portuguese>
  <czech>Ahoj sv&#283;te</czech>
  <russian>&#1055;&#1088;&#1080;&#1074;&#1077;&#1090; &#1084;&#1080;&#1088;</russian>
  <chinese>&#20320;&#22909;&#65292;&#19990;&#30028;</chinese>
  <emoji>&#128075; &#127758;</emoji>
</test>

If you replace the test.xml with the test-8859-1.xml file (copy/paste/rename), you still get the same outputs, since the parser both auto-detects the encoding and unescapes all the escaped characters.
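That round trip can be checked in memory. The following is a minimal sketch, assuming the JDK's built-in transformer; the class and method names are hypothetical, and byte arrays stand in for the files used above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, for illustration only.
final class RoundTripSketch {

    /** Parses the XML bytes (input encoding is auto-detected) and re-serializes them in the given encoding. */
    static byte[] reencode(byte[] xml, String encoding) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new StreamSource(new ByteArrayInputStream(xml)),
                              new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws TransformerException {
        // The \u escapes spell the Russian word for "hello".
        byte[] utf8 = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<russian>\u041f\u0440\u0438\u0432\u0435\u0442</russian>")
                .getBytes(StandardCharsets.UTF_8);

        byte[] latin1 = reencode(utf8, "ISO-8859-1"); // Cyrillic becomes numeric references
        byte[] back = reencode(latin1, "UTF-8");      // references are unescaped again

        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
        System.out.println(new String(back, StandardCharsets.UTF_8));
    }
}
```

The first line printed should carry the Cyrillic text as numeric character references, and the second should contain the original characters again.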

Andreas
  • Thank you @Andreas, I'll test it in a second. The output of `transformer.transform` can also be written directly into a `String`, without temporary files. I found that I can use `new StreamSource(new StringReader(xmlInString))` to pass in a `String` instead of a file, and `new StreamResult(writer)` for the output. – Michu93 Apr 14 '20 at 09:27
  • @Michu93 Yes, look at the javadoc of [`transform()`](https://docs.oracle.com/javase/8/docs/api/javax/xml/transform/Transformer.html#transform-javax.xml.transform.Source-javax.xml.transform.Result-). Input is a `Source` (one of `DOMSource`, `SAXSource`, `StAXSource`, `StreamSource`). Output is a `Result` (one of `DOMResult`, `SAXResult`, `StAXResult`, `StreamResult`). `StreamSource` has constructors for `File`, `InputStream`, and `Reader`, and `StreamResult` has constructors for `File`, `OutputStream`, and `Writer`, so you can use `StringReader` and `StringWriter` for pure in-memory processing. It's very flexible. – Andreas Apr 14 '20 at 09:34
  • It's worth adding `transformerFactory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);`, otherwise Sonar will complain: https://rules.sonarsource.com/java/RSPEC-4435 – Michu93 Apr 22 '20 at 08:12
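Putting the comments together, a purely in-memory variant with secure processing enabled might look like the sketch below. The class and method names are hypothetical, and it assumes the JDK's built-in transformer. Because the result is a `String`, nothing has been encoded to bytes yet: the caller still has to write it out as ISO-8859-1.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, for illustration only.
final class InMemoryLatin1Sketch {

    /** Returns the XML with every character that ISO-8859-1 cannot encode written as a numeric reference. */
    static String escapeForLatin1(String xml) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        // The setting suggested in the comment above.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

        Transformer transformer = factory.newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");

        StringWriter writer = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // The \u escapes spell the Russian word for "hello".
        String xml = "<russian>\u041f\u0440\u0438\u0432\u0435\u0442</russian>";
        System.out.println(escapeForLatin1(xml));
    }
}
```

When the result is sent on, it can be encoded with `getBytes(StandardCharsets.ISO_8859_1)`; every remaining character is representable, so nothing should be replaced with `?`.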