
I have a value in my database that contains a non-breaking space character. A legacy service reads this string from the database and builds an XML document from it. The problem is that the XML returned for this message is unparseable. When I open it in Notepad++ I see the character xA0 in place of the non-breaking space, and once I remove that character the XML parses. I also have older revisions of this XML file, produced by the same service, which show "Â " in place of the non-breaking space. I recently changed the Tomcat server the service runs on, and something seems to have gone wrong as a result. I found this post, according to which my XML is encoded as ISO-8859-1, but the code I use to convert the XML to a string does not specify ISO-8859-1. Below is my code:

private String nodeToString(Node node) {
    StringWriter sw = new StringWriter();
    try {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
        t.transform(new DOMSource(node), new StreamResult(sw));
    } catch (TransformerException te) {
        LOG.error("Exception during String to XML transformation ", te);
    }
    return sw.toString();
}

I want to know why my XML is unparseable, and why there is a "Â " in the older revisions of the XML file.

Here is an image of the problematic character in Notepad++: [image: the xA0 character shown in Notepad++]

Also, when I open my XML in Notepad and try to save it, the encoding is shown as ANSI. When I change it to UTF-8 and save, the XML becomes parseable.

New info: enforcing UTF-8 with transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); did not work. I am still getting the xA0 in my XML.

  • Are other characters encoded correctly, or is this a more general encoding problem? (I suspect the latter) – Hulk Feb 03 '21 at 14:44
  • Other characters seem fine, there is only a problem with the non breaking space character. – arielBodyLotion Feb 03 '21 at 14:47
  • So characters like, say, umlauts `ÄÖÜäöü` are working? (Just picking them, because I've got them readily available on my keyboard ;-) depending on your environment, other test strings may be easier to create). – Hulk Feb 03 '21 at 14:50
  • I wonder if enforcing the encoding can help? transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); – JCompetence Feb 03 '21 at 14:56
  • I can't access the DB currently, so I can't check these test strings. But the question I would like to highlight again is: why does xA0 make the XML unparseable? Is xA0 the correct char for nbsp in UTF-8, and what encoding made nbsp appear as "Â " in older revisions? – arielBodyLotion Feb 03 '21 at 15:04
  • @SusanMustafa Sure I can do that, but before that I would like to know what is the issue here, what encoding is currently being used, what was being used before I made the change in the server. Thanks. – arielBodyLotion Feb 03 '21 at 15:09
  • @arielBodyLotion I like how you think. hmm, based on my quick research, this weird A character you are seeing might be the hex representation of a nbsp...   is another synonym, in hex. – JCompetence Feb 03 '21 at 15:20
  • @SusanMustafa Can you have a look at the post I linked in my question? According to it, 0xA0 is the ISO-8859-1 representation of nbsp. Is that correct? Is my XML currently in ISO-8859-1 encoding? – arielBodyLotion Feb 03 '21 at 15:24
  • Hi @arielBodyLotion I think it might have to do with your Tomcat. It defaults to ISO-8859-1: https://cwiki.apache.org/confluence/display/TOMCAT/Character+Encoding However, I believe it is important that you add more code samples, or describe the overall flow, so we understand how your String is converted into XML, not just Node to String. What does your XML look like, for example? Is any encoding set there? Do you use any stylesheets? – JCompetence Feb 04 '21 at 08:38
  • The encoding supplied to the transformer's serializer is not going to make any difference because you are sending the transformed output to a StringWriter. It's what you do with the string returned by your `nodeToString()` method that matters. – Michael Kay Feb 04 '21 at 08:51
  • @SusanMustafa The XML header specifies UTF-8 encoding, and there are no stylesheets in the XML. I'm using version 8.5.57 of the Tomcat server. – arielBodyLotion Feb 04 '21 at 11:35
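
Michael Kay's point about StringWriter can be illustrated with a minimal sketch. The class and method names (`WriterVsBytesDemo`, `sampleDocument`, `serializeToString`) and the `<msg>` element are made up for illustration and are not taken from the asker's service. The sketch assumes the JDK's built-in transformer: setting `OutputKeys.ENCODING` changes what the XML declaration says, but the result is still a Java String, and the actual bytes are only decided when that String is later written out with some charset.

```java
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class WriterVsBytesDemo {

    // Hypothetical sample: <msg> containing "a", a non-breaking space, "b".
    public static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("msg");
            root.setTextContent("a\u00A0b");
            doc.appendChild(root);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializes to a String; ENCODING only influences the declaration text
    // here, because a StringWriter holds characters, not bytes.
    public static String serializeToString(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter sw = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(sw));
            return sw.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = serializeToString(sampleDocument());
        System.out.println(xml);
        // The same String yields different byte sequences per charset;
        // the declaration inside it does not change.
        System.out.println(xml.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(xml.getBytes(StandardCharsets.ISO_8859_1).length);
    }
}
```

If the two printed lengths differ, the file written with the single-byte charset no longer matches its own UTF-8 declaration, which is the kind of mismatch a parser rejects.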

1 Answer


The issue was that my Java code was saving the file in what Notepad calls ANSI encoding, presumably because the writer fell back to the platform default charset after the server change. I saw this when I opened the file in Notepad and tried to save it. The older files were in UTF-8. So all I did was specify UTF-8 encoding explicitly when writing the file.

// Write the XML with an explicit charset instead of the platform default.
try (Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(fileName.trim()), StandardCharsets.UTF_8))) {
    out.write(data);
}
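
As for why xA0 breaks the parser and where "Â " comes from, here is a small sketch of the byte-level picture. The class and helper names (`NbspEncodingDemo`, `hex`, `isValidUtf8`) are invented for illustration. It relies on two facts: the non-breaking space U+00A0 is the single byte A0 in ISO-8859-1 (and in windows-1252, which Notepad labels ANSI), but the two bytes C2 A0 in UTF-8. A lone A0 is not a valid UTF-8 sequence, so a parser that trusts a UTF-8 declaration rejects the file. Conversely, a correct UTF-8 file viewed as a single-byte encoding shows C2 A0 as "Â" followed by a non-breaking space.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class NbspEncodingDemo {

    // Space-separated uppercase hex dump of a string's bytes in a charset.
    public static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Strict check: true only if the bytes decode as UTF-8 without errors.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0";

        System.out.println(hex(nbsp, StandardCharsets.UTF_8));      // C2 A0
        System.out.println(hex(nbsp, StandardCharsets.ISO_8859_1)); // A0

        // A lone A0 byte is a stray continuation byte, so it is not UTF-8.
        System.out.println(
                isValidUtf8(nbsp.getBytes(StandardCharsets.ISO_8859_1))); // false

        // Correct UTF-8 bytes misread as ISO-8859-1 give U+00C2 ("Â") + U+00A0.
        String misread = new String(
                nbsp.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals("\u00C2\u00A0")); // true
    }
}
```

Under this reading, the older files were fine (UTF-8 bytes displayed through a single-byte view), and the newer ones were the broken ones (single-byte output under a UTF-8 declaration).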