
I have a legacy service that returns an XML string from the database. In one particular scenario, the string it returns contains the non-breaking space character (`&nbsp;`, char xA0). I recently moved this service to a new Windows 10 machine, and when I wrote the string to a file there, the XML file became unparseable. When I opened the file from the old machine on the new machine, I saw that it was UTF-8 encoded, whereas the new machine was writing the file as ANSI. So I started writing the file as UTF-8. The file is now parseable and exactly matches the file on the old machine.

The remaining issue is that the service still sends the XML string with the non-breaking space in it. Because I now write the file as UTF-8, the local file has the characters "Â " where the string sent by the service has the char xA0. The comparison logic compares these two strings and reports a difference, when the only actual difference is the encoding.

I am fairly sure that UTF-8 is the encoding I want for the files, because the files are then identical on both machines. How do I convert the string sent by the service to UTF-8, so that a difference is reported only when there is a real difference? Encoding is really confusing to me. Please help me understand what is actually happening here.

Another thing to note is that the XML file on the old Windows 7 machine shows its encoding as ANSI, but when I copy that file to my new Windows 10 machine, the encoding shows as UTF-8. I check the encoding using Notepad (by opening the Save dialog). Was there some issue on Windows 7 that was fixed in Windows 10 and that explains why the same file shows a different encoding on the two machines?
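
For a file without a byte order mark, the encoding label an editor shows is a heuristic guess, so it may be more reliable to look at the bytes themselves. The following is a minimal sketch, not taken from the service code: the class name `Utf8Check` and the sample byte arrays are illustrative assumptions. It uses a strict `CharsetDecoder` to test whether a byte sequence is well-formed UTF-8. A non-breaking space stored as the single byte `0xA0` fails the check, while the two-byte sequence `0xC2 0xA0` passes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true only if the bytes form well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data: "a", a non-breaking space, "b".
        byte[] singleByteNbsp = {'a', (byte) 0xA0, 'b'};            // single-byte ("ANSI") form
        byte[] utf8Nbsp = {'a', (byte) 0xC2, (byte) 0xA0, 'b'};     // UTF-8 form

        System.out.println(isValidUtf8(singleByteNbsp)); // prints false
        System.out.println(isValidUtf8(utf8Nbsp));       // prints true
    }
}
```

In practice the bytes would come from reading the saved XML file, for example with `Files.readAllBytes`, rather than from hard-coded arrays.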

I already asked a question regarding this. I answered my own question as I did solve the parsing issue by writing the file in UTF-8 encoding.

I already tried the following:

    byte[] bytes = retVal.getBytes(StandardCharsets.UTF_8);
    retVal = new String(bytes, StandardCharsets.UTF_8);

retVal is the string sent by the service. When comparing retVal and the string written to the file, I still get a difference.
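
For reference, a minimal sketch of why the encode/decode pair above cannot change the string: a Java `String` is already a sequence of Unicode characters, so encoding it to UTF-8 bytes and decoding those same bytes as UTF-8 yields an equal string. The class name `RoundTripDemo` and the sample text are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Encode to UTF-8 bytes, then decode those same bytes as UTF-8 again.
    public static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample: \u00A0 is the non-breaking space.
        String original = "hello\u00A0world";
        String result = roundTrip(original);

        // The round trip is lossless, so the strings compare equal.
        System.out.println(original.equals(result)); // prints true
    }
}
```

If the two strings being compared differ, the difference was therefore introduced somewhere else, at a point where bytes were decoded into a `String` with a different charset than the one used to write them.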

This is the code I use to get the string from the service:

    String req() throws Exception {
        HttpClient client = new HttpClient() {};
        client.getParams().setParameter("http.useragent", "Service");

        String url = "url";

        // Generate the request body
        String reqBody = generateRequestBody(params);
        // Build the POST request
        PostMethod method = new PostMethod(url);
        method.setRequestBody(reqBody);

        String retVal = "";
        // Execute the HTTP call
        int returnCode = client.executeMethod(method);

        if (returnCode == HttpStatus.SC_OK) {
            // Parse the response body as XML
            DOMParser parser = new DOMParser();
            parser.parse(new InputSource(method.getResponseBodyAsStream()));
            Document doc = parser.getDocument();
            doc.setXmlStandalone(true);
            NodeList nList = doc.getElementsByTagName("tag1");
            Node node = nList.item(0);

            // Convert the node to a String and return it
            retVal = nodeToString(node);
        }
        return retVal;
    }

    private String nodeToString(Node node) {
        StringWriter sw = new StringWriter();

        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
            t.transform(new DOMSource(node), new StreamResult(sw));
        } catch (TransformerException te) {
            LOG.info(getStacktraceFromException(te));
            LOG.error("Exception during String to XML transformation ", te);
        }
        return sw.toString();
    }

So I tried to fix the encoding at the source, but unfortunately that did not work either. This is my new nodeToString method.

    private String nodeToString(Node node){
        StringWriter sw = new StringWriter();
        String strRepeatString = "";
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            StreamResult sr = new StreamResult(new OutputStreamWriter(bos, "UTF-8"));

            t.transform(new DOMSource(node), sr);
            byte[] outputBytes = bos.toByteArray();
            strRepeatString = new String(outputBytes, "UTF-8");

        } catch (TransformerException te) {
            LOG.info(getStacktraceFromException(te));
            LOG.error("Exception during String to XML transformation ", te);
        } catch (UnsupportedEncodingException ex) {
            LOG.info("Error");
        }
        return strRepeatString;
    }

On comparing strRepeatString with the local file saved using UTF-8 encoding (the code can be found in the answer to the question), I still get a difference at the char Â.
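
One possible explanation, which nothing in the post confirms, is that the file is written correctly as UTF-8 but read back with a single-byte charset such as windows-1252 before the comparison. The sketch below illustrates that mechanism only. The class name `MojibakeDemo`, the sample text, and the choice of windows-1252 as the wrong charset are all assumptions. It shows that the UTF-8 bytes of a non-breaking space (`0xC2 0xA0`), decoded as windows-1252, turn into `Â` followed by a non-breaking space, while decoding the same bytes as UTF-8 gives back the original string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decode UTF-8 bytes with the wrong charset, as a reader that relies on
    // a legacy single-byte default might do.
    public static String decodeUtf8BytesAsWindows1252(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    // Decode the same bytes with the charset they were written in.
    public static String decodeUtf8BytesAsUtf8(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample: \u00A0 is the non-breaking space.
        String fromService = "hello\u00A0world";

        String wrong = decodeUtf8BytesAsWindows1252(fromService);
        String right = decodeUtf8BytesAsUtf8(fromService);

        // \u00C2 is the character Â.
        System.out.println(wrong.equals("hello\u00C2\u00A0world")); // prints true
        System.out.println(right.equals(fromService));              // prints true
    }
}
```

If this is what is happening, reading the file with an explicit `StandardCharsets.UTF_8` rather than the platform default would be the place to look, instead of converting the string that comes from the service.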

  • It sounds like you want to convert the [HTML entity](https://developer.mozilla.org/en-US/docs/Glossary/Entity) `&nbsp;` (the [non-breaking space](https://www.fileformat.info/info/unicode/char/00a0/index.htm) character) to its UTF-8 equivalent. If so, does this answer your question: [How to convert from HTML to UTF-8 in java](https://stackoverflow.com/questions/2825985/how-to-convert-from-html-to-utf-8-in-java)? – andrewJames Jun 20 '21 at 19:47
  • See also [here](https://stackoverflow.com/questions/17163151/how-to-convert-html-entities-in-java) and [here](https://stackoverflow.com/questions/2141027/converting-html-character-encoding-in-java) and probably similar questions. – andrewJames Jun 20 '21 at 19:47
  • @andrewjames Please take a look at the updated post. – arielBodyLotion Jun 20 '21 at 19:54
  • This pair of statements: `byte[] bytes = retVal.getBytes(StandardCharsets.UTF_8); retVal = new String(bytes, StandardCharsets.UTF_8);` is a no-op. You convert a String (which is always UTF-16) to some UTF-8 bytes and convert those UTF-8 bytes back to a String. – iggy Jun 20 '21 at 20:13
  • How does the code in your update relate to the various solutions in the linked answers? In the first link I mention, I do notice that `org.apache.commons.lang.StringEscapeUtils` has been deprecated in favor of `org.apache.commons.text.StringEscapeUtils`. – andrewJames Jun 20 '21 at 20:14
  • The damage to your data was done by whatever converted the bytes from the 'service' into a Java String. The conversion needs to use the correct encoding, and apparently it did not. Can you make it use the correct encoding? Do you know what it is? (Despite what Windows might say, "ANSI" is not an encoding). – iggy Jun 20 '21 at 20:15
  • “…the logic now compares these two strings and finds a difference…” What logic, exactly? If you’re using a proper XML parser, `&nbsp;` should be present in the parsed String value as a single non-breaking space character and nothing more. – VGR Jun 20 '21 at 20:39
  • @iggy I think what you are saying makes a lot of sense. I have added the code used to create the xml String from the service response. Do you think method.getResponseBodyAsStream() should be encoded? Please check updated post. – arielBodyLotion Jun 20 '21 at 20:47
  • @VGR I have updated the post with how I get the fresh xml string from the service and the logic I have stated above is to compare this string from service with the string in the locally written file using StringUtils.indexOfDifference. – arielBodyLotion Jun 20 '21 at 20:49
  • @here Sorry for the bad indentation, I'll fix it. Just wanted to post the code for a quick response. – arielBodyLotion Jun 20 '21 at 20:50
  • @andrewjames I saw all the links. But I am more convinced by iggy's comment. You should also look at that once IMO. Thanks. – arielBodyLotion Jun 20 '21 at 20:57
  • @iggy You might also want to look at the solution of the question I asked before, it is also linked in the post. The encoding I want is UTF-8. It is the encoding I write the xml string in. – arielBodyLotion Jun 20 '21 at 21:04
  • I’m pretty sure your nodeToString method will return something like `"<tag1>value</tag1>"`. Is that really what you want? Or do you just want the string value of the tag1 element’s content (in this example, `"value"`)? – VGR Jun 20 '21 at 21:10
  • @VGR No I want "<tag1>value</tag1>". The only problem is the encoding, everything else is fine. – arielBodyLotion Jun 20 '21 at 21:12
  • Okay, so you want XML markup. In that case, `hello&#160;world` is exactly the same as `hello&#0160;world`, `hello&#xa0;world`, `hello&#xA0;world`, `hello&#x00a0;world`, and `hello world` (that last one contains a non-breaking space, not an ASCII space). That is simply how XML works. You have to use an XML parser if you want to perform a reliable comparison. – VGR Jun 20 '21 at 21:47
  • I just noticed your edit comment [here](https://stackoverflow.com/posts/68059299/timeline#history_1b356d1c-81bd-4da7-8754-9eb2f5a2e93f). Can you clarify what "_does not work_" means? If you have a string containing `hello&nbsp;world`, more than one of the linked answers will convert it to `hello world`, where the space between `hello` and `world` is a non-breaking space. (But **100% yes**, I agree that it would be better to address the issue "at source", as mentioned by others.) – andrewJames Jun 20 '21 at 23:44
  • So the nbsp, when written using UTF-8 encoding, becomes the char Â. Now there is a difference between hello world and helloÂworld. And you are absolutely right that XML comparison should be done using an XML parser. But I don't think that will solve my issue. – arielBodyLotion Jun 21 '21 at 05:57
  • @andrewjames I have tried to address the encoding issue while converting the node to a string. But that is still not working correctly. Please look at the updated post. – arielBodyLotion Jun 21 '21 at 06:02
  • I take your point. Sorry for the unhelpful suggestions. – andrewJames Jun 21 '21 at 12:37
  • @iggy Any comments on why the solution in my updated post does not do the job? – arielBodyLotion Jun 21 '21 at 17:54
  • Nope; I don't see anywhere obvious where bytes are converted to a Java String. – iggy Jun 21 '21 at 23:36
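
The suggestion in the comments to compare with an XML parser rather than character by character can be sketched as follows. This is an illustration under assumptions, not the comparison logic from the post: the class name `XmlCompareDemo`, the helper names, and the sample documents are made up. Both strings are parsed into DOM trees and compared with `Node.isEqualNode`, so different spellings of the same character (a decimal reference, a hex reference, or the literal character) compare equal, while a string that really contains an extra `Â` does not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlCompareDemo {
    // Parse an XML string into a DOM document. Checked exceptions are wrapped
    // to keep the sketch short.
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // Merge adjacent text nodes so the tree shape does not depend on
            // how the parser happened to split the text.
            doc.getDocumentElement().normalize();
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Compare the parsed trees instead of the raw markup.
    public static boolean sameXml(String a, String b) {
        return parse(a).getDocumentElement().isEqualNode(parse(b).getDocumentElement());
    }

    public static void main(String[] args) {
        // Hypothetical sample documents.
        String decimalRef = "<tag1>hello&#160;world</tag1>";
        String hexRef = "<tag1>hello&#xA0;world</tag1>";
        String literal = "<tag1>hello\u00A0world</tag1>";
        String withExtraChar = "<tag1>hello\u00C2\u00A0world</tag1>"; // contains Â

        System.out.println(sameXml(decimalRef, hexRef));     // prints true
        System.out.println(sameXml(hexRef, literal));        // prints true
        System.out.println(sameXml(literal, withExtraChar)); // prints false
    }
}
```

As the last case shows, a parsed comparison removes differences in markup spelling, but it does not hide a string that was decoded with the wrong charset before it reached the comparison.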
