Converting ASCII literal chars on UTF-8 to special chars

Question

I've found thousands of similar questions on the web, but none of them describes the same problem I have.

I'm using a third-party JSON web API, but the JSON it returns sometimes contains special characters that are printed wrongly over HTTP.

e.g.: {"message": "Usu\u00e1rio n\u00e3o encontrado", "status": "fail"}

It should be: {"message": "Usuário não encontrado", "status": "fail"}

I have no control over the backend API, and I've tried everything to tell the server to answer in UTF-8. My request has these headers:

Accept: */*;charset=UTF-8
Accept-Charset: UTF-8

but the server keeps answering with the wrong characters... So I've tried to read the raw HTTP response and decode it myself:

byte[] temp = resp.errorBody().bytes();
errorResponse = new String(temp);
errorResponse = new String(temp,"UTF-8");
errorResponse = new String(temp,"iso-8859-1");
errorResponse = new String(temp,"US-ASCII");
errorResponse = new String(temp,"windows-1252");
errorResponse = new String(temp,"Windows-1251");
errorResponse = new String(temp,"GB2312");
errorResponse = new String(temp,"ISO-8859-2");
errorResponse = new String(temp,"Windows-1250");

I've debugged this code and checked that after each new assignment the string still contains the wrong characters.

So I believe that the backend server produces an ISO-8859-1 string and prints it literally in a UTF-8 HTTP body.

Again: I have no control over the backend code. Is there any way I can fix this string on the client side?

is the `\u00e1` shown there a string representation of received bytes values, or the actual literal text content of the received message? — Nyerguds, Apr 23 '18 at 16:30
That’s a JSON escape sequence. It’s perfectly valid, and if you use any compliant JSON parser, `Usu\u00e1rio` is identical to `Usuário`, because [00e1 is á](http://www.fileformat.info/info/unicode/char/00e1/index.htm). — VGR, Apr 23 '18 at 16:44
@VGR, I never said that it was invalid JSON; it is just a misrepresented string, and I want to store the correct value — Rafael Lima, Apr 23 '18 at 16:59
It is not misrepresented. It does contain the correct value. The escape sequence is a JSON compliant representation of the `á` character. — VGR, Apr 23 '18 at 17:01
It seems a bit strange to want to create your own JSON parser and then argue with the [JSON specification](http://www.json.org/) about equivalent string representations. (Also, you've been putting the server in a bind by asking for something other than UTF-8 when the new [JSON RFC](https://tools.ietf.org/html/rfc8259) says that JSON sent between systems should be encoded with UTF-8.) See this question: [How to parse JSON in Java](https://stackoverflow.com/questions/2591098/how-to-parse-json-in-java). — Tom Blodget, Apr 24 '18 at 00:24
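To illustrate the point made in the comments above: if the body really contains the six characters `\`, `u`, `0`, `0`, `e`, `1`, then every byte is plain ASCII, and no choice of charset in `new String(bytes, charset)` can change the result. A minimal sketch of that check follows; the class and method names are made up for illustration, and the sample body is a shortened version of the one quoted in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiEscapeDemo {

    // Returns true if every byte is in the 7-bit ASCII range.
    public static boolean isPureAscii(final byte[] bytes) {
        for (final byte b : bytes) {
            if (b < 0) { // values 0x80..0xFF are negative as a Java byte
                return false;
            }
        }
        return true;
    }

    // Returns true if all the given charsets decode the bytes to the same string.
    public static boolean decodesIdentically(final byte[] bytes, final Charset... charsets) {
        final String reference = new String(bytes, charsets[0]);
        for (final Charset charset : charsets) {
            if (!new String(bytes, charset).equals(reference)) {
                return false;
            }
        }
        return true;
    }

    public static void main(final String... args) {
        // The body holds a backslash, 'u' and four hex digits, not the character á itself.
        final byte[] body = "{\"message\": \"Usu\\u00e1rio\"}".getBytes(StandardCharsets.US_ASCII);

        System.out.println(isPureAscii(body)); // true
        System.out.println(decodesIdentically(body,
                StandardCharsets.UTF_8,
                StandardCharsets.ISO_8859_1,
                StandardCharsets.US_ASCII)); // true
    }
}
```

Under that assumption, the escape has to be resolved at the JSON level (by a JSON parser), not at the byte-decoding level.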

score 2 · Accepted Answer · answered Apr 23 '18 at 16:35

This is just an idea, but I get the impression that your server actually sends these characters:

\
u
0
0
e
1

instead of "á". So I've written the below prototype, and I hasten to say that this is absolutely not production quality code. But could you try what happens if you feed the JSON from your server into it?

package com.severityone.test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharTest {

    public static void main(final String... args) {

        final String json = "{\"message\": \"Usu\\u00e1rio n\\u00e3o encontrado\", \"status\": \"fail\"}";
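        // Match a literal backslash, 'u', and exactly four hexadecimal digits.
        final Matcher matcher = Pattern.compile("\\\\u([0-9a-fA-F]{4})").matcher(json);
        final StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Turn the hex digits into the UTF-16 code unit they name.
            final String decoded = String.valueOf((char) Integer.parseInt(matcher.group(1), 16));
            // quoteReplacement keeps a decoded '\' or '$' from being treated as a replacement directive.
            matcher.appendReplacement(result, Matcher.quoteReplacement(decoded));
        }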
        matcher.appendTail(result);
        System.out.println(result.toString());
    }
}

The program gives the following result:

{"message": "Usuário não encontrado", "status": "fail"}

Did you actually test this? It seems to me this will take unicode code points `e1` and `e3`, rather than convert it from the correct character set. — Nyerguds, Apr 23 '18 at 16:37
Yes, it's a direct copy-and-paste from Netbeans. It's more to ascertain whether my impression is correct, and as stated, not as a perfect solution. — SeverityOne, Apr 23 '18 at 16:41
Ohh, I see. The identification of these values as "iso-8859-1" by OP was actually _wrong_; Unicode code points 0x80-0xFF simply seem to largely correspond to iso-8859-1 encoded bytes. — Nyerguds, Apr 23 '18 at 16:51
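The distinction in the last comment can be seen directly: `\u00e1` names the Unicode code point U+00E1, and only when that character is encoded to bytes does the charset matter. The sketch below (class and helper names are hypothetical) prints the bytes of the same character in ISO-8859-1 and in UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class CodePointDemo {

    // Formats bytes as space-separated upper-case hex, e.g. "C3 A1".
    public static String hex(final byte[] bytes) {
        final StringBuilder sb = new StringBuilder();
        for (final byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(final String... args) {
        // Build the character from its code point, so no source-file encoding is involved.
        final String s = new String(Character.toChars(0x00E1));

        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // E1
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 A1
    }
}
```

The ISO-8859-1 byte happens to equal the code point value, which is why the escape looked like an ISO-8859-1 string; the UTF-8 encoding of the same character takes two bytes.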
