I'm having trouble with UTF-8 strings from a JSON response. When I send a POST request to an API that returns UTF-8 JSON, some characters are not decoded correctly and the text gets corrupted.

I set the charset to UTF-8 in the request and most accented characters are converted correctly, but an "ã" fails to decode at one particular point in my string.

Example:

I have the string "entendeu que a pretensão de complementação de ações buscada pelos adquirentes de linhas telefônicas deve ter como referência o valor patrimonial da aç��o apurado com base no balancete"

In the example above, the problematic character is converted correctly in the word "complementação", but in "aç��o" it is corrupted.

Can someone help me? Have you already seen it?

Thanks!

EDIT: I used Fiddler 4 to sniff my request, and the text in the inspector is fine, but in Visual Studio 2017 the string is corrupted only at that point.

EDIT 2: I used Fiddler's HexView and saw that every "ã" arrives as the bytes 0xC3 0xA3, both the correct ones and the corrupted one. I suspect something is wrong with the library I'm using to perform the web request. I will test some other libraries to see if the problem still occurs. Thank you all for your help!

EDIT 3: Found this link: http://www.fileformat.info/info/unicode/char/e3/index.htm Does it help anyone understand my problem?

EDIT 4: I searched a bit more about the hex code I'm receiving and found this: https://www.fileformat.info/info/unicode/char/c3a3/index.htm. I think the problem happens when the code tries to convert C3A3 from UTF-8 to UTF-16 (the string encoding for C# in Visual Studio), but I don't know how to do this conversion properly. I'll keep digging, and if I find anything else I'll update here.
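
A note on the two links: the byte pair C3 A3 is the UTF-8 encoding of the code point U+00E3 ("ã"), which is a different thing from the code point U+C3A3. The following is a minimal sketch, not the asker's code, that decodes those two bytes with the standard `Encoding.UTF8`; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

class Utf8ByteDemo
{
    // Decodes the two bytes reported by Fiddler's HexView as UTF-8.
    public static string DecodeA() =>
        Encoding.UTF8.GetString(new byte[] { 0xC3, 0xA3 });

    static void Main()
    {
        string s = DecodeA();
        // The result is a single UTF-16 code unit, U+00E3.
        Console.WriteLine($"{s.Length} char(s), U+{(int)s[0]:X4}");
        // prints "1 char(s), U+00E3"
    }
}
```

When the two bytes are handed to the decoder together, the UTF-8 to UTF-16 conversion itself needs no special handling.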

  • Maybe [this](https://stackoverflow.com/questions/33102777/how-to-get-a-utf-8-json) will help? – krobelusmeetsyndra Dec 27 '18 at 19:17
  • This is also a good read: [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) – Jens Granlund Dec 27 '18 at 19:58
  • @krobelusmeetsyndra Thanks, but I'm already using Json.Net to deserialize the JSON – Yago Carvalho Dec 28 '18 at 13:23
  • It's very likely that the JSON isn't encoded in UTF-8 to begin with, but in some ANSI code page. Check the hex codes in the problematic characters. – Alejandro Dec 28 '18 at 13:31
  • @Alejandro I just checked Fiddler's HexView and the hex code is 0xC3A3 for both occurrences of "ã", but the encoding error occurs in only one of them – Yago Carvalho Dec 28 '18 at 16:32
  • @JensGranlund I read the article and it clarified a lot, but it didn't solve my problem. The hex code looks fine, but somehow it doesn't work in that particular case – Yago Carvalho Dec 28 '18 at 16:44
  • Which library are you using for the WebRequest? – Jens Granlund Dec 29 '18 at 19:13

1 Answer

I tried encoding the string with different encodings and decoding it back as UTF-8, but I could not reproduce the same decoding error.

That leads me to think there's something wrong with the original string, since the word "complementação" decodes correctly. Could the word "aç��o" be wrong or misspelled somehow?

Edit: added some tests

I encoded the string with each of these encodings and decoded the bytes as UTF-8; none of them converted some "ã" correctly but not others:

var str = "entendeu que a pretensão de complementação de ações" + 
" buscada pelos adquirentes de linhas telefônicas deve ter como" +
" referência o valor patrimonial da ação apurado com base no balancete";

WriteAsUtf8(str, Encoding.UTF8);
WriteAsUtf8(str, Encoding.UTF7);
WriteAsUtf8(str, Encoding.ASCII);
WriteAsUtf8(str, Encoding.GetEncoding("ISO-8859-1"));
WriteAsUtf8(str, Encoding.GetEncoding("ISO-8859-2"));
WriteAsUtf8(str, Encoding.GetEncoding("ISO-8859-3"));
WriteAsUtf8(str, Encoding.GetEncoding("ISO-8859-4"));
WriteAsUtf8(str, Encoding.GetEncoding("ISO-8859-5"));
WriteAsUtf8(str, Encoding.GetEncoding("ISO-8859-6"));
WriteAsUtf8(str, Encoding.GetEncoding("ISO-8859-7"));
WriteAsUtf8(str, Encoding.GetEncoding("ISO-8859-8"));
WriteAsUtf8(str, Encoding.GetEncoding("ISO-8859-9"));
WriteAsUtf8(str, Encoding.GetEncoding("ISO-8859-13"));
WriteAsUtf8(str, Encoding.GetEncoding("ISO-8859-15"));
WriteAsUtf8(str, Encoding.GetEncoding("windows-1250"));
WriteAsUtf8(str, Encoding.GetEncoding("windows-1251"));
WriteAsUtf8(str, Encoding.GetEncoding("windows-1252"));
WriteAsUtf8(str, Encoding.GetEncoding("windows-1253"));
WriteAsUtf8(str, Encoding.GetEncoding("windows-1254"));
WriteAsUtf8(str, Encoding.GetEncoding("windows-1255"));
WriteAsUtf8(str, Encoding.GetEncoding("windows-1256"));
WriteAsUtf8(str, Encoding.GetEncoding("windows-1257"));
WriteAsUtf8(str, Encoding.GetEncoding("windows-1258"));

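// Encodes the text with the given encoding, then decodes those bytes as UTF-8,
// to show what a UTF-8 reader would see if the sender had used that encoding.
void WriteAsUtf8(string text, Encoding encoding)
{
    var bytes = encoding.GetBytes(text);
    // PadRight aligns the output column and tolerates names longer than the column width.
    var description = $"{encoding.EncodingName}:".PadRight(36);
    Console.WriteLine($"{description}{Encoding.UTF8.GetString(bytes)}");
}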

These were the results:

Unicode (UTF-8):               entendeu que a pretensão de complementação de ações buscada pelos adquirentes de linhas telefônicas deve ter como referência o valor patrimonial da ação apurado com base no balancete
Unicode (UTF-7):               entendeu que a pretens+AOM-o de complementa+AOcA4w-o de a+AOcA9Q-es buscada pelos adquirentes de linhas telef+APQ-nicas deve ter como refer+AOo-ncia o valor patrimonial da a+AOcA4w-o apurado com base no balancete
US-ASCII:                      entendeu que a pretens?o de complementa??o de a??es buscada pelos adquirentes de linhas telef?nicas deve ter como refer?ncia o valor patrimonial da a??o apurado com base no balancete
Western European (ISO):        entendeu que a pretens�o de complementa��o de a��es buscada pelos adquirentes de linhas telef�nicas deve ter como refer�ncia o valor patrimonial da a��o apurado com base no balancete
Central European (ISO):        entendeu que a pretensao de complementa�ao de a�oes buscada pelos adquirentes de linhas telef�nicas deve ter como referencia o valor patrimonial da a�ao apurado com base no balancete
Latin 3 (ISO):                 entendeu que a pretensao de complementa�ao de a�oes buscada pelos adquirentes de linhas telef�nicas deve ter como refer�ncia o valor patrimonial da a�ao apurado com base no balancete
Baltic (ISO):                  entendeu que a pretens�o de complementac�o de ac�es buscada pelos adquirentes de linhas telef�nicas deve ter como referencia o valor patrimonial da ac�o apurado com base no balancete
Cyrillic (ISO):                entendeu que a pretensao de complementacao de acoes buscada pelos adquirentes de linhas telefonicas deve ter como referencia o valor patrimonial da acao apurado com base no balancete
Arabic (ISO):                  entendeu que a pretensao de complementacao de acoes buscada pelos adquirentes de linhas telefonicas deve ter como referencia o valor patrimonial da acao apurado com base no balancete
Greek (ISO):                   entendeu que a pretensao de complementacao de acoes buscada pelos adquirentes de linhas telefonicas deve ter como referencia o valor patrimonial da acao apurado com base no balancete
Hebrew (ISO-Visual):           entendeu que a pretensao de complementacao de acoes buscada pelos adquirentes de linhas telefonicas deve ter como referencia o valor patrimonial da acao apurado com base no balancete
Turkish (ISO):                 entendeu que a pretens�o de complementa��o de a��es buscada pelos adquirentes de linhas telef�nicas deve ter como refer�ncia o valor patrimonial da a��o apurado com base no balancete
Estonian (ISO):                entendeu que a pretensao de complementacao de ac�es buscada pelos adquirentes de linhas telefonicas deve ter como referencia o valor patrimonial da acao apurado com base no balancete
Latin 9 (ISO):                 entendeu que a pretens�o de complementa��o de a��es buscada pelos adquirentes de linhas telef�nicas deve ter como refer�ncia o valor patrimonial da a��o apurado com base no balancete
Central European (Windows):    entendeu que a pretensao de complementa�ao de a�oes buscada pelos adquirentes de linhas telef�nicas deve ter como referencia o valor patrimonial da a�ao apurado com base no balancete
Cyrillic (Windows):            entendeu que a pretensao de complementacao de acoes buscada pelos adquirentes de linhas telefonicas deve ter como referencia o valor patrimonial da acao apurado com base no balancete
Western European (Windows):    entendeu que a pretens�o de complementa��o de a��es buscada pelos adquirentes de linhas telef�nicas deve ter como refer�ncia o valor patrimonial da a��o apurado com base no balancete
Greek (Windows):               entendeu que a pretensao de complementacao de acoes buscada pelos adquirentes de linhas telefonicas deve ter como referencia o valor patrimonial da acao apurado com base no balancete
Turkish (Windows):             entendeu que a pretens�o de complementa��o de a��es buscada pelos adquirentes de linhas telef�nicas deve ter como refer�ncia o valor patrimonial da a��o apurado com base no balancete
Hebrew (Windows):              entendeu que a pretens?o de complementa??o de a??es buscada pelos adquirentes de linhas telef?nicas deve ter como refer?ncia o valor patrimonial da a??o apurado com base no balancete
Arabic (Windows):              entendeu que a pretens?o de complementa�?o de a�?es buscada pelos adquirentes de linhas telef�nicas deve ter como refer�ncia o valor patrimonial da a�?o apurado com base no balancete
Baltic (Windows):              entendeu que a pretens?o de complementa??o de a?�es buscada pelos adquirentes de linhas telef?nicas deve ter como refer?ncia o valor patrimonial da a??o apurado com base no balancete
Vietnamese (Windows):          entendeu que a pretens?o de complementa�?o de a�?es buscada pelos adquirentes de linhas telef�nicas deve ter como refer�ncia o valor patrimonial da a�?o apurado com base no balancete
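
The tests above re-encode a complete string, so they do not cover the case where correct UTF-8 bytes are decoded in separate pieces. As an assumption only, not a confirmed diagnosis of the library used in the question: if a response is read in fixed-size buffers and each buffer is passed to `Encoding.UTF8.GetString` on its own, a two-byte sequence such as C3 A3 that straddles a buffer boundary turns into two replacement characters, while every other "ã" decodes fine. The sketch below reproduces that with hypothetical names and a deliberately tiny chunk size, and contrasts it with a stateful `Decoder`.

```csharp
using System;
using System.Text;

class ChunkedDecodeDemo
{
    // Decodes every chunk independently. A multi-byte sequence that is
    // split across two chunks cannot be reassembled.
    public static string DecodePerChunk(byte[] data, int chunkSize)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < data.Length; i += chunkSize)
        {
            int count = Math.Min(chunkSize, data.Length - i);
            sb.Append(Encoding.UTF8.GetString(data, i, count));
        }
        return sb.ToString();
    }

    // Uses a single stateful Decoder, which keeps an incomplete trailing
    // sequence and completes it with the bytes of the next chunk.
    public static string DecodeWithDecoder(byte[] data, int chunkSize)
    {
        var decoder = Encoding.UTF8.GetDecoder();
        var chars = new char[Encoding.UTF8.GetMaxCharCount(chunkSize)];
        var sb = new StringBuilder();
        for (int i = 0; i < data.Length; i += chunkSize)
        {
            int count = Math.Min(chunkSize, data.Length - i);
            int written = decoder.GetChars(data, i, count, chars, 0);
            sb.Append(chars, 0, written);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // "ação" is the bytes 61 C3 A7 C3 A3 6F; a chunk size of 4
        // (illustrative) cuts between C3 and A3.
        byte[] data = Encoding.UTF8.GetBytes("a\u00E7\u00E3o");

        string perChunk = DecodePerChunk(data, 4);
        string stateful = DecodeWithDecoder(data, 4);

        Console.WriteLine(perChunk == "a\u00E7\uFFFD\uFFFDo"); // True
        Console.WriteLine(stateful == "a\u00E7\u00E3o");       // True
    }
}
```

If this turns out to be the cause, wrapping the response stream in a `StreamReader` with `Encoding.UTF8`, or decoding only after all bytes have been read, would avoid the split.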
Jens Granlund
  • Thanks for the reply! I used Fiddler to sniff the request and it looks fine there, both the text and the HexView (every "ã" is 0xC3A3, the correct ones and the corrupted one). I'm starting to think something is wrong with the library I use to perform the web request. I will test some others to check whether the problem still happens. – Yago Carvalho Dec 28 '18 at 16:37
  • You're welcome, sorry it didn't help, I hope you find a solution. – Jens Granlund Dec 28 '18 at 18:02