I've downloaded a JSON archive of my Facebook conversations, and I'm stuck on an odd encoding.

Example of json:

{
  "sender_name": "Micha\u00c5\u0082",
  "timestamp": 1411741499,
  "content": "b\u00c4\u0099d\u00c4\u0099",
  "type": "Generic"
},

It should be something like this:

{
  "sender_name": "Michał",
  "timestamp": 1411741499,
  "content": "będę",
  "type": "Generic"
},

I'm trying to deserialize it like this:

var result = File.ReadAllText(jsonPath, encodingIn);
JavaScriptSerializer serializer = new JavaScriptSerializer();
serializer.MaxJsonLength = Int32.MaxValue;
var conversation = serializer.Deserialize<Conversation>(System.Net.WebUtility.HtmlDecode(result));

Unfortunately the output is like this:

{
  "sender_name": "MichaÅ\u0082",
  "timestamp": 1411741499,
  "content": "bÄ\u0099dÄ\u0099",
  "type": "Generic"
},

Does anyone know how Facebook encodes the JSON? I've tried several methods without success.

Thanks for your help.

Lavoriel
  • Check [How to decode a Unicode character in a string](https://stackoverflow.com/questions/9303257/how-to-decode-a-unicode-character-in-a-string) – Fabjan Jun 11 '18 at 13:48
  • What is encodingIn? – Prany Jun 11 '18 at 14:22
  • I could not find your Latin characters with the encoding that you mentioned - http://etutorials.org/Programming/actionscript/Appendix+A.+Unicode+Escape+Sequences+for+Latin+1+Characters/ – Prany Jun 11 '18 at 15:43
  • That's not encoding, that is Unicode character escaping as defined in the JSON standard: http://www.json.org/ -> https://stackoverflow.com/a/27516892 as well as https://tools.ietf.org/html/rfc7159#section-7. The standard states that in the `\uXXXX` escape sequence, the hex digits `XXXX` correspond to a **Unicode code point**. And U+00C5 really is [LATIN CAPITAL LETTER A WITH RING ABOVE](https://www.fileformat.info/info/unicode/char/00c5/index.htm) so the JSON is being parsed and interpreted correctly. Thus the JSON must have been mangled somehow, can you show how you obtained it? – dbc Jun 11 '18 at 18:30
  • See also https://stackoverflow.com/questions/50008296/facebook-json-badly-encoded – asmaier Jul 03 '18 at 16:11

2 Answers

Here is the answer:

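// Requires: using System.Text;
private string DecodeString(string text)
{
    // ISO-8859-1 maps each character U+0000..U+00FF back to a single byte.
    Encoding targetEncoding = Encoding.GetEncoding("ISO-8859-1");
    // Convert any literal \uXXXX escape sequences still in the text into characters.
    var unescapeText = System.Text.RegularExpressions.Regex.Unescape(text);
    // Reinterpret the recovered bytes as the UTF-8 they originally were.
    return Encoding.UTF8.GetString(targetEncoding.GetBytes(unescapeText));
}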

I collected all the answers, combined them, and here we are. Thank you.

Lavoriel

Here is the Java equivalent of the answer above, for those interested. It seems to work well: you pass the entire message text into the method, and what comes back is the original message as it was in Messenger before you downloaded this JSON nightmare that Facebook puts out.

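// Needs java.nio.charset.Charset, java.nio.charset.StandardCharsets, and
// StringEscapeUtils from Apache Commons Lang (or Commons Text).
private String decodeString(String text) {
    Charset targetEncoding = Charset.forName("ISO-8859-1");
    // Turn literal unicode escape sequences into the characters they name.
    String  unescapeText   = StringEscapeUtils.unescapeJava(text);
    // Map each character back to one byte, then decode those bytes as UTF-8.
    return new String(unescapeText.getBytes(targetEncoding), StandardCharsets.UTF_8);
}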
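
If the text has already been through a JSON parser, the escapes such as `\u00c5\u0082` have already become characters, so the unescape step may not be needed. The following is a minimal, dependency-free sketch under that assumption. The class name `MojibakeFix` and method name `fixMojibake` are hypothetical and do not come from either answer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is illustrative, not from the answers.
final class MojibakeFix {

    // Assumes the text was already produced by a JSON parser, so each
    // escape has become one char in the range U+0000..U+00FF.
    static String fixMojibake(String text) {
        // ISO-8859-1 maps each of those chars back to the single byte it stands for.
        byte[] rawBytes = text.getBytes(StandardCharsets.ISO_8859_1);
        // Those bytes are the original UTF-8 encoding of the message.
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Micha" followed by U+00C5 and U+0082, as in the question's sender_name.
        String broken = "Micha\u00c5\u0082";
        System.out.println(fixMojibake(broken));
    }
}
```

Here the two characters U+00C5 and U+0082 become the bytes C5 82, which is the UTF-8 encoding of U+0142 (ł). One caveat: any character above U+00FF cannot be represented in ISO-8859-1 and would be replaced by `?`, so this sketch is only suitable for strings that are mangled throughout.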
Michael Sims