
I downloaded my Facebook data to run some statistics on it. Unfortunately, some characters don't display correctly; for example, 'é' becomes 'Ã©'. https://i.stack.imgur.com/7Sh7P.jpg

When I look at the JSON itself, I see this: https://i.stack.imgur.com/ylGBw.jpg

My C# code is pretty simple:

    // Needs System.IO, System.Text and Newtonsoft.Json (JsonConvert).
    // Reads the whole export file as UTF-8 and deserializes it.
    private static ListMessages ReadJSON(string path)
    {
        using (StreamReader r = new StreamReader(path, Encoding.GetEncoding("utf-8")))
        {
            string json = r.ReadToEnd();
            ListMessages messages = JsonConvert.DeserializeObject<ListMessages>(json);
            return messages;
        }
    }

I feel like I'm missing something simple, but I can't figure out what, so I hope someone can help or guide me on this.
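
For illustration, the symptom can be reproduced in isolation. This is only a sketch, under the assumption that the text was encoded as UTF-8 at some point and that each resulting byte was then treated as a separate ISO-8859-1 (Latin-1) character; `MojibakeDemo` and `Garble` are hypothetical names, not part of the code above.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: encodes the text as UTF-8, then decodes
    // every byte as one ISO-8859-1 character (the assumed mix-up).
    public static string Garble(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);   // U+00E9 becomes the bytes C3 A9
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    public static void Main()
    {
        // "\u00e9" is 'é'; the garbled form matches the two escapes seen in the JSON.
        Console.WriteLine(Garble("\u00e9") == "\u00c3\u00a9"); // prints True
    }
}
```

If that assumption holds, the file is read correctly and the garbling is already present in the escape sequences themselves.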

CodeNotFound
Shrin
  • Strings in .NET *are* Unicode, specifically UTF16. UTF8 is *not* the same as Unicode, it's just one of the Unicode encodings. If you have such issues it's probably because you used the wrong encoding to load the file. – Panagiotis Kanavos May 02 '18 at 14:29
  • BTW [the default](https://referencesource.microsoft.com/#mscorlib/system/io/streamreader.cs,137) for StreamReader is to use UTF8 *and* try to detect the encoding from BOMs. Remove `Encoding.GetEncoding("utf-8")` completely. Do you still have issues? – Panagiotis Kanavos May 02 '18 at 14:32
  • Finally, `\u009f` etc aren't Unicode characters. They are *escape sequences*. 6 individual characters that are treated as one Unicode character. The same way that `\n` is treated as a newline – Panagiotis Kanavos May 02 '18 at 14:35
  • Another thing, those escape sequences do *not* correspond to French characters. You don't even need escape sequences to type [è](http://www.fileformat.info/info/unicode/char/e8/index.htm). SO runs on .NET, which is why I can simply type the character without escaping. The Unicode escape sequence for è is `\u00E8` anyway. – Panagiotis Kanavos May 02 '18 at 14:37
  • Where did this file come from? You should probably ask whoever created it to create a *real* UTF8 file. This looks like an attempt to create a 7-bit ANSI file with all non-ANSI characters replaced by escape sequences. – Panagiotis Kanavos May 02 '18 at 14:47
  • Thank you for the response. I have removed `Encoding.GetEncoding("utf-8")` but I still have the issue. The file comes directly from Facebook, via the "download your facebook data" feature, where I chose JSON instead of HTML. And yes, for me, 'é' should be "\u00E9" instead of the "\u00c3\u00a9" that Facebook sends me in the file. – Shrin May 02 '18 at 15:24
  • It's a well known issue, https://stackoverflow.com/q/26614323/60761 – H H May 02 '18 at 19:12
  • Is there an equivalent of ftfy for C#, or should I use Python? – Shrin May 03 '18 at 07:39
  • I've found a solution to my problem here: https://stackoverflow.com/questions/33563179/cant-convert-httpresponsemessage-with-utf8-encoding – Shrin Jul 10 '18 at 08:23
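
The comments above suggest that each escape such as `\u00c3` or `\u00a9` holds one byte of a UTF-8 sequence rather than a real character. One common way to undo that, sketched here under that assumption and not necessarily what the linked answer does, is to turn each character back into a byte with ISO-8859-1 and decode the bytes as UTF-8. `MojibakeFixer` and `Repair` are hypothetical names.

```csharp
using System;
using System.Text;

public static class MojibakeFixer
{
    // ISO-8859-1 maps U+0000..U+00FF one-to-one onto the bytes 00..FF,
    // including the 0x80..0x9F range (e.g. \u009f).
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Hypothetical helper: assumes every char in the input is <= U+00FF.
    // Characters above that range would be replaced with '?' by GetBytes.
    public static string Repair(string garbled)
    {
        byte[] rawBytes = Latin1.GetBytes(garbled);   // "\u00c3\u00a9" becomes C3 A9
        return Encoding.UTF8.GetString(rawBytes);     // C3 A9 decodes to U+00E9
    }

    public static void Main()
    {
        string fromJson = "\u00c3\u00a9";             // value as it appears after deserialization
        Console.WriteLine(Repair(fromJson) == "\u00e9"); // prints True
    }
}
```

This would be applied to each string after `JsonConvert.DeserializeObject` has run. It should only be applied to strings known to be garbled this way, since text that is already correct and contains characters above U+00FF would be damaged.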

0 Answers