I requested and downloaded all my Messenger data from Facebook, and I want to parse the returned JSON to do some language analysis.

My problem is that, since I'm French, most of my conversations are in French and contain quite a lot of special characters (the same goes for smileys):

{
      "sender_name": "Antoine",
      "timestamp_ms": 1493930091160,
      "content": "Comment il est \u00c3\u00a9go\u00c3\u00afste :s",
      "type": "Generic"
    },

Here's an example. In Messenger it reads:

"Comment il est égoïste :s"

but if I decode the Unicode escapes as Unicode or UTF-8, all I get is:

"Comment il est Ã©goÃ¯ste :s"

And when I try to print them to the console, it crashes with a UnicodeEncodeError.

My attempts so far consist of a lot of (bad) regexes and replacements:

@staticmethod
def fix_special_char2(string):
    if isinstance(string, str):
        string = string.replace("'", ' ')
        string = string.replace('\u00e2\u0080\u0099', " ")
        string = string.replace('\u00c3\u00a9', 'e')
        string = string.replace('\u00c3\u00af', 'i')
        string = string.replace('\u00c3\u0080', 'a')
        string = string.replace('\u00c3\u0087', 'c')
        string = string.replace('\u00c3\u00aa', 'e')
        string = string.replace('\u00c3\u00a0', 'a')
        string = string.replace('\u00e2\u009d\u00a4\u00ef\u00b8\u008f', '<3')
        string = string.replace('\u00c3\u0089', 'e')
        string = string.replace('\u00e2\u0082\u00ac', ' euros')
        string = string.replace('\u00c5\u0093', 'oe')
        string = string.replace('\u00c3\u0082', 'a')
        string = string.replace('\u00c3\u008a', 'e')
        string = string.replace('\u00c3\u0089', 'e')
        string = string.replace('\u00e2\u009d\u00a4', '<3')
        string = string.replace('\u00c3\u0088', 'e')
        string = string.replace('\u00c3\u00a2', 'a')
        string = string.replace('\u00c3\u00b4', 'o')
        string = string.replace('\u00c3\u00a7', 'c')
        string = string.replace('\u00c3\u00a8', 'e')
        string = string.replace('\u00c2\u00b0', '°')
        string = string.replace('\u00c3\u00b9', 'u')
        string = string.replace('\u00c3\u00ae', 'i')
        string = re.sub('[^A-Za-z ]+', ' ', string)
        string = re.sub('\\u00f0(.*){18}', '', string)
        string = re.sub('\\u00f3(.*){18}', '', string)
        string = re.sub('([aeiu])\\1{1,}', '\\1', string)
        string = re.sub('([aA-zZ])\\1{2,}', '\\1\\1', string)
    return string

But if I could find the correct encoding, it would be far easier and faster (and prettier). There's also a problem with smileys: my regexes seem to fail to catch some of them (especially when they are chained).

Edit: This is more likely a duplicate of "Facebook JSON badly encoded" than of the one proposed :)

Maxime
  • @CBroe looking at the UTF-8 table, \u00c3 => Ã and \u00a9 => ©, so how am I supposed to get an 'é' from this? – Maxime Aug 22 '18 at 10:58
  • Yeah, you’re right, `é` should be encoded as `\u00e9`. Check out https://stackoverflow.com/questions/26614323/in-what-world-would-u00c3-u00a9-become-é, that deals with the same problem. – CBroe Aug 22 '18 at 11:09
  • Thanks, he saw something I did not: the UTF-8 bytes for é are \xc3\xa9, and the escapes generated by FB are \u00c3\u00a9, so what if I just replace every \u00 occurrence with \x? – Maxime Aug 22 '18 at 12:37

1 Answer

I would use the ftfy package to solve this problem: https://github.com/LuminosoInsight/python-ftfy

>>> from ftfy import fix_text
>>> fix_text(u'Comment il est \u00c3\u00a9go\u00c3\u00afste :s')
'Comment il est égoïste :s'

I had problems installing the current version, but it worked like a charm with pip install 'ftfy<5'
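
If installing a package is not an option, the specific mojibake in this example can probably be undone by hand. The sketch below assumes the export contains UTF-8 bytes that were each written out as a separate \u00XX escape, so the parsed string is effectively UTF-8 text misread as Latin-1. fix_mojibake is a hypothetical helper name, not something ftfy provides, and it only covers this one kind of damage:

```python
def fix_mojibake(text):
    """Undo UTF-8-read-as-Latin-1 mojibake; return text unchanged if it does not apply.

    Assumption: every character in a damaged string is below U+0100 and stands
    for exactly one byte of the original UTF-8 sequence.
    """
    try:
        # Map each character back to the single byte it came from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this kind (or already correct text): leave it alone.
        return text


print(fix_mojibake('Comment il est \u00c3\u00a9go\u00c3\u00afste :s'))
# prints: Comment il est égoïste :s
```

The try/except means a string that is already correct, such as 'égoïste', comes back unchanged instead of raising.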

James
  • I got the newest version working by running `pip install pytest-runner` before `pip install ftfy` – James Aug 22 '18 at 10:55
  • Thanks a lot, I will look into this. I am not marking it as accepted yet because my goal is rather to understand the logic behind the Facebook encoding, so I can write a simple decoding function without using a whole module, which seems overkill in this case (maybe the function I'm looking for is just buried in the ftfy code, I'll look for it) – Maxime Aug 22 '18 at 12:40
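
On the logic asked about in the last comment: the escapes in the sample look like the UTF-8 bytes of each character written out one by one as \u00XX, so a plausible decoding function re-encodes every parsed string as Latin-1 and decodes the result as UTF-8. The sketch below applies that to a whole parsed JSON structure. It assumes the entire export is damaged in this same way; fix_strings is a hypothetical name, and the sample is trimmed from the question:

```python
import json


def fix_strings(obj):
    """Recursively re-decode every string in a parsed JSON structure.

    Assumption: strings are UTF-8 text that was misread as Latin-1.
    Strings that do not fit that pattern are returned unchanged.
    """
    if isinstance(obj, str):
        try:
            return obj.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return obj
    if isinstance(obj, list):
        return [fix_strings(item) for item in obj]
    if isinstance(obj, dict):
        # Keys can be damaged too (e.g. participant names), so fix both sides.
        return {fix_strings(key): fix_strings(value) for key, value in obj.items()}
    # Numbers, booleans and None pass through untouched.
    return obj


# Raw string so the \u escapes reach json.loads exactly as they appear in the file.
raw = r'{"sender_name": "Antoine", "content": "Comment il est \u00c3\u00a9go\u00c3\u00afste :s"}'
message = fix_strings(json.loads(raw))
```

Because chained smileys are just longer UTF-8 byte sequences, they should be restored by the same round trip, with no regexes needed.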