I have requested and downloaded all my Messenger data from Facebook, and I want to parse the returned JSON to do some language analysis.
My problem is that, since I'm French, most of my conversations are in French and contain quite a lot of special characters (the same goes for smileys):
{
"sender_name": "Antoine",
"timestamp_ms": 1493930091160,
"content": "Comment il est \u00c3\u00a9go\u00c3\u00afste :s",
"type": "Generic"
},
Here's an example. In Messenger it reads:
"Comment il est égoïste :s"
but if I decode the Unicode escapes as Unicode or UTF-8, all I get is:
"Comment il est Ã©goÃ¯ste"
And when I try to write them to the console, it crashes with a UnicodeEncodeError.
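A minimal reproduction of the symptom, using only the sample message above (the variable names are mine, and the JSON is trimmed to the one relevant field). It seems to show that each accented letter comes back from the parser as two separate characters:

```python
import json

# Raw JSON text as it appears in the export (raw string keeps the \u escapes literal).
raw = r'{"content": "Comment il est \u00c3\u00a9go\u00c3\u00afste :s"}'

content = json.loads(raw)["content"]

# ascii() escapes non-ASCII characters, so this is safe to print on any console.
print(ascii(content))

# The text Messenger shows has 25 characters; the parsed string has 27,
# because "é" and "ï" each arrive as two characters (U+00C3 plus one more).
print(len(content))
```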
My attempts so far have consisted of a lot of (bad) regexes and replace calls:
    # Method of a class; needs "import re" at module level.
    @staticmethod
    def fix_special_char2(string):
        if isinstance(string, str):
            string = string.replace("'", ' ')
            string = string.replace('\u00e2\u0080\u0099', " ")
            string = string.replace('\u00c3\u00a9', 'e')
            string = string.replace('\u00c3\u00af', 'i')
            string = string.replace('\u00c3\u0080', 'a')
            string = string.replace('\u00c3\u0087', 'c')
            string = string.replace('\u00c3\u00aa', 'e')
            string = string.replace('\u00c3\u00a0', 'a')
            string = string.replace('\u00e2\u009d\u00a4\u00ef\u00b8\u008f', '<3')
            string = string.replace('\u00c3\u0089', 'e')
            string = string.replace('\u00e2\u0082\u00ac', ' euros')
            string = string.replace('\u00c5\u0093', 'oe')
            string = string.replace('\u00c3\u0082', 'a')
            string = string.replace('\u00c3\u008a', 'e')
            string = string.replace('\u00c3\u0089', 'e')
            string = string.replace('\u00e2\u009d\u00a4', '<3')
            string = string.replace('\u00c3\u0088', 'e')
            string = string.replace('\u00c3\u00a2', 'a')
            string = string.replace('\u00c3\u00b4', 'o')
            string = string.replace('\u00c3\u00a7', 'c')
            string = string.replace('\u00c3\u00a8', 'e')
            string = string.replace('\u00c2\u00b0', '°')
            string = string.replace('\u00c3\u00b9', 'u')
            string = string.replace('\u00c3\u00ae', 'i')
            string = re.sub('[^A-Za-z ]+', ' ', string)
            string = re.sub('\\u00f0(.*){18}', '', string)
            string = re.sub('\\u00f3(.*){18}', '', string)
            string = re.sub('([aeiu])\\1{1,}', '\\1', string)
            string = re.sub('([aA-zZ])\\1{2,}', '\\1\\1', string)
        return string
If I could find the correct encoding, it would be far easier and faster (and prettier). There's also a problem with smileys: my regexes seem to fail to catch some of them, especially when they are chained.
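For what it's worth, here is a sketch of what the "correct encoding" route might look like. It rests on an assumption I have not confirmed beyond the samples in this post: that the export contains UTF-8 bytes written out as if each byte were a Latin-1 code point. If that holds, encoding each string back to Latin-1 and decoding it as UTF-8 should restore the text. `fix_mojibake` is a hypothetical helper name, not something from a library:

```python
import json


def fix_mojibake(value):
    """Recursively re-decode strings in parsed JSON.

    Assumes each string holds UTF-8 bytes mis-read as Latin-1 characters.
    Strings that do not fit that assumption are returned unchanged.
    """
    if isinstance(value, str):
        try:
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans, None


# Sample message from this post, as raw JSON text.
raw = (
    r'{"sender_name": "Antoine", "timestamp_ms": 1493930091160, '
    r'"content": "Comment il est \u00c3\u00a9go\u00c3\u00afste :s", '
    r'"type": "Generic"}'
)

message = fix_mojibake(json.loads(raw))
fixed_content = message["content"]

# The heart sequence from my replace list, handled the same way.
heart = fix_mojibake("\u00e2\u009d\u00a4")
```

On this sample, `fixed_content` comes out as "Comment il est égoïste :s" and `heart` as the single character U+2764, so chained smileys would presumably not need regexes at all. Printing the result can still raise UnicodeEncodeError on a console whose encoding lacks those characters; that part is a separate issue.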
Edit: This might rather be a duplicate of "Facebook JSON badly encoded"
than of the one proposed :)