Decode seemingly malformed utf8 representation of Unicode pointer from Facebook data export

Question

Similar to this post, is there a way to decode some of the seemingly malformed UTF-8 data returned from downloading a copy of my Facebook data?

Looking at a particular example, in one of my chats I have a sent message containing only the emoji 💎. Opening the message_1.json file with vim and looking at the appropriate entry shows the text "\u00f0\u009f\u0092\u008e". However, this differs from what my terminal shows (macOS):

$ jq '.messages[0].content' message_1.json
"ð"  # stackoverflow seems to be truncating this string, there are 3 extra chars which show as spaces
$ jq '.messages[0].content' message_1.json > utf
$ cat utf
"ð"
$ od -h utf
0000000      c322    c2b0    c29f    c292    228e    000a
0000013
$ wc utf
       1       1      11 utf

This also differs from the output I get when pasting the emoji directly into a file:

$ echo '💎' > gem.txt
$ cat gem.txt
💎
$ od -h gem.txt
0000000      9ff0    8e92    000a
0000005
$ wc gem.txt
       1       1       5 gem.txt

I also get seemingly different information when reading these two files with python3:

$ python3
Python 3.7.3 (default, Dec 13 2019, 19:58:14)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('gem.txt', 'r') as f:
...   gem = f.read()
...
>>> gem
'💎\n'
>>> len(gem)
2
>>> ord(gem[0])
128142
>>>
>>>
>>> with open('utf', 'r') as f:
...   utf = f.read()
...
>>> utf
'"ð\x9f\x92\x8e"\n'
>>> len(utf)
7
>>> for char in utf:
...   print(ord(char))
...
34
240
159
146
142
34
10
>>>

I have a few questions based on this behavior:

1. Is the data returned by Facebook encoded incorrectly? This page shows the proper Unicode code point for the gem emoji to be U+1F48E, and the corresponding UTF-8 representation 0xF0 0x9F 0x92 0x8E matches the byte output from od for gem.txt.
2. Is there a way for me to parse the string returned by Facebook? The previous question seems to recommend a regular expression to transform the text before parsing. Is this required?
3. gem.txt has a length of 5 bytes: subtracting the newline leaves 4 bytes for the emoji. This makes sense to me, as its UTF-8 representation requires 4 bytes. Why does the utf file contain 11 bytes (presumably 10 without the newline)?

score 4 · Accepted Answer · answered Jun 29 '20 at 06:04

It looks like the content of your JSON file indeed got mojibaked, i.e. misinterpreted with the wrong encoding.

>>> import json
>>> # That's how it should look:
>>> print(json.dumps('💎'))
"\ud83d\udc8e"
>>> # That's what you got:
>>> mojibaked = '💎'.encode('utf8').decode('latin1')
>>> print(json.dumps(mojibaked))
"\u00f0\u009f\u0092\u008e"

Check if you can fix how the JSON file is created. Latin-1 is the default in some tools/protocols. The convenient thing is that you can always decode any stream of bytes as Latin-1 without exceptions. It might corrupt your input though, as happens here.
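As a minimal sketch of that last point (the variable names here are illustrative only), every possible byte value decodes under Latin-1, whereas a strict UTF-8 decode of the same bytes fails:

```python
# Every byte value 0..255 maps to exactly one Latin-1 character,
# so decoding arbitrary bytes as Latin-1 cannot raise.
all_bytes = bytes(range(256))
as_latin1 = all_bytes.decode('latin1')
assert [ord(c) for c in as_latin1] == list(range(256))

# The same bytes are not valid UTF-8, so a strict decode fails.
try:
    all_bytes.decode('utf8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```

This is why a wrong Latin-1 decode tends to go unnoticed: it produces garbage silently instead of raising an error.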

If you can't fix the source, you might be able to recover by doing the encoding round-trip in reverse:

>>> mojibaked = json.loads('"\\u00f0\\u009f\\u0092\\u008e"')
>>> mojibaked
'ð\x9f\x92\x8e'
>>> mojibaked.encode('latin1').decode('utf8')
'💎'
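To apply the same round-trip to a whole export rather than a single string, something like the following might work. This is only a sketch: `fix_mojibake` is a hypothetical helper, not part of any library, and the sample document merely mirrors the `.messages[0].content` path from the question. It assumes every string in the file went through the same UTF-8-read-as-Latin-1 mix-up, and it leaves strings untouched when the round-trip fails.

```python
import json

def fix_mojibake(value):
    """Recursively undo a UTF-8-read-as-Latin-1 mix-up in parsed JSON."""
    if isinstance(value, str):
        try:
            return value.encode('latin1').decode('utf8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not mojibake of this kind: leave the string untouched.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans, None

# Hypothetical sample shaped like the structure shown in the question.
raw = '{"messages": [{"content": "\\u00f0\\u009f\\u0092\\u008e"}]}'
data = fix_mojibake(json.loads(raw))
print(data['messages'][0]['content'] == '\U0001F48E')  # True
```

Plain ASCII strings pass through unchanged, since ASCII encodes identically in Latin-1 and UTF-8.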

lenz


I think I addressed Questions 1 and 2. Answering 3 is left as an exercise to the OP. ;-) – lenz Jun 29 '20 at 07:39
Thanks for the reply. Unfortunately, short of becoming a Facebook developer and fixing this myself, I'm not sure it's likely to be fixed at the source. I was actually able to find a way of manually decoding the bytes into the emoji with the following expression: `bytes(list(map(ord, string))).decode('utf-8')`. Is this what encoding with `latin1` is doing internally? As for question 3, I believe I can solve the exercise! The size of 11 comprises 1 byte from the newline, 2 bytes from the quotes, and 8 remaining bytes, with 2 per `\uabcd` character above. – User Jun 30 '20 at 00:03
Yes, `bytes(map(ord, string))` is equivalent to `string.encode('latin1')`. That's because Latin-1 is a 1-to-1 mapping of the first 256 Unicode code points to the corresponding (unsigned) value of a single byte. – lenz Jun 30 '20 at 08:11
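The byte counts and the equivalence discussed in these comments can be checked with a short sketch. The string `jq_output` is an assumption that reconstructs what jq wrote to the `utf` file in the question, and the emoji is written as an escape to keep the snippet ASCII-only:

```python
import json

gem = '\U0001F48E'  # U+1F48E, the gem emoji
mojibaked = json.loads('"\\u00f0\\u009f\\u0092\\u008e"')

# The emoji itself is 4 bytes in UTF-8. The mojibaked string holds four
# characters in the U+0080..U+00FF range, each taking 2 bytes in UTF-8.
print(len(gem.encode('utf8')))        # 4
print(len(mojibaked.encode('utf8')))  # 8

# Reconstructed jq output: opening quote + 8 bytes + closing quote + newline.
jq_output = '"' + mojibaked + '"\n'
print(len(jq_output.encode('utf8')))  # 11

# Both expressions recover the original UTF-8 bytes of the emoji.
assert bytes(map(ord, mojibaked)) == mojibaked.encode('latin1') == gem.encode('utf8')
```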
