Similar to this post, is there a way to decode some of the seemingly malformed UTF-8 data returned when downloading a copy of my Facebook data?
Looking at a particular example: one of my chats contains a sent message consisting only of the gem emoji 💎. Opening the message_1.json file with vim and looking at the appropriate entry shows the text "\u00f0\u009f\u0092\u008e". However, this differs from what my terminal shows (macOS):
$ jq '.messages[0].content' message_1.json
"ð" # stackoverflow seems to be truncating this string, there are 3 extra chars which show as spaces
$ jq '.messages[0].content' message_1.json > utf
$ cat utf
"ð"
$ od -h utf
0000000 c322 c2b0 c29f c292 228e 000a
0000013
$ wc utf
1 1 11 utf
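For reference, the escape sequence from the JSON file can be reproduced in python3; json.loads turns it into four characters in the U+0080–U+00FF range, and re-encoding those as UTF-8 matches the bytes od reports (a sketch of what I believe is happening, not a confirmed explanation):

```python
import json

# The string exactly as it appears in message_1.json
raw = '"\\u00f0\\u009f\\u0092\\u008e"'

decoded = json.loads(raw)
print([hex(ord(c)) for c in decoded])  # ['0xf0', '0x9f', '0x92', '0x8e']

# Each of these code points re-encodes to two UTF-8 bytes, matching the
# eight emoji bytes in the od -h output (plus the quotes and newline).
print(decoded.encode('utf-8').hex())   # c3b0c29fc292c28e
```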
This also differs from the output produced by directly pasting the emoji into a file:
$ echo '💎' > gem.txt
$ cat gem.txt
💎
$ od -h gem.txt
0000000 9ff0 8e92 000a
0000005
$ wc gem.txt
1 1 5 gem.txt
And I get seemingly different results when reading these two files with python3:
$ python3
Python 3.7.3 (default, Dec 13 2019, 19:58:14)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('gem.txt', 'r') as f:
... gem = f.read()
...
>>> gem
'💎\n'
>>> len(gem)
2
>>> ord(gem[0])
128142
>>>
>>>
>>> with open('utf', 'r') as f:
... utf = f.read()
...
>>> utf
'"ð\x9f\x92\x8e"\n'
>>> len(utf)
7
>>> for char in utf:
... print(ord(char))
...
34
240
159
146
142
34
10
>>>
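Experimenting a bit further, a Latin-1 round trip does seem to recover the emoji, though I'm not sure this is the right general fix (a sketch, assuming every mis-decoded code point falls below U+0100):

```python
# The four characters Python reads from the downloaded file
mojibake = '\u00f0\u009f\u0092\u008e'

# Reinterpret each code point as a single raw byte, then decode
# those bytes as UTF-8.
fixed = mojibake.encode('latin-1').decode('utf-8')
print(fixed)            # 💎
print(hex(ord(fixed)))  # 0x1f48e
```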
I have a few questions based on this behavior:
- Is the data returned by Facebook encoded incorrectly? This page shows the proper Unicode code point for the gem emoji to be U+1F48E, and the corresponding UTF-8 representation 0xF0 0x9F 0x92 0x8E matches the byte output from od.
- Is there a way for me to parse the string returned by Facebook? The previous question seems to recommend a regular expression to transform the text before parsing; is this required?
- The gem.txt file had a length of 5 bytes: subtracting the newline, that leaves 4 bytes to represent the emoji. This makes sense to me, since its UTF-8 representation requires 4 bytes. Why does the utf file contain 11 bytes (presumably 10 without the newline)?
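My current guess for the byte counts in the last question, sketched in python3 (assuming the file really does contain the escaped code points re-encoded as UTF-8):

```python
# The emoji itself: one code point, four UTF-8 bytes.
gem = '\U0001F48E'
print(len(gem.encode('utf-8')))       # 4 -> gem.txt: 4 + 1 newline = 5 bytes

# The mis-decoded version: four code points in the U+0080-U+00FF range,
# each needing two UTF-8 bytes.
mojibake = '\u00f0\u009f\u0092\u008e'
print(len(mojibake.encode('utf-8')))  # 8 -> utf: 2 quotes + 8 + 1 newline = 11 bytes
```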