Facebook/messenger archive contains emoji that I am unable to parse

Question

I can't figure out how to decode Facebook's way of encoding emoji in the Messenger archive.

Hi everyone, I'm trying to write a handy utility in Python to explore Messenger's archive file.

The messages file is "badly encoded" JSON, as described in this other post: Facebook JSON badly encoded

Using .encode('latin1').decode('utf8') I've been able to deal with most characters such as "é" or "à" and display them correctly. But I'm having a hard time with emojis, as they seem to be encoded in a different way.

Example of a problematic emoji : \u00f3\u00be\u008c\u00ba

The encoding/decoding does not raise any errors, but Tkinter refuses to display what the function outputs and gives "_tkinter.TclError: character U+fe33a is above the range (U+0000-U+FFFF) allowed by Tcl". Tkinter is not the issue yet, though, because trying to display the same emoji in the console yields "ó¾º", which clearly isn't what's supposed to be displayed (it's supposed to be a crying face).

I've tried using the emoji library, but it doesn't seem to help:

>>> print(emoji.emojize("\u00f3\u00be\u008c\u00ba"))
'ó¾º'

How can I retrieve the proper emoji and display it? If it's not possible, how can I detect problematic emojis to maybe sanitize and remove them from the JSON in the first place?

Thank you in advance

Possibly related https://stackoverflow.com/questions/52228940/how-can-i-enable-support-for-emoji-in-tkinter-applications — snakecharmerb, Aug 18 '19 at 11:17
This is indeed related, but that's my next problem in the list, thanks for answering it in advance! — Rémi Heitz, Aug 18 '19 at 11:22

T S · Accepted Answer · 2019-08-30T13:08:42.413

.encode('latin1').decode('utf8') is correct: it results in the codepoint U+fe33a. This codepoint is in a Private Use Area (PUA), specifically Supplementary Private Use Area-A, so anyone can assign their own meaning to it. (Maybe Facebook wanted a crying face before there was one in Unicode, so they used the PUA?)
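A minimal sketch of that check, using only the standard library. The helper names `fix_mojibake` and `is_private_use` are made up for illustration; the second one relies on `unicodedata.category` returning "Co" for private-use codepoints, which could also serve to detect such characters before handing text to Tkinter.

```python
import unicodedata


def fix_mojibake(text: str) -> str:
    """Undo the Latin-1/UTF-8 mix-up found in the archive's JSON strings."""
    return text.encode("latin1").decode("utf8")


def is_private_use(ch: str) -> bool:
    """True if the character's Unicode general category is 'Co' (private use)."""
    return unicodedata.category(ch) == "Co"


raw = "\u00f3\u00be\u008c\u00ba"   # the example from the question
fixed = fix_mojibake(raw)          # a single character after decoding

print(f"U+{ord(fixed):X}")         # U+FE33A
print(is_private_use(fixed))       # True
```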

Googling for that char (https://www.google.com/search?q=) makes Google autocorrect it to U+1f62d ("😭"). Sadly, I have no idea how Google maps U+fe33a to U+1f62d.

Googling for U+fe33a site:unicode.org gives https://unicode.org/L2/L2010/10132-emojidata.pdf, which lists U+1F62D as the proposed official codepoint.

As that document from Unicode lists U+fe33a as a codepoint used by Google, I searched for android old emoji codepoints pua. Among other things, there were two actually usable results:

How to get Android emoji code point - the question links to :
- https://unicodey.com/emoji-data/table.htm - an HTML table that seems to be acceptably parsable
- and even better: https://github.com/google/mozc/blob/master/src/data/emoji/emoji_data.tsv - a tab-separated list that maps modern codepoints to legacy PUA codepoints and other information, like this:
  1F62D FE33A E72D E411[...]
https://github.com/googlei18n/noto-emoji/issues/115 - this thread links to:
- https://github.com/Crissov/noto-emoji/blob/legacy-pua/emoji_aliases.txt - a machine-readable document that translates legacy PUA codepoints to modern codepoints, like this:
  FE33A;1F62D # Google
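A rough sketch of how a list in that `LEGACY;MODERN # comment` form could be applied in Python. It only assumes the line format shown in the sample above; the function name `build_pua_table` is hypothetical, and the handling of multi-codepoint targets (separated by spaces or underscores) is a guess that should be checked against the real file.

```python
def build_pua_table(lines):
    """Build a str.translate() table from 'LEGACY;MODERN # comment' lines."""
    table = {}
    for line in lines:
        entry = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not entry or ";" not in entry:
            continue                            # skip blank or unrecognised lines
        legacy, modern = (part.strip() for part in entry.split(";", 1))
        try:
            # Assumption: a target may list several codepoints
            # separated by spaces or underscores.
            replacement = "".join(
                chr(int(cp, 16)) for cp in modern.replace("_", " ").split()
            )
            table[int(legacy, 16)] = replacement
        except ValueError:
            continue                            # not hexadecimal, ignore the line
    return table


# Only the sample line quoted above; a real run would read the whole file.
sample_lines = ["FE33A;1F62D # Google"]
table = build_pua_table(sample_lines)

fixed = "\u00f3\u00be\u008c\u00ba".encode("latin1").decode("utf8")
modern = fixed.translate(table)
print(f"U+{ord(modern):X}")                     # U+1F62D
```

Characters without an entry in the table pass through `str.translate` unchanged, so the same call can be run over every message string.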

I included my search queries in the answer because none of the results I found are in any way authoritative, but it should be enough to get your tool working :-)
