Does anyone know how Facebook encodes emoji with high-surrogate pairs in the Graph API?

Low surrogate pairs seem fine. For example, ❤️ (HEAVY BLACK HEART, though it is red in iOS/OSX, link to image if you can't see the emoji) comes through as \u2764\ufe0f which appears to match the UTF-16 hex codes / "Formal Unicode Notation" shown here at iemoji.com.

And indeed, in Ruby when parsing the JSON output from the API:

ActiveSupport::JSON.decode('"\u2764\ufe0f"')

you correctly get:

"❤️"

However, to pick another emoji, 💤 (SLEEPING SYMBOL, link to image here), Facebook returns \udbba\udf59. This does not seem to correspond to anything I can find in any Unicode resource, for example this one at iemoji.com.

And when I attempt to decode in Ruby using the same method above:

ActiveSupport::JSON.decode('"\udbba\udf59"')

I get:

""

Any idea what's going on here?

philoye
  • `\u2764\ufe0f` isn't a surrogate pair, it's a normal Basic Multilingual Plane character followed by a variation selector. Using a variant to try to distinguish when emoji should be rendered as colour icons is an ugly new addition in Unicode 6.2. `\udbba\udf59` does seem to be an error though... the corresponding code point U+FEB59 is a Private Use character that you shouldn't be getting. – bobince Nov 18 '13 at 11:54
  • There are no "high-surrogate pairs" and no "low surrogate pairs". Valid surrogate pairs (in UTF-16) are composed of one high surrogate followed by one low surrogate (in that order). None of the characters in your first example is a surrogate. – R. Martinho Fernandes Nov 18 '13 at 12:42
  • I clearly don't understand this well enough to use the right language. Any emoji character where the unicode looks like `U+2764` works. But one that looks like `U+1F4A4` (note the 1) does not. – philoye Nov 18 '13 at 20:10
  • @bobince U+FEB59 is a clue. It is the "Google" encoding according to this page on [unicode.org](http://www.unicode.org/~scherer/emoji4unicode/snapshot/utc.html#e-B59). Is that the answer then, for "proposed" encodings per that table, Facebook is using the "Google" version instead? – philoye Nov 19 '13 at 07:08
  • @philoye: Ahh! Good catch, I had completely forgotten about the temporary code points in that proposal... weird that anyone is actually using it, but I guess Facebook must just have been implementing emoji at that uncertain period before Unicode 6.0. You can grab the mapping data from [emoji4unicode](http://emoji4unicode.googlecode.com/svn/trunk/data/emoji4unicode.xml) – bobince Nov 19 '13 at 10:00
  • @bobince Sweet. Thanks for the mapping data, though I confess I don't understand how you get from `\udbba\udf59` to `U+FEB59`. Is there a resource you know of that can educate me on the wonders of this particular aspect of Unicode? – philoye Nov 19 '13 at 10:12
  • @philoye: Ah, *that's* the bit that's a surrogate pair: the two UTF-16 code units 0xDBBA,0xDF59 represent the code point U+FEB59. JavaScript/JSON strings are based on UTF-16 code units, not characters (unfortunately). Ruby strings don't have that problem, so decoding the JSON `\uDBBA\uDF59` to a single character `"\u{FEB59}"` is in fact the right thing to do... you then would just need to fix up the weird Google emoji into standard Unicode ones. Your choice whether or not you care about including a U+FE0F variant selector. – bobince Nov 19 '13 at 10:23
  • @bobince Thanks for your help on this! I don't understand how one converts code units `0xDBBA,0xDF59` into code point `U+FEB59`. Is there a Ruby incantation I can use? I understand that I would then need to convert that to the proper `U+1F4A4` using a mapping table. – philoye Nov 21 '13 at 00:56
  • @philoye: the UTF-16 surrogate pair to char conversion is fairly straightforward: `0x10000+((lead&0x3FF)<<10)+(trail&0x3FF)`, see [wiki](http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF) for background. But any compliant JSON decoder should have already converted the string literal `"\uDBBA\uDF59"` to a single character for you (in whatever encoding you are using for strings, presumably UTF-8). You then only need to do a normal string replace (eg from `"\u{FEB59}"` to `"\u{1F4A4}"`). – bobince Nov 21 '13 at 11:15
  • Sweet, thanks @bobince for all your help on this. – philoye Nov 21 '13 at 21:42
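
For the "Ruby incantation" asked about above, here is a minimal sketch of the surrogate-pair arithmetic from the comments. The method name `surrogate_pair_to_codepoint` is made up for illustration; it is not part of Ruby or ActiveSupport. The second half uses the standard library `JSON.parse` (rather than `ActiveSupport::JSON.decode`) to show that the decoded string is not actually empty: it holds one private-use character that most fonts cannot display.

```ruby
require 'json'

# Combine a UTF-16 surrogate pair (lead = high surrogate, trail = low
# surrogate) into a single code point. Hypothetical helper, for illustration.
def surrogate_pair_to_codepoint(lead, trail)
  unless (0xD800..0xDBFF).cover?(lead) && (0xDC00..0xDFFF).cover?(trail)
    raise ArgumentError, 'not a valid surrogate pair'
  end

  # Each surrogate carries 10 bits; the result is offset by 0x10000 because
  # surrogate pairs only encode code points above the Basic Multilingual Plane.
  0x10000 + ((lead & 0x3FF) << 10) + (trail & 0x3FF)
end

codepoint = surrogate_pair_to_codepoint(0xDBBA, 0xDF59)
puts format('U+%04X', codepoint) # U+FEB59

# A JSON decoder performs the same combination when it meets the escaped pair.
# The single-quoted Ruby literal keeps the backslashes for the JSON parser.
decoded = JSON.parse('["\udbba\udf59"]').first
puts decoded.codepoints.map { |c| format('U+%04X', c) }.join(' ')
```

Calling `codepoints` on a decoded string is a quick way to check what it really contains when it looks blank in a terminal.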

1 Answer

Answering my own question though most of the credit belongs to @bobince for showing me the way in the comments above.

The answer is that Facebook encodes emoji using the "Google" encoding as seen on this Unicode table.

I have created a Ruby gem called emojivert that can convert from one encoding to another, including from "Google" to "Unified". It is based on another existing project called rails-emoji.

So the failing example above would be fixed by doing:

string = ActiveSupport::JSON.decode('"\udbba\udf59"')
> ""
fixed = Emojivert.google_to_unified(string)
> "💤"
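
If you would rather not pull in a gem, the "normal string replace" suggested in the comments can be sketched roughly as below. This is not how emojivert is implemented; the constant `GOOGLE_TO_UNIFIED` and the method `replace_google_emoji` are hypothetical names, and the table holds only the one pair discussed in this question. A real table would have to be built from the emoji4unicode mapping data.

```ruby
# Hypothetical one-entry mapping table from "Google" private-use code points
# to standard Unicode emoji. Only the pair from this question is included.
GOOGLE_TO_UNIFIED = {
  "\u{FEB59}" => "\u{1F4A4}" # SLEEPING SYMBOL
}.freeze

# Replace each plane-15 private-use character that has a known mapping,
# and leave anything without a mapping untouched.
def replace_google_emoji(text)
  text.gsub(/[\u{F0000}-\u{FFFFD}]/) { |char| GOOGLE_TO_UNIFIED.fetch(char, char) }
end

fixed = replace_google_emoji("zzz \u{FEB59}")
puts fixed.codepoints.map { |c| format('U+%04X', c) }.join(' ')
```

Restricting the pattern to the private-use range keeps ordinary text out of the lookup entirely, so only suspicious characters are ever checked against the table.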
philoye
  • Table link is 404 :/ – Fernando Montoya Sep 05 '18 at 15:57
  • Actually 403 Forbidden, not sure why. I can't track down the original anywhere else, alas. I wonder if Facebook is still using this encoding. – philoye Sep 10 '18 at 10:42
  • @philoye Take a look at [this](https://github.com/musalbas/NoisyTweets/blob/master/static/emoji-data/README.md). Also, based on the last comment to [this question](https://stackoverflow.com/q/50008296/5352605), I don't think they use it anymore. – Samuele Pilleri Nov 15 '18 at 09:55