
I'm currently facing an issue uploading emoji data to BigQuery (using Python).

This is the sample data I'm trying to upload to BQ:

 {"emojiCharts":{"emoji_icon":"\ud83d\udc4d","repost": 4, "doc": 4, "engagement": 0, "reach": 0, "impression": 0}} 
 {"emojiCharts":{"emoji_icon":"\ud83d\udc49","repost": 4, "doc": 4, "engagement": 43, "reach": 722, "impression": 4816}} 
 {"emojiCharts":{"emoji_icon":"\u203c","repost": 4, "doc": 4, "engagement": 0, "reach": 0, "impression": 0}} 
 {"emojiCharts":{"emoji_icon":"\ud83c\udf89","repost": 5, "doc": 5, "engagement": 43, "reach": 829, "impression": 5529}} 
 {"emojiCharts":{"emoji_icon":"\ud83d\ude34","repost": 5, "doc": 5, "engagement": 222, "reach": 420, "impression": 2805}} 
 {"emojiCharts":{"emoji_icon":"\ud83d\ude31","repost": 3, "doc": 3, "engagement": 386, "reach": 2868, "impression": 19122}} 
 {"emojiCharts":{"emoji_icon":"\ud83d\udc4d\ud83c\udffb","repost": 5, "doc": 5, "engagement": 43, "reach": 1064, "impression": 7098}} 
 {"emojiCharts":{"emoji_icon":"\ud83d\ude3b","repost": 3, "doc": 3, "engagement": 93, "reach": 192, "impression": 1283}} 
 {"emojiCharts":{"emoji_icon":"\ud83d\ude2d","repost": 6, "doc": 6, "engagement": 212, "reach": 909, "impression": 6143}} 
 {"emojiCharts":{"emoji_icon":"\ud83e\udd84","repost": 8, "doc": 8, "engagement": 313, "reach": 402, "impression": 2681}} 
 {"emojiCharts":{"emoji_icon":"\ud83d\ude18","repost": 7, "doc": 7, "engagement": 0, "reach": 8454, "impression": 56366}} 
 {"emojiCharts":{"emoji_icon":"\ud83d\ude05","repost": 5, "doc": 5, "engagement": 74, "reach": 1582, "impression": 10550}} 
 {"emojiCharts":{"emoji_icon":"\ud83d\ude04","repost": 5, "doc": 5, "engagement": 73, "reach": 3329, "impression": 22206}}

The issue is that BigQuery cannot display any emoji written as a surrogate pair (such as \ud83d\ude04); it only displays emoji written in the single-escape format (such as \u203c).

Even though the field is a STRING, it displays two black rhombuses. Why can't BQ display the emoji as a string without it being converted to the actual emoji?

Questions:

Is there any way to upload emoji to BigQuery so that they load correctly? (The data will be used in Google Data Studio.)

Should I manually change (hardcode) all emoji codes to acceptable ones, and if so, which format is acceptable?

eyllanesc
  • 1
    The issue is with how the BigQuery UI *displays* the data, not with how BigQuery *stores* the data, is that right? You can check the strings with the `TO_CODE_POINTS` function. – Elliott Brossard Sep 04 '18 at 16:15
  • Check out https://www.charbase.com/1f618-unicode-face-throwing-a-kiss What you want is to convert the javascript escape characters to actual unicode data. Check out https://stackoverflow.com/questions/38147259/how-to-work-with-surrogate-pairs-in-python – numeral Sep 04 '18 at 16:31

2 Answers

2

As user 'numeral' mentions in their comment ("Check out charbase.com/1f618-unicode-face-throwing-a-kiss What you want is to convert the javascript escape characters to actual unicode data."), you need to change the encoding of the emojis so that each one is represented as a single character:

SELECT "\U0001f604 \U0001f4b8"
--   , "\ud83d\udcb8"
--   , "\ud83d\ude04"

The 2nd and 3rd lines fail with an error like `Illegal escape sequence: Unicode value \ud83d is invalid at [2:7]`, but the first line displays correctly in BigQuery and Data Studio.
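To illustrate how the two notations in the query above relate, here is a minimal Python sketch of the standard UTF-16 surrogate-pair arithmetic. The function names `surrogate_pair_to_code_point` and `to_bigquery_escape` are made up for this example and are not part of any library.

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid UTF-16 surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def to_bigquery_escape(high: int, low: int) -> str:
    """Hypothetical helper: format the pair as an 8-digit \\U escape."""
    return "\\U{:08x}".format(surrogate_pair_to_code_point(high, low))


# \ud83d\ude04 is the pair that fails in the query above.
print(to_bigquery_escape(0xD83D, 0xDE04))  # \U0001f604
```

The same calculation maps `\ud83d\udcb8` to `\U0001f4b8`, which is the other escape used in the working first line of the query.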

Additional thoughts about this:

Felipe Hoffa
  • If these emojis are part of a text file (modified JSON), what is the best way to change the encoding of the emojis? –  Sep 05 '18 at 07:08
  • 1
    that's an interesting question for which I don't have the answer - take a look at the related existing questions, or post a new one (probably related to file encoding, and not BigQuery specific) – Felipe Hoffa Sep 05 '18 at 07:12
  • I had a look on Google, here, and in a lot of other places, but not many people work with BigQuery, so I did not find any helpful answers. I'm just worried about posting another question, as I might be banned from asking. Are you sure it will be fine, and that the new question will not be marked as a duplicate if I refer to this one? –  Sep 05 '18 at 07:15
  • I think your question is not related to BigQuery, but how to produce JSON that has the actual emoji unicode, instead of strings that show the escaped sequences. You will need to show how you created these files and the actual files (upload somewhere?) – Felipe Hoffa Sep 05 '18 at 07:25
  • By any chance, could you let me know how you got the emoji to display using the code `\U0001f604`? Mine displays only as text, without the backslash. –  Sep 06 '18 at 06:59
1

Python strings do not treat the "surrogate pair" representation, in which one character is written as two UTF-16 code units, as a single character, and emojis above 0xFFFF rely on it in some languages. For example, the bank emoji (🏦) can be represented by \U0001f3e6 (UTF-32) in Python, while some languages use \ud83c\udfe6. For values at or below 0xFFFF, Python and other languages use the same representation, e.g. \u3020 (〠). To solve the encoding issue, you can convert the emoji characters manually or consider using a library, e.g. https://github.com/hartwork/surrogates, to convert them to UTF-32.

Also, the BigQuery Python client's load_table_from_json had a bug affecting characters whose values are over 0xFFFF, even if you used the correct UTF-32 representation. A new version that fixes it was released a couple of days ago. Ref: https://github.com/googleapis/python-bigquery/releases/tag/v2.24.0
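As a rough sketch of the manual conversion using only the standard library (not the `surrogates` package linked above): Python's `json` module joins escaped surrogate pairs when parsing, and a `str` that still holds separate surrogates can be repaired with a UTF-16 round trip. The helper names `fix_surrogates` and `reencode_line` are invented for this example.

```python
import json


def fix_surrogates(text: str) -> str:
    """Merge surrogate pairs left inside a Python str into real code points."""
    # 'surrogatepass' lets the lone surrogates through the encoder;
    # decoding the UTF-16 bytes then joins each pair into one character.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


def reencode_line(line: str) -> str:
    """Parse one JSON line and re-serialize it with literal (unescaped) emoji."""
    record = json.loads(line)  # the JSON parser joins \ud83d\udc4d into one character
    return json.dumps(record, ensure_ascii=False)


sample = '{"emojiCharts":{"emoji_icon":"\\ud83d\\udc4d","repost": 4}}'
print(reencode_line(sample))
print(len(fix_surrogates("\ud83d\ude04")))  # 1
```

When writing the re-encoded lines back to a file, open it with `encoding="utf-8"` so the literal emoji are stored as UTF-8 rather than the platform default.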

Some references about Bank emoji listing different representation:

Grimmer Kang